On relatively large graphs (n > ~10k) with colocated points, the clique expansion function can get really slow. After cutting it short a few times, I think this for-loop is where things are getting stuck. If that's the bottleneck, it should be pretty straightforward to parallelize with something like joblib.
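A minimal sketch of the joblib idea, assuming the loop body is an independent per-clique expansion (the function name and inputs here are hypothetical, not the library's actual API):

```python
from joblib import Parallel, delayed


def expand_one_clique(members):
    # Hypothetical stand-in for the loop body: build all directed
    # pairs among the colocated members of one clique.
    return [(a, b) for a in members for b in members if a != b]


def expand_cliques_parallel(cliques, n_jobs=-1):
    # Each clique's expansion is independent, so each can run in a
    # separate joblib worker; n_jobs=-1 uses all available cores.
    results = Parallel(n_jobs=n_jobs)(
        delayed(expand_one_clique)(members) for members in cliques
    )
    # Flatten the per-clique edge lists into one edge list.
    return [edge for edges in results for edge in edges]


edges = expand_cliques_parallel([[0, 1, 2], [3, 4]])
```

Whether this pays off depends on clique sizes: many tiny cliques may be dominated by worker overhead, in which case batching cliques per worker would help.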
We moved away from it because (a) hashing the points to establish unique locations was expensive and (b) our policy on sorting was different, but I think we can still use a pandas `join` and `explode` given the new function signature.
The earlier implementation, conceptually:
1. Built a series indexed on unique locations, with values being lists of all the original indices at each location (a `groupby` on the geometries with `agg(list)` on the ids).
2. Exploded the aggregated series.
3. Joined the exploded series back to the adjacency table, letting the join handle the one-to-many expansion for edges entering/leaving cliques.
4. Separately, built the all-pairs links within each clique using a self-join.
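The steps above can be sketched in pandas roughly as follows. The column names, the location encoding, and the toy adjacency table are all illustrative assumptions, not the actual data model:

```python
import pandas as pd

# Toy input: point ids with (possibly colocated) locations.
points = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "loc": ["POINT (0 0)", "POINT (0 0)", "POINT (1 1)", "POINT (1 1)"],
})

# Step 1: series indexed on unique locations, values = list of ids there.
clique_members = points.groupby("loc")["id"].agg(list)

# Step 2: explode the aggregated series, one row per (location, member id).
exploded = clique_members.explode().rename("member").reset_index()

# Step 3: join back to a location-keyed adjacency table; the merge
# performs the one-to-many expansion for edges entering/leaving cliques.
adjacency = pd.DataFrame({"src_loc": ["POINT (0 0)"],
                          "dst_loc": ["POINT (1 1)"]})
edges = (
    adjacency
    .merge(exploded.rename(columns={"loc": "src_loc", "member": "src"}),
           on="src_loc")
    .merge(exploded.rename(columns={"loc": "dst_loc", "member": "dst"}),
           on="dst_loc")
)

# Step 4: all-pairs links within each clique via a self-join on location.
within = exploded.merge(exploded, on="loc", suffixes=("_a", "_b"))
within = within[within["member_a"] != within["member_b"]]
```

This keeps the expansion inside vectorized merges, which is the property that made the earlier approach fast despite the hashing cost mentioned above.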