I'm trying to perform calculations on a large network object to perform some predictions on links appearing between nodes. I am able to do this in serial, but not in parallel using Pythons multiprocessing. The function never seems to return from the parallel implementation looking at my task manager it does not seem to take up a lot of memory or CPU-power either
def jaccard_serial_predictions(G):
"""
Create a ranked list of possible new links based on the Jaccard similarity,
defined as the intersection of nodes divided by the union of nodes
parameters
G: Directed or undirected nx graph
returns
list of linkbunches with the score as an attribute
"""
potential_edges = []
G_undirected = nx.Graph(G)
for non_edge in nx.non_edges(G_undirected):
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
continue
else:
s = (1.0*uv_int)/uv_un
potential_edges.append(non_edge + ({'score': s},))
return potential_edges
def jaccard_prediction(non_edge):
u = set(G.neighbors(non_edge[0]))
v = set(G.neighbors(non_edge[1]))
uv_un = len(u.union(v))
uv_int = len(u.intersection(v))
if uv_int == 0 or uv_un == 0:
return
else:
s = (1.0*uv_int)/uv_un
return non_edge + ({'score': s},)
def jaccard_mp_predictions(G):
"""
Create a ranked list of possible new links based on the Jaccard similarity,
defined as the intersection of nodes divided by the union of nodes
parameters
G: Directed or undirected nx graph
returns
list of linkbunches with the score as an attribute
"""
pool = mp.Pool(processes=4)
G_undirected = nx.Graph(G)
results = pool.map(jaccard_prediction, nx.non_edges(G_undirected))
return results
Calling jaccard_serial_predictions(G) with G being a graph of 95000000 potential edges returns in 4.5 minutes, but jaccard_mp_predictions(G)does not return even after running for half an hour.
G? Some global object which you intend to use from multiple processes? That won't work. Try this approach: changejaccard_serial_predictionsso that it only runsjaccard_predictionmultiple times (sequentially), providing it all needed data without shared objects. Once that works, trymultiprocessing. Also, google about differences between threads and processes - you are not using threads here.if not(uv_int == 0 or uv_un == 0): potential_edges.append(non_edge + ({'score': (1.0*uv_int)/uv_un},))Gis a Graph object which contains the necessary functions to return the neighborhoods of nodes. The way I understand themultiprocessingdocumentation the context is shared between processes and so it should be OK to read from global objects as long as you do not use them for sharing state. See docs, section 17.2.1.2 @memoselyk: The serial implementation only uses about 200 MB when running, so I expect the parallel implementation to be a bit lower than 800 MB. I don't think this is big enough to pose a problem.Gwas that if one process modifies it, others will not see the change. If it is just an immutable object which provides some functions for calculation, that should be fine.