I've got a function that takes a node id of a graph as input, calculates something in the graph (without altering the graph object), and then saves the results to the filesystem. My code looks like this:
...
import multiprocessing as mp

# load the graph file
g = loadGraph(gfile='data/graph.txt')
# load the list of node ids
nodeids = loadSeeds(sfile='data/seeds.txt')

# parallel part of the code
print("entering the parallel part ..")
num_workers = mp.cpu_count()  # 4 on my machine
p = mp.Pool(num_workers)
# _myParallelFunction(nodeid): calculates something for nodeid in g and saves it to a file
p.map(_myParallelFunction, nodeids)
p.close()
...
The problem is that loading the graph into Python takes a lot of memory (about 2 GB; it's a big graph with thousands of nodes). When execution reaches the parallel part of the code (the parallel map call), each process seems to be given its own separate copy of g, and I simply run out of memory on my machine (it has 6 GB RAM and 3 GB swap). Is there a way to give every process the same copy of g, so that only enough memory to hold one copy is required? Any suggestions are appreciated; thanks in advance.
One option is multiprocessing.Manager, which will allow all processes to use one shared graph that's hosted in the Manager process. However, that comes with a large performance penalty on every access to the shared graph, so it may end up not improving performance versus the sequential approach. Alternatively, on POSIX systems you can rely on fork's copy-on-write semantics: since the workers only read g, each child process can share the parent's memory pages instead of receiving a full copy, which also prevents extra copies of nodeids from being created. See this comment and the associated article.
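The copy-on-write approach can be sketched as follows. This is a minimal stand-alone sketch, not the asker's actual code: the adjacency-list dict stands in for loadGraph()'s output, and _degree is a hypothetical placeholder for _myParallelFunction. It explicitly requests the fork start method, which is only available on POSIX systems:

```python
import multiprocessing as mp

# Module-level global: workers created by fork() inherit it via
# copy-on-write, so the graph is never pickled and sent per worker.
g = None

def _degree(nodeid):
    # Read-only access to the inherited graph (a toy adjacency dict here)
    return nodeid, len(g[nodeid])

def compute_degrees(graph, nodeids, workers=2):
    global g
    g = graph  # must be set BEFORE the pool is created, so forked children see it
    ctx = mp.get_context("fork")  # fork start method: POSIX-only
    with ctx.Pool(workers) as pool:
        return dict(pool.map(_degree, nodeids))

if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [0], 2: [0, 1]}
    print(compute_degrees(toy_graph, list(toy_graph)))
```

One caveat: CPython's reference counting writes into object headers, so pages of the graph that the workers touch will still be copied over time; in practice this is usually far less than a full per-process copy.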