I've got a function that takes a node id of a graph as input and calculates something in the graph (without altering the graph object), then saves the results to the filesystem. My code looks like this:

...
import multiprocessing as mp

# load the graph (it is only read from here on)
g = loadGraph(gfile='data/graph.txt')
# load the list of node ids to process
nodeids = loadSeeds(sfile='data/seeds.txt')

# parallel part of the code
print("entering the parallel part ..")
num_workers = mp.cpu_count()  # 4 on my machine
p = mp.Pool(num_workers)
# _myParallelFunction(nodeid): calculate something for nodeid in g and save it to a file
p.map(_myParallelFunction, nodeids)
p.close()
p.join()
...

The problem is that when I load the graph into Python it takes a lot of memory (about 2 GB; it's a big graph with thousands of nodes), but when execution reaches the parallel part (the p.map() call) each process seems to get its own separate copy of g, and I simply run out of memory on my machine (6 GB of RAM and 3 GB of swap). Is there a way to give every process access to the same single copy of g, so that only enough memory for one copy is required? Any suggestions are appreciated; thanks in advance.

  • I don't know enough about this to give you a solid answer, but one suggestion would be to divide the graph into smaller sections and then use the processes in a "divide and conquer" fashion. Commented May 29, 2015 at 20:14
  • You could put the graph into a custom multiprocessing.Manager, which lets all processes use one shared graph hosted in the Manager process. However, that comes with a large performance penalty every time you access the shared graph, so it may end up no faster than the sequential approach (a sketch follows these comments). Commented May 31, 2015 at 23:04
  • Are you running this on Windows? If you were running this on Linux, copy-on-write (CoW) should ideally kick in and prevent multiple copies of g from being created; see this comment and the associated article. A sketch of that approach also follows below. Commented May 31, 2015 at 23:59
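
A minimal sketch of the Manager idea from the second comment, assuming the default fork start method; SharedGraph and its neighbors() accessor are hypothetical stand-ins, and loadGraph/loadSeeds are the helpers from the question:

import multiprocessing as mp
from multiprocessing.managers import BaseManager

class SharedGraph:
    # Lives inside the manager's server process, so only one copy of
    # the graph exists no matter how many workers there are.
    def __init__(self):
        self.g = loadGraph(gfile='data/graph.txt')

    def neighbors(self, nodeid):
        # Hypothetical accessor; adapt to the real graph object.
        return self.g[nodeid]

class GraphManager(BaseManager):
    pass

GraphManager.register('SharedGraph', SharedGraph)

def _myParallelFunction(args):
    graph_proxy, nodeid = args
    nbrs = graph_proxy.neighbors(nodeid)  # one IPC round trip per call
    # ... calculate something for nodeid and save it to a file ...

if __name__ == '__main__':
    manager = GraphManager()
    manager.start()
    graph = manager.SharedGraph()  # proxy object; picklable, so map can send it
    nodeids = loadSeeds(sfile='data/seeds.txt')
    with mp.Pool(mp.cpu_count()) as p:
        p.map(_myParallelFunction, [(graph, n) for n in nodeids])

As the comment warns, every neighbors() call crosses a process boundary, so this trades memory for speed.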
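
And a minimal sketch of the copy-on-write route from the third comment, assuming Linux and the default fork start method: load g at module level and create the Pool only afterwards, so the forked workers inherit the already-loaded graph instead of receiving copies (computeSomething and saveToFile are hypothetical stand-ins for the question's per-node work):

import multiprocessing as mp

# Loaded once in the parent; forked workers see these pages for free
# as long as they only read from g.
g = loadGraph(gfile='data/graph.txt')
nodeids = loadSeeds(sfile='data/seeds.txt')

def _myParallelFunction(nodeid):
    # Reads the module-level g; writing to it would trigger page
    # copies and bring the memory problem back.
    result = computeSomething(g, nodeid)  # hypothetical compute step
    saveToFile(nodeid, result)            # hypothetical save step

if __name__ == '__main__':
    with mp.Pool(mp.cpu_count()) as p:
        p.map(_myParallelFunction, nodeids)

One caveat: CPython's reference counting writes into object headers, so pages holding graph objects are gradually copied even under read-only access; the sharing is approximate rather than perfect.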

2 Answers


If dividing the graph into smaller parts does not work, you may be able to find a solution using a multiprocessing.Manager (as suggested in the comments) or multiprocessing.sharedctypes, depending on what kind of object your graph is. For example:
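
A minimal sketch of the sharedctypes route, assuming the graph can be flattened into a numeric edge list and the default fork start method (sharedctypes blocks are shared by inheritance); flattenGraph and the edge layout are hypothetical:

import multiprocessing as mp
from multiprocessing import sharedctypes

def init_worker(shared_edges):
    # Runs once in each worker; stashes the shared block in a module
    # global so tasks can read it without any per-task pickling.
    global edges
    edges = shared_edges

def _myParallelFunction(nodeid):
    # Every process reads the same shared memory block.
    degree = sum(1 for v in edges if v == nodeid)
    # ... calculate something for nodeid and save it to a file ...

if __name__ == '__main__':
    g = loadGraph(gfile='data/graph.txt')
    flat = flattenGraph(g)  # hypothetical: edges as a flat list of ints
    shared_edges = sharedctypes.RawArray('l', flat)
    nodeids = loadSeeds(sfile='data/seeds.txt')
    with mp.Pool(mp.cpu_count(), initializer=init_worker,
                 initargs=(shared_edges,)) as p:
        p.map(_myParallelFunction, nodeids)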




Your comment indicates that you are processing a single node at a time:

# _myParallelFunction(nodeid): calculate something for nodeid in g and save it to a file

I would create a generator function that yields a single node id from the seeds file each time it is advanced, and pass that generator to the p.map() call instead of the entire nodeids list.

1 Comment

multiprocessing.Pool.map will turn the generator into a list before processing it. You have to use multiprocessing.Pool.imap to avoid that.
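
Putting the answer and the comment together, a minimal sketch; node_ids() is a hypothetical replacement for loadSeeds() that reads one id per line, and the chunksize value is just an illustrative tuning knob:

import multiprocessing as mp

def node_ids(sfile):
    # Generator: yields one node id at a time instead of building
    # the whole list in memory.
    with open(sfile) as f:
        for line in f:
            yield line.strip()

def _myParallelFunction(nodeid):
    # Stand-in for the real function from the question: calculate
    # something for nodeid in g and save it to a file.
    pass

if __name__ == '__main__':
    with mp.Pool(mp.cpu_count()) as p:
        # imap consumes the generator lazily; map would first turn it
        # into a list, defeating the purpose.
        for _ in p.imap(_myParallelFunction, node_ids('data/seeds.txt'),
                        chunksize=64):
            pass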
