
I have a very large (read only) array of data that I want to be processed by multiple processes in parallel.

I like the Pool.map function and would like to use it to calculate functions on that data in parallel.

I saw that one can use the Value or Array class to share memory between processes. But when I try this with Pool.map I get RuntimeError: 'SynchronizedString objects should only be shared between processes through inheritance':

Here is a simplified example of what I am trying to do:

from sys import stdin
from multiprocessing import Pool, Array

def count_it( arr, key ):
  count = 0
  for c in arr:
    if c == key:
      count += 1
  return count

if __name__ == '__main__':
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  # want to share it using shared memory
  toShare = Array('c', testData)

  # this works
  print count_it( toShare, "a" )

  pool = Pool()

  # RuntimeError here
  print pool.map( count_it, [(toShare,key) for key in ["a", "b", "s", "d"]] )

Can anyone tell me what I am doing wrong here?

So what I would like to do is pass information about a newly allocated shared memory array to the processes after they have been created in the process pool.

Comments

  • Unfortunately that's not possible. The recommended way according to the multiprocessing documentation is to use inheritance (on fork platforms). For read-only data as you have here one would normally use a global, but a shared Array can be used for read/write communication. Forking is cheap, so you can recreate the Pool whenever you receive the data, then close it afterwards. Unfortunately, on Windows this isn't possible: the workaround is to use a shared memory Array (even in the read-only case), but this can only be passed to subprocesses at process creation (I imagine they need to be added to the access list for the shared memory segment, and that this logic isn't implemented except at subprocess startup). You can pass the shared data array at Pool start-up as I showed, or to a Process in a similar way. You can't pass a shared memory Array to an open Pool; you have to create the Pool after the memory. Easy ways around this include allocating a maximum-size buffer, or just allocating the array when you know the required size before starting the Pool. If you keep your global variables down, Pool shouldn't be too expensive on Windows either: global variables are automatically pickled and sent to the subprocesses, which is why my suggestion to make one buffer of sufficient size at the start (where hopefully your number of global variables is small), then create the Pool, is better. I took the time to understand and solve your problem in good faith, before you edited your question, so while I understand if you want to let it run, I hope at the end you will consider accepting my answer if nothing substantially different or better comes along.
  • I had a closer look at the source code: the information about the shared memory can be pickled (needed to get info about it over to the client process on Windows), but that code has an assert that it only runs during process spawning. I wonder why that is.
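As a side note for readers on newer Python: since Python 3.8 the standard library's multiprocessing.shared_memory module supports exactly this, because any process can attach to a named segment at any time, and the segment's name is an ordinary picklable string. A minimal sketch, assuming Python 3.8+:

from multiprocessing import Pool, shared_memory

def count_it(args):
    shm_name, key = args
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, no copy
    try:
        # the segment may be rounded up and zero-padded; NULs don't affect
        # counts of letter keys
        return key, bytes(shm.buf).count(key.encode())
    finally:
        shm.close()  # detach; the parent still owns the segment

if __name__ == '__main__':
    pool = Pool()  # pool is created first, before the data exists
    data = b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data  # filled after the pool was created
    try:
        print(pool.map(count_it, [(shm.name, k) for k in ["a", "b", "s", "d"]]))
    finally:
        pool.close(); pool.join()
        shm.close(); shm.unlink()  # release the segment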

5 Answers


Trying again as I just saw the bounty ;)

Basically I think the error message means what it says: multiprocessing shared memory Arrays can't be passed as arguments (by pickling). It doesn't make sense to serialise the data - the point is that the data is shared memory. So you have to make the shared array global. I think it's neater to put it as an attribute of a module, as in my first answer, but just leaving it as a global variable in your example also works well. Taking on board your point of not wanting to set the data before the fork, here is a modified example. If you wanted to have more than one possible shared array (and that's why you wanted to pass toShare as an argument) you could similarly make a global list of shared arrays, and just pass the index to count_it (the loop would become for c in toShare[i]:); there is a sketch of that variant at the end of this answer.

from sys import stdin
from multiprocessing import Pool, Array, Process

def count_it( key ):
  count = 0
  for c in toShare:
    if c == key:
      count += 1
  return count

if __name__ == '__main__':
  # allocate shared array - want lock=False in this case since we 
  # aren't writing to it and want to allow multiple processes to access
  # at the same time - I think with lock=True there would be little or 
  # no speedup
  maxLength = 50
  toShare = Array('c', maxLength, lock=False)

  # fork
  pool = Pool()

  # can set data after fork
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  if len(testData) > maxLength:
      raise ValueError, "Shared array too small to hold data"
  toShare[:len(testData)] = testData

  print pool.map( count_it, ["a", "b", "s", "d"] )

EDIT: The above doesn't work on Windows, because Windows doesn't fork. However, the below does work on Windows, still using Pool, so I think this is the closest to what you want:

from sys import stdin
from multiprocessing import Pool, Array, Process
import mymodule

def count_it( key ):
  count = 0
  for c in mymodule.toShare:
    if c == key:
      count += 1
  return count

def initProcess(share):
  mymodule.toShare = share

if __name__ == '__main__':
  # allocate shared array - want lock=False in this case since we 
  # aren't writing to it and want to allow multiple processes to access
  # at the same time - I think with lock=True there would be little or 
  # no speedup
  maxLength = 50
  toShare = Array('c', maxLength, lock=False)

  # fork
  pool = Pool(initializer=initProcess,initargs=(toShare,))

  # can set data after fork
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  if len(testData) > maxLength:
      raise ValueError, "Shared array too small to hold data"
  toShare[:len(testData)] = testData

  print pool.map( count_it, ["a", "b", "s", "d"] )

Not sure why map won't pickle the array but Process and Pool will - I think perhaps it has to be transferred at the point of the subprocess initialization on Windows. Note that the data is still set after the fork though.
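For completeness, here is a minimal sketch of the "global list of shared arrays plus an index" variant mentioned at the top (Python 3 syntax; the names and sizes are illustrative, and like the first example it is fork-only):

from multiprocessing import Pool, Array

# two preallocated shared buffers, selected by an index in the task tuple
shared_arrays = [Array('c', 50, lock=False) for _ in range(2)]

def count_it(args):
    i, key = args                      # the index picks the shared array
    return key, shared_arrays[i][:].count(key)

if __name__ == '__main__':
    pool = Pool()                      # fork happens here
    shared_arrays[0][:5] = b"aabba"    # set after the fork; the children
    shared_arrays[1][:5] = b"bbbaa"    # see it through shared memory
    print(pool.map(count_it, [(0, b"a"), (1, b"b")]))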


9 Comments

Even on platforms with fork you cannot insert new shared data into toShare after the fork, since each process will have its own independent copy at that point.
So the real problem seems to be how we can pickle the information about an Array so it can be sent to, and connected from, the other process.
@James - no, that's not right. The array has to be set up before the fork, but then it is shared memory that can be changed, with changes visible across all children. Look at the example - I put the data into the array after the fork (which occurs when Pool() is instantiated). That data could be obtained at run time, after the fork, and as long as it fits into the preallocated shared memory segment it can be copied there and seen from all children.
You can pickle the Array, but not using Pool.
Edited to add a working Windows version, using only Pool (by passing the shared array as an initialization parameter).

If you're seeing:

RuntimeError: Synchronized objects should only be shared between processes through inheritance

Consider using multiprocessing.Manager, as it doesn't have this limitation. A Manager works here because it runs in a separate process altogether and hands out picklable proxies.

import ctypes
import multiprocessing

# Put this in a method or function, otherwise it will run on import from each module:
manager = multiprocessing.Manager()
counter = manager.Value(ctypes.c_ulonglong, 0)
counter_lock = manager.Lock()  # pylint: disable=no-member

with counter_lock:
    counter.value = count = counter.value + 1

Alternatively, consider Python 3.13+ with the GIL disabled: free-threaded CPython lets threads share memory implicitly, so no pickling is needed at all. Refer to Free-threaded CPython. Note, however, that it is slower per thread.
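For the original counting problem, a minimal sketch of the Manager approach with a Pool might look like this (Python 3 syntax; the wiring is my own, but unlike a raw Array a manager proxy pickles fine as an ordinary argument):

import multiprocessing

def count_it(args):
    data_proxy, key = args             # proxies can be pickled as arguments
    # every .value access is an IPC round-trip to the manager process
    return key, data_proxy.value.count(key)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    data = manager.Value(str, "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf")
    with multiprocessing.Pool() as pool:
        print(pool.map(count_it, [(data, k) for k in ["a", "b", "s", "d"]]))

Note that this trades the inheritance restriction for IPC overhead: each access copies the data out of the manager process, so it suits small shared state better than a very large read-only array.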

2 Comments

this was the only suggestion I actually got working when using a multiprocessing.Pool ... and I did not need the explicit treatment of manager.Lock
@raphael Are you asserting that the Value has an implicit lock? The explicit lock is there to prevent a race condition, and thereby prevent erroneous counts when updating the count from multiple processes.

The problem I see is that Pool doesn't support pickling shared data through its argument list. That's what the error message means by "objects should only be shared between processes through inheritance". The shared data needs to be inherited, i.e., global if you want to share it using the Pool class.

If you need to pass them explicitly, you may have to use multiprocessing.Process. Here is your reworked example:

from multiprocessing import Process, Array, Queue

def count_it( q, arr, key ):
  count = 0
  for c in arr:
    if c == key:
      count += 1
  q.put((key, count))

if __name__ == '__main__':
  testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
  # want to share it using shared memory
  toShare = Array('c', testData)

  q = Queue()
  keys = ['a', 'b', 's', 'd']
  workers = [Process(target=count_it, args = (q, toShare, key))
    for key in keys]

  for p in workers:
    p.start()
  for p in workers:
    p.join()
  while not q.empty():
    print q.get(),

Output: ('s', 9) ('a', 2) ('b', 3) ('d', 12)

The ordering of elements of the queue may vary.

To make this more generic and similar to Pool, you could create a fixed number N of processes, split the list of keys into N pieces, and then use a wrapper function as the Process target, which will call count_it for each key in the list it is passed, like:

def wrapper( q, arr, keys ):
  for k in keys:
    count_it(q, arr, k)
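A hedged sketch of wiring that up (Python 3 syntax; the chunks helper and the choice of N are my own illustration):

from multiprocessing import Process, Array, Queue

def count_it(q, arr, key):
    count = 0
    for c in arr:                      # c is a length-1 bytes object
        if c == key:
            count += 1
    q.put((key, count))

def wrapper(q, arr, keys):
    for k in keys:                     # one process handles a chunk of keys
        count_it(q, arr, k)

def chunks(seq, n):
    k = (len(seq) + n - 1) // n        # ceil(len/n) keys per piece
    return [seq[i:i + k] for i in range(0, len(seq), k)]

if __name__ == '__main__':
    toShare = Array('c', b"abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf")
    q = Queue()
    N = 2                              # fixed number of worker processes
    workers = [Process(target=wrapper, args=(q, toShare, piece))
               for piece in chunks([b"a", b"b", b"s", b"d"], N)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    while not q.empty():
        print(q.get())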

Comments


If the data is read only, just make it a variable in a module before the fork caused by creating the Pool. Then all the child processes should be able to access it, and it won't be copied provided you don't write to it.

from multiprocessing import Pool
import myglobals # anything (empty .py file)

myglobals.data = []

def count_it( key ):
    count = 0
    for c in myglobals.data:
        if c == key:
            count += 1
    return count

if __name__ == '__main__':
    myglobals.data = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"

    pool = Pool()
    print pool.map( count_it, ["a", "b", "s", "d"] )

If you do want to try to use Array though, you could try it with the lock=False keyword argument (it is True by default).

3 Comments

I do not believe the use of globals is safe, and it certainly would not work on Windows, where the processes are not forked.
How is it not safe? If you only need read access to the data it is fine. If you write to it by mistake, then the modified page will be copied-on-write for the child process so nothing bad will happen (wouldn't interfere with other processes for example). You're right it won't work on windows though...
You are right that it is safe on fork based platforms. But I would like to know if there is a shared memory based way to share large amounts of data after the process pool is created.
from multiprocessing import Pool, Array

testData = "abcabcs bsdfsdf gdfg dffdgdfg sdfsdfsd sdfdsfsdf"
toShare = Array('c', testData.encode(), lock=False)

def count_it( key ):
  count = 0
  for c in toShare:
    if c.decode() == key:
      count += 1
  return key,count

if __name__ == '__main__':

  pool = Pool()

  print(pool.map( count_it, ["a", "b", "s", "d"] ))
#
# [('a', 2), ('b', 3), ('s', 9), ('d', 12)]

Comments
