Everywhere I see shared memory implementations for python (e.g. in multiprocessing), creating shared memory always allocates new memory. Is there a way to create a shared memory object and have it refer to existing memory? The purpose would be to pre-initialize the data values, or rather, to avoid having to copy into the new shared memory if we already have, say, an array in hand. In my experience, allocating a large shared array is much faster than copying values into it.
2 Answers
The short answer is no.
I'm the author of the Python extensions posix_ipc and sysv_ipc. Like Python's multiprocessing module from the standard library, my modules are just wrappers around facilities provided by the operating system, so what you really need to know is what the OS allows when allocating shared memory. That differs a little for SysV IPC and POSIX IPC, but in this context the difference isn't really important. (I think multiprocessing uses POSIX IPC where possible.)
For SysV IPC, the OS-level call to allocate shared memory is shmget(). You can see on that call's man page that it doesn't accept a pointer to existing memory; it always allocates new memory for you. Ditto for the POSIX IPC version of the same call (shm_open()). POSIX IPC is interesting because it implements shared memory to look like a memory mapped file, so it behaves a bit differently from SysV IPC.
Regardless, whether one is calling from Python or C, there's no option to ask the operating system to turn an existing piece of private memory into shared memory.
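To make this concrete, here is a minimal sketch of the only two operations the standard library's multiprocessing.shared_memory wrapper actually offers (the name "demo" is just for illustration):
from multiprocessing import shared_memory
# The API mirrors shm_open() and offers exactly two operations:
# 1. create=True -- the OS allocates brand-new memory of the requested size
shm_new = shared_memory.SharedMemory(name="demo", create=True, size=4096)
# 2. create=False -- attach to a block some process already created
shm_attached = shared_memory.SharedMemory(name="demo", create=False)
# There is no third form that takes a pointer to existing private memory
# and registers it as shared -- that call simply doesn't exist.
shm_attached.close()
shm_new.close()
shm_new.unlink()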
If you think about it, you'll see why. Suppose you could pass a pointer to a chunk of private memory to shmget() or shm_open(). Now the operating system is stuck with the job of keeping that memory where it is until all sharing processes are done with it. What if it's in the middle of your stack? Suddenly this big chunk of your stack can't be allocated because other processes are using it. It also means that when your process dies, the OS can't release all its memory because some of it is now being used by other processes.
In short, what you're asking for from Python isn't offered because the underlying OS calls don't allow it, and the underlying OS calls don't allow it (probably) because it would be really messy for the OS.
Actually, simply creating the shared memory block first and then assigning the data into it works well. You pay for a single copy when initializing the block; after that, every attached process reads the same memory and nothing is duplicated.
Server side: (sets the data)
import numpy as np
from multiprocessing import shared_memory
my_data = np.arange(1000, dtype=np.int32) # example data
# create the shared memory
shm = shared_memory.SharedMemory(name="foodata", create=True, size=my_data.nbytes)
# copy the data into the shared buffer (a one-time copy;
# every process that attaches afterwards sees this same memory)
shm.buf[:] = my_data.tobytes()
# ... (other activities; keep the server process alive so the
#      shared memory segment remains available) ...
If you want to assign only part of the data to the buffer, slice it by the corresponding number of bytes:
shm.buf[:500 * np.int32().itemsize] = my_data[:500].tobytes()
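Note that tobytes() itself produces an intermediate copy of the array. A sketch of a way to skip that extra copy, using only documented numpy behavior, is to wrap the shared buffer in an ndarray and assign into it:
# view the shared buffer as an array, then assign: one copy, straight
# into shared memory, with no intermediate bytes object
shm_arr = np.ndarray(my_data.shape, dtype=my_data.dtype, buffer=shm.buf)
shm_arr[:] = my_data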
Client side: (receives the data)
import numpy as np
from multiprocessing import shared_memory
shm = shared_memory.SharedMemory(name="foodata", create=False)
# choose how many elements to view; it can be fewer than were allocated
size = 750
my_client_data = np.ndarray((size,), dtype=np.int32, buffer=shm.buf)
# ... (do other stuff with my_client_data) ...
# when done, drop the ndarray view first: close() raises a BufferError
# while anything still references shm.buf
del my_client_data
shm.close()   # detach this process from the segment
shm.unlink()  # destroy the segment; call this from exactly one process (usually the creator)
In this case the memory is mapped between processes, not copied, so access remains efficient. It is important that both sides use the same dtype (np.int32, np.float32, etc.); otherwise the client will reinterpret the raw bytes as the wrong type.
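A quick sanity check that the client's array really is a view rather than a copy (illustrative; run it while the segment is still open):
my_client_data[0] = -1
# re-reading through the raw shared buffer shows the same change,
# because the ndarray is a view onto it, not a copy
print(bytes(shm.buf[:4]))  # b'\xff\xff\xff\xff' (int32 -1)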