
My Python code uses multiprocessing. A dataframe is created in shared memory in the parent program, let's say ns.df, where ns is the namespace manager instance.

The multiple processes need to add rows of data to this ns.df, so that once the processes terminate, all changes are reflected in the parent program.

The processes are not required to interact with each other; there is no sharing or passing of data between them. The data written by each process is exclusive to that process and independent of the others.

Would doing a simple

ns.df = pd.concat([ns.df, tempdf], axis=0, sort=True)

from within each of the child processes suffice in achieving the desired result? Here tempdf is the dataframe with the data to be added to ns.df.

How can I achieve this in Python? Any help would be appreciated.

  • @mrzo thank you for your reply. I am working on the solution you provided and got stuck with another small thing! :) Is there a way to turn the following piece of code into a list comprehension? `child_id = 0` `for item in finalargpool: item.append(child_id); child_id += 1` Commented Apr 20, 2020 at 10:18
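
Regarding the comment above, a minimal sketch, assuming finalargpool is a list of lists: the explicit counter can be replaced by enumerate, and, if a comprehension is really wanted, the list can be rebuilt instead of mutated in place.

# enumerate supplies the counter, so no manual child_id bookkeeping is needed
for child_id, item in enumerate(finalargpool):
    item.append(child_id)

# equivalent as a list comprehension: builds a new list rather than appending in place
finalargpool = [item + [child_id] for child_id, item in enumerate(finalargpool)]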

1 Answer


I would not add the rows to ns.df inside each child process individually but collect them after each child process has terminated. Look at this example:

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def child_process(child_id):
    # Each child builds and returns its own dataframe
    return pd.DataFrame({"column": [f"child_{child_id}"]})

df_main = pd.DataFrame({"column": ["parent"]})

with ProcessPoolExecutor(max_workers=4) as pool:
    # Collect the dataframes returned by the children in the parent process
    child_dfs = list(pool.map(child_process, range(5)))

# Concatenate everything once, back in the parent
df_all = pd.concat([df_main, *child_dfs])
print(df_all)

Output

    column
0   parent
0  child_0
0  child_1
0  child_2
0  child_3
0  child_4

If you wanted to change ns.df inside each child process instead, it would actually have to be a shared object (e.g. a manager namespace), and the concurrent updates would need to be synchronized.
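
For completeness, here is a minimal sketch of what that would look like, assuming a multiprocessing.Manager Namespace and a manager Lock (the names ns, lock and make_rows are illustrative, not from the question): each child re-reads ns.df, concatenates its own rows, and writes the result back while holding the lock so that concurrent updates are not lost.

import multiprocessing as mp

import pandas as pd

def make_rows(child_id):
    # Illustrative helper: the rows this particular child wants to add
    return pd.DataFrame({"column": [f"child_{child_id}"]})

def worker(ns, lock, child_id):
    tempdf = make_rows(child_id)
    # The read-modify-write of ns.df must happen under the lock,
    # otherwise two children can overwrite each other's updates
    with lock:
        ns.df = pd.concat([ns.df, tempdf], axis=0, sort=True)

if __name__ == "__main__":
    with mp.Manager() as manager:
        ns = manager.Namespace()
        lock = manager.Lock()
        ns.df = pd.DataFrame({"column": ["parent"]})

        procs = [mp.Process(target=worker, args=(ns, lock, i)) for i in range(5)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        # ns.df now contains the parent row plus one row per child
        print(ns.df)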

Caveat: if the dataframes returned by the child processes are very big, using multiprocessing might add significant overhead, since the dataframes have to be pickled in the children and unpickled in the main process. Depending on what the actual child process does (e.g. a lot of I/O, or C functions that release the GIL), it might be better to use multithreading instead.
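
As a rough sketch of that alternative, the example above can be switched to threads just by swapping the executor; with threads the returned dataframes are not pickled at all (child_task here is an illustrative stand-in for an I/O-bound or GIL-releasing workload):

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def child_task(child_id):
    # Illustrative stand-in for an I/O-bound or GIL-releasing workload
    return pd.DataFrame({"column": [f"child_{child_id}"]})

df_main = pd.DataFrame({"column": ["parent"]})

with ThreadPoolExecutor(max_workers=4) as pool:
    # Threads share the parent's memory, so the results are not pickled
    child_dfs = list(pool.map(child_task, range(5)))

df_all = pd.concat([df_main, *child_dfs])
print(df_all)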
