I know a duplicate question exists, but after such a long time, are there any new methods to achieve the same goal?
import pandas as pd
from joblib import Parallel, delayed

def process(name):
    temp_df = df[name]  # select a single column
    return temp_df.apply(another_function)

results = Parallel(n_jobs=-2)(delayed(process)(name) for name in df.columns)
The dataframe seems to be copied into each worker process, which is not feasible for a large dataframe. Is there any method or package to fix this?
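One common workaround, sketched below under the assumption that the per-column work is independent: pass each column (a Series) into the worker explicitly instead of referencing the dataframe from inside the function, so only that column is pickled and shipped to the worker rather than the whole dataframe. The doubling function here is a placeholder for `another_function`.

```python
import pandas as pd
from joblib import Parallel, delayed

def process(col):
    # Receives a single column (Series); only this column is
    # serialized and sent to the worker, not the whole dataframe.
    return col.apply(lambda x: x * 2)  # placeholder for another_function

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Pass df[name] as an argument; the workers never see the full df.
results = Parallel(n_jobs=2)(delayed(process)(df[name]) for name in df.columns)
```

This reduces the per-task payload from the whole dataframe to one column, but each column is still copied once; it does not give true zero-copy sharing.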
Shared-memory approaches only work for uniform numeric data (np.intXX and np.floatXX types), not for non-uniform dataframes or ones containing strings/objects. They also force you to use NumPy arrays rather than dataframes, which is inconvenient. CPython multithreading is fundamentally limited by the GIL.
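To illustrate the numeric-only limitation, here is a minimal sketch using the standard-library `multiprocessing.shared_memory` module (Python 3.8+): a fixed-width float array can be placed in shared memory and reattached by name without copying, precisely because its buffer layout is uniform; object/string columns have no such fixed layout.

```python
import numpy as np
from multiprocessing import shared_memory

# Shared memory only works for fixed-width numeric dtypes
# (np.intXX / np.floatXX); object/string columns cannot be mapped.
data = np.arange(12, dtype=np.float64).reshape(3, 4)

# Copy the data into a shared-memory block once.
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
view[:] = data

# A worker process would reattach by name, with no further copy.
attached = shared_memory.SharedMemory(name=shm.name)
arr = np.ndarray(data.shape, dtype=data.dtype, buffer=attached.buf)
total = float(arr.sum())  # use the data before releasing the buffer

del arr, view
attached.close()
shm.close()
shm.unlink()
```

Note that you work with raw NumPy arrays here; any dataframe wrapper must be rebuilt around the shared buffer in each process.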