My script is as follows:

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

for year in range(2020, 2015, -1):
    df = pd.merge(df, make_df(year), on=['key'], how='left')

The final df will be:

  key   A  2020  2019  2018  2017  2016
0  K0  A0  2020  2019  2018  2017  2016
1  K1  A1  2021  2020  2019  2018  2017
2  K2  A2  2022  2021  2020  2019  2018
3  K3  A3  2023  2022  2021  2020  2019

My actual make_df(year) is much more complex and takes too much time.

How can I parallelize the for-loop for year in range(2020, 2015, -1): and shorten the processing time?

  • You can try the standard modules threading and multiprocessing, or external modules like ray, joblib, or pyspark, which may have functions for DataFrames. There is probably even a module whose name I don't remember - pandas-??? - which can add multiprocessing to DataFrames. Commented Nov 23, 2021 at 7:02
  • Thank you for your comment. I have tried some modules like multiprocessing and dask, but failed to use them. I could not find any document that explains how to use them in detail. Everything I found was about multiprocessing within ONE dataframe, not about joining MULTIPLE dataframes into one. Any document you recommend? Commented Nov 23, 2021 at 9:15
  • You can generate the new data in separate threads/processes, but you then have to join the results in the main process (see the sketch after these comments). Commented Nov 23, 2021 at 9:28
  • Another idea: maybe it would be faster to send the data to a Google Colab server, run the code there, and download the result :) Commented Nov 23, 2021 at 9:29
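A minimal sketch of the approach suggested in these comments, using joblib (choosing joblib here is an assumption; the other suggested modules work similarly): build the per-year frames in parallel worker processes, then merge them back in the main process.

import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    return pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})

# Build the per-year frames in parallel; n_jobs=-1 uses all available cores.
df_list = Parallel(n_jobs=-1)(delayed(make_df)(year) for year in range(2020, 2015, -1))

# Merge the results sequentially in the main process.
for year_df in df_list:
    df = pd.merge(df, year_df, on=['key'], how='left')
print(df)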

1 Answer

Edit: using multiprocessing instead of threading

After reading your comments, it seems that you want to run your function in different processes (in parallel):

import multiprocessing
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})
year_start = 2020
year_stop = 2015
year_range = range(year_start, year_stop, -1)

def make_df(year):
    # The 'key' column is dropped here: the per-year frames are joined on the index instead.
    df = pd.DataFrame({str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

if __name__ == '__main__':
    # This guard is required on platforms that spawn worker processes (e.g. Windows);
    # without it, creating the Pool raises an error.
    pool = multiprocessing.Pool(year_start - year_stop)  # one worker per year
    df_list = pool.map(func=make_df, iterable=year_range)
    pool.close()
    pool.join()

    # df.join accepts a list of DataFrames and joins them all on the index.
    df = df.join(df_list)
    print(df)
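Note that pool.map returns its results in the order of year_range, so the columns appear in the same descending order (2020 down to 2016) as in the original sequential loop.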

7 Comments

I don't know what your data actually looks like. You can still use merge with the key in a loop over df_list (see the sketch after these comments). It seemed unnecessary in your example, since you always merged on the same column, so I initially replaced it with join (and removed the key column from the function).
Thanks a lot. Your script works with some minor revisions regarding join. However, I don't know why, but it doubles the time. It certainly uses more CPU (%) while running. There seems to be something peculiar in my code.
It must be because of the Python GIL. I've updated my answer to use multiprocessing. Let me know if you see any improvement!
Thank you Tranbi. There is an error when doing pool = multiprocessing.Pool(year_start - year_stop). I have been working on it but cannot find a solution. Is there anything I'm missing?
Did you define year_start and year_stop as in my code? The first one has to be greater, since it defines the size of the pool. Did you import the multiprocessing module?
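A minimal sketch of the keyed variant mentioned in the first comment above (assuming make_df keeps the 'key' column, as in the question): merge each pooled result on 'key' in the main process instead of joining on the index.

import multiprocessing

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    # Keep the 'key' column so the results can be merged on it.
    return pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})

if __name__ == '__main__':
    with multiprocessing.Pool(5) as pool:
        df_list = pool.map(make_df, range(2020, 2015, -1))
    # Merge each per-year frame on 'key' in the main process.
    for year_df in df_list:
        df = pd.merge(df, year_df, on=['key'], how='left')
    print(df)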
