My script is as follows:

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

for year in range(2020, 2015, -1):
    df = pd.merge(df, make_df(year), on=['key'], how='left')

The final df will be:

  key   A  2020  2019  2018  2017  2016
0  K0  A0  2020  2019  2018  2017  2016
1  K1  A1  2021  2020  2019  2018  2017
2  K2  A2  2022  2021  2020  2019  2018
3  K3  A3  2023  2022  2021  2020  2019

My actual make_df(year) is much more complex and takes too much time.

How can I parallelize the for-loop for year in range(2020, 2015, -1): and shorten the processing time?

  • You can try the standard modules threading and multiprocessing, or external modules like ray, joblib, or pyspark, which may have functions for DataFrames. There is probably even a module whose name I don't remember - pandas-??? - which can add multiprocessing to DataFrames. Commented Nov 23, 2021 at 7:02
  • Thank you for your comment. I have tried some modules like multiprocessing and dask, but failed to use them. I could not find any document that explains how to use them in detail. Everything I found was about multiprocessing within ONE dataframe, not about joining MULTIPLE dataframes into one. Any document you recommend? Commented Nov 23, 2021 at 9:15
  • You can generate the new data in separate threads/processes, but you then have to join the results in the main process (see the sketch after these comments). Commented Nov 23, 2021 at 9:28
  • Another idea: maybe it would be faster to send the data to a Google Colab server, run the code there, and download the result :) Commented Nov 23, 2021 at 9:29
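A minimal sketch of the approach suggested in these comments, using joblib (choosing joblib here is an assumption; the other suggested modules work similarly): build the per-year frames in parallel worker processes, then merge them back in the main process.

import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    return pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})

# Build the per-year frames in parallel; n_jobs=-1 uses all available cores.
df_list = Parallel(n_jobs=-1)(delayed(make_df)(year) for year in range(2020, 2015, -1))

# Merge the results sequentially in the main process.
for year_df in df_list:
    df = pd.merge(df, year_df, on=['key'], how='left')
print(df)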

1 Answer

Edit: using multiprocessing instead of threading

After reading your comments, it seems that you want to run your function in different processes (in parallel):

import multiprocessing
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})
year_start = 2020
year_stop = 2015
year_range = range(year_start, year_stop, -1)

def make_df(year):
    # The 'key' column is dropped here: the per-year frames are joined on the index instead.
    df = pd.DataFrame({str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

if __name__ == '__main__':
    # This guard is required on platforms that spawn worker processes (e.g. Windows);
    # without it, creating the Pool raises an error.
    pool = multiprocessing.Pool(year_start - year_stop)  # one worker per year
    df_list = pool.map(func=make_df, iterable=year_range)
    pool.close()
    pool.join()

    # df.join accepts a list of DataFrames and joins them all on the index.
    df = df.join(df_list)
    print(df)
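Note that pool.map returns its results in the order of year_range, so the columns appear in the same descending order (2020 down to 2016) as in the original sequential loop.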

7 Comments

I don't know what your data actually looks like. You can still use merge with the key in a loop over df_list (see the sketch after these comments). It seemed unnecessary in your example, since you always merged on the same column, so I initially replaced it with join (and removed the key column from the function).
Thanks a lot. Your script works with some minor revisions regarding join. However, I don't know why, but it doubles the time. It certainly uses more CPU (%) while running. There seems to be something peculiar in my code.
It must be because of the Python GIL. I've updated my answer to use multiprocessing. Let me know if you see any improvement!
Thank you Tranbi. There is an error when doing pool = multiprocessing.Pool(year_start - year_stop). I have been working on it but cannot find a solution. Is there anything I'm missing?
Did you define year_start and year_stop as in my code? The first one has to be greater, since it defines the size of the pool. Did you import the multiprocessing module?
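A minimal sketch of the keyed variant mentioned in the first comment above (assuming make_df keeps the 'key' column, as in the question): merge each pooled result on 'key' in the main process instead of joining on the index.

import multiprocessing

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    # Keep the 'key' column so the results can be merged on it.
    return pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})

if __name__ == '__main__':
    with multiprocessing.Pool(5) as pool:
        df_list = pool.map(make_df, range(2020, 2015, -1))
    # Merge each per-year frame on 'key' in the main process.
    for year_df in df_list:
        df = pd.merge(df, year_df, on=['key'], how='left')
    print(df)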
