My script is as follows:
import pandas as pd
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       str(year): [str(year), str(year+1), str(year+2), str(year+3)]})
    return df

for year in range(2020, 2015, -1):
    df = pd.merge(df, make_df(year), on=['key'], how='left')
The final df will be:
  key   A  2020  2019  2018  2017  2016
0  K0  A0  2020  2019  2018  2017  2016
1  K1  A1  2021  2020  2019  2018  2017
2  K2  A2  2022  2021  2020  2019  2018
3  K3  A3  2023  2022  2021  2020  2019
My actual make_new_df(year) is much more complex and takes too much time.
How can I parallelize the for-loop (for year in range(2020, 2015, -1):) and shorten the processing time?
You could use threading, multiprocessing, or external modules like ray, joblib, or pyspark, which may have some functions for DataFrame. There is probably even a module whose name I don't remember (pandas-???) that can add multiprocessing to DataFrame.
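One way to try the standard-library route is sketched below. It assumes each make_df(year) call is independent of the others and that its result can be pickled. The helper name merge_years is hypothetical, and make_df is the toy version from the question, standing in for the slow real function. The expensive calls run in a process pool through concurrent.futures, and only the cheap merges stay sequential.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

import pandas as pd

KEYS = ['K0', 'K1', 'K2', 'K3']

base = pd.DataFrame({'key': KEYS, 'A': ['A0', 'A1', 'A2', 'A3']})


def make_df(year):
    # Toy stand-in for the slow function; it must live at module level
    # so that worker processes can import it.
    return pd.DataFrame({'key': KEYS,
                         str(year): [str(year + i) for i in range(4)]})


def merge_years(base_df, years, executor):
    # Hypothetical helper: build the per-year frames concurrently.
    # executor.map yields results in the same order as `years`.
    frames = list(executor.map(make_df, years))
    # Merge sequentially; this part is assumed to be cheap compared
    # with make_df itself.
    return reduce(
        lambda left, right: pd.merge(left, right, on='key', how='left'),
        frames,
        base_df,
    )


if __name__ == '__main__':
    # The guard is needed on platforms that start workers by re-importing
    # this module.
    with ProcessPoolExecutor() as pool:
        result = merge_years(base, range(2020, 2015, -1), pool)
    print(result)
```

Because executor.map keeps the input order, the columns come out in the same order as in the original loop. If the real function mostly waits on I/O, passing a ThreadPoolExecutor instead may be enough. Whether either variant gives a speedup depends on how much time is spent inside make_df compared with the cost of sending the frames back from the workers.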