
I have two dataframes and some code that extracts data from one of them and adds it to the other:

import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name="sales", header=0)
born = pd.read_excel("data.xlsx", sheet_name="born", header=0)

bornuni = born["number"].unique()
for babies in bornuni:
    # all rows in `born` for this baby number; only the first one is used
    dataframe = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, "ini_weight"] = dataframe["weight"].iloc[0]
            sales.loc[i, "ini_date"] = dataframe["date of birth"].iloc[0]
        else:
            pass

This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.

1 Answer


So before worrying about parallelizing, I can't help but notice that you're using a lot of for loops over the dataframes. Dataframes are fast when you use their vectorized capabilities.

I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.

It seems to me you want to accomplish the following:

For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.

There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

I strongly suggest you take a look at those, try the ideas from these articles, and then reframe your question in terms of those operations, because, as you correctly noticed, repeatedly looping over all the rows to find a matching one is very inefficient.
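As a rough, untested sketch of what I mean (column names "number", "weight" and "date of birth" are assumed from your code), you could take the first born row per number and then left-merge it onto sales:

import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name="sales", header=0)
born = pd.read_excel("data.xlsx", sheet_name="born", header=0)

# first row per baby number, mirroring the .iloc[0] in your loop
first_born = (
    born.drop_duplicates(subset="number", keep="first")
        [["number", "weight", "date of birth"]]
        .rename(columns={"weight": "ini_weight", "date of birth": "ini_date"})
)

# a left merge keeps every row of sales (duplicates included) and fills
# ini_weight / ini_date wherever a matching number exists
sales = sales.merge(first_born, on="number", how="left")

If sales already contains ini_weight / ini_date columns, drop them before merging so you don't end up with _x / _y suffixes.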


4 Comments

Actually my initial code used merging, but in the larger data sets there are lots of duplicates and I can't lose that data, so I went for the for loops. I know it is very inefficient, which is why I want to change the code. The idea you provided is great, but sadly when I wrote that code I tried something similar to what you proposed and it didn't work.
But then from the born dataframe you're only taking the very first weight. In that case you might consider building a temporary dataframe where you sort by the number and then use groupby (or actually the unique function) to grab just the first weight and date of birth.
The dataframe that has duplicates is the sales one, and I can't lose those duplicates, so using groupby would mean sacrificing those duplicates, which are important to the end product.
In that case I still feel that a pivot, or the reverse of it, should work better.
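On the duplicates concern: a many-to-one left merge does not drop duplicate rows on the left side. A tiny made-up example to illustrate:

import pandas as pd

sales = pd.DataFrame({"number": [1, 1, 2]})                      # duplicate number 1
born = pd.DataFrame({"number": [1, 2], "weight": [3.2, 2.9]})

merged = sales.merge(born, on="number", how="left")
print(len(merged))  # 3 -- both rows with number 1 survive the merge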
