
I have two dataframes and some code that extracts data from one of them and adds it to the other:

import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name="sales", header=0)
born = pd.read_excel("data.xlsx", sheet_name="born", header=0)

bornuni = born["number"].unique()
for babies in bornuni:
    # all rows in `born` for this baby number; only the first one is used
    dataframe = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, "ini_weight"] = dataframe["weight"].iloc[0]
            sales.loc[i, "ini_date"] = dataframe["date of birth"].iloc[0]
        else:
            pass

This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.

1 Answer


So before worrying about parallelizing, I can't help but notice that you're using a lot of for loops over the dataframes. Dataframes are fast when you use their vectorized capabilities.

I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.

It seems to me you want to accomplish the following:

For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.

There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

I strongly suggest you take a look at those, try the ideas from these articles, and then reframe your question in terms of those operations, because, as you correctly noticed, repeatedly looping over all the rows to find a matching one is very inefficient.
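As a rough, untested sketch of what I mean (column names "number", "weight" and "date of birth" are assumed from your code), you could take the first born row per number and then left-merge it onto sales:

import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name="sales", header=0)
born = pd.read_excel("data.xlsx", sheet_name="born", header=0)

# first row per baby number, mirroring the .iloc[0] in your loop
first_born = (
    born.drop_duplicates(subset="number", keep="first")
        [["number", "weight", "date of birth"]]
        .rename(columns={"weight": "ini_weight", "date of birth": "ini_date"})
)

# a left merge keeps every row of sales (duplicates included) and fills
# ini_weight / ini_date wherever a matching number exists
sales = sales.merge(first_born, on="number", how="left")

If sales already contains ini_weight / ini_date columns, drop them before merging so you don't end up with _x / _y suffixes.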


4 Comments

Actually my initial code used merging, but in the larger data sets there are lots of duplicates and I can't lose that data, so I went for the for loops. I know it is very inefficient, which is why I want to change the code. The idea you provided is great, but sadly when I wrote that code I tried something similar to what you proposed and it didn't work.
But then from the born dataframe you're only taking the very first weight. In that case you might consider building a temporary dataframe where you sort by the number and then use groupby (or actually the unique function) to grab just the first weight and date of birth.
The dataframe that has duplicates is the sales one, and I can't lose those duplicates, so using groupby would mean sacrificing those duplicates, which are important to the end product.
In that case I still feel that a pivot, or the reverse of it, should work better.
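On the duplicates concern: a many-to-one left merge does not drop duplicate rows on the left side. A tiny made-up example to illustrate:

import pandas as pd

sales = pd.DataFrame({"number": [1, 1, 2]})                      # duplicate number 1
born = pd.DataFrame({"number": [1, 2], "weight": [3.2, 2.9]})

merged = sales.merge(born, on="number", how="left")
print(len(merged))  # 3 -- both rows with number 1 survive the merge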
