I have a dataframe df like this:

Product Yr Value
A      2014 1
A      2015 3
A      2016 2
B      2015 2
B      2016 1

I want to compute the cumulative max of Value per product, i.e.:

Product Yr Value
A      2014 1
A      2015 3
A      2016 3
B      2015 2
B      2016 2

My actual data has about 50,000 products. I am writing code like this:

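import pandas as pd

parts = []  # collect per-product results; DataFrame.append was removed in pandas 2.0
for i in df['Product'].unique():
    # sort_values returns a new frame, so the result has to be assigned
    data3 = df[df['Product'] == i].sort_values(by=['Yr'])
    data3['Value'] = data3['Value'].cummax()
    parts.append(data3)
df2 = pd.concat(parts)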

#df2 is my result

This code takes a lot of time (~3 days) for about 50,000 products and 10 years. Is there some way to speed it up?

1 Answer

You can use groupby.cummax instead:

df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()

df
#  Product    Yr  Value
#0       A  2014      1
#1       A  2015      3
#2       A  2016      3
#3       B  2015      2
#4       B  2016      2
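
For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the sample data from the question. The helper name cummax_by_product is made up for illustration and is not a pandas function. It also sorts by Product and Yr and resets the index, which is an assumption about the desired output order rather than something the question requires.

```python
import pandas as pd

def cummax_by_product(df):
    """Return a sorted copy of df with Value replaced by its running max per Product."""
    # Sort so that the running max follows Yr order within each product.
    out = df.sort_values(['Product', 'Yr']).reset_index(drop=True)
    # groupby(...).cummax() restarts the running max at each new product.
    out['Value'] = out.groupby('Product')['Value'].cummax()
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'B', 'B'],
    'Yr': [2014, 2015, 2016, 2015, 2016],
    'Value': [1, 3, 2, 2, 1],
})

result = cummax_by_product(df)
print(result)
```

Returning a new frame instead of assigning back into df leaves the input untouched, which can make the step easier to test.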

1 Comment

groupby is way faster than looping over the unique values and re-indexing into the df with df[df[...]==i], because each re-slice is another O(n) step inside an already O(n) loop, making it O(n^2). Additionally, the loop passes control back to Python very frequently instead of keeping the calls within the C library.
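
To make the comparison concrete, here is a rough sketch that runs both approaches on synthetic data and checks that they agree. The sizes, the random values, and the function names loop_cummax and groupby_cummax are all assumptions chosen for a quick check; they are not from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_products, n_years = 200, 10  # small, assumed sizes so the check runs quickly
demo = pd.DataFrame({
    'Product': np.repeat(np.arange(n_products), n_years),
    'Yr': np.tile(np.arange(2007, 2007 + n_years), n_products),
    'Value': rng.integers(0, 100, n_products * n_years),
})

def loop_cummax(df):
    # One boolean mask over the whole frame per product: O(rows) work each time.
    parts = []
    for p in df['Product'].unique():
        part = df[df['Product'] == p].sort_values('Yr')
        parts.append(part['Value'].cummax())
    return pd.concat(parts).sort_index()

def groupby_cummax(df):
    # A single grouped pass over the sorted frame.
    return df.sort_values('Yr').groupby('Product')['Value'].cummax().sort_index()

same = loop_cummax(demo).equals(groupby_cummax(demo))
print(same)
```

Wrapping each call in time.perf_counter() while increasing n_products should show how the two scale; actual timings will depend on the machine and pandas version.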
