Loop Optimization in Python

Question

I have a DataFrame df like this:

Product Yr Value
A      2014 1
A      2015 3
A      2016 2
B      2015 2
B      2016 1

I want to compute the cumulative maximum of Value within each Product, i.e.:

Product Yr Value
A      2014 1
A      2015 3
A      2016 3
B      2015 2
B      2016 2

My actual data has about 50,000 products. I am currently using code like this:

df2=pd.DataFrame()
for i in (df['Product'].unique()):
    data3=df[df['Product']==i]
    data3.sort_values(by=['Yr'])
    data3['Value']=data3['Value'].cummax()
    df2=df2.append(data3)

#df2 is my result

This code takes a very long time (~3 days) for about 50,000 products and 10 years of data. Is there some way to speed it up?

akuiper · Accepted Answer · 2017-03-06 19:31:32Z


You can use groupby.cummax instead:

df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()

df
#  Product    Yr  Value
#0       A  2014      1
#1       A  2015      3
#2       A  2016      3
#3       B  2015      2
#4       B  2016      2
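
For a self-contained check, here is a minimal sketch that rebuilds the sample frame from the question and applies the same one-liner. The frame construction is my own reconstruction of the data shown in the question; the bracket form ["Value"] is used in place of .Value purely as a style choice. Because cummax() returns a Series that keeps the original index labels, the assignment aligns each result with its original row, so the row order of df is unchanged.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Product": ["A", "A", "A", "B", "B"],
    "Yr": [2014, 2015, 2016, 2015, 2016],
    "Value": [1, 3, 2, 2, 1],
})

# Sort by year so the running maximum follows time order within each product.
# The result carries the original index labels, so assigning it back to
# df["Value"] matches every value to the row it came from.
df["Value"] = df.sort_values("Yr").groupby("Product")["Value"].cummax()

print(df)
```

If the rows should also end up physically ordered, a separate df.sort_values(["Product", "Yr"]) afterwards would do that; the one-liner itself does not reorder df.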


1 Comment

Aaron Over a year ago

groupby is much faster than looping over the unique values and re-indexing into the frame with df[df[...]==i], because each re-slice adds another O(n) scan inside an already O(n) loop, making the whole thing O(n^2). The loop also hands control back to Python very frequently instead of keeping the work inside the C library.
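
The difference described in this comment can be measured with a rough sketch like the one below. The data is synthetic and the sizes (1,000 products, 10 years) are hypothetical, chosen only so the loop finishes quickly. The helper names loop_cummax and groupby_cummax are mine. The loop version keeps the sort result and collects pieces with pd.concat, since DataFrame.append was removed in pandas 2.0; it is otherwise meant to mirror the per-product filtering from the question.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 1,000 products x 10 years (sizes are illustrative).
rng = np.random.default_rng(0)
n_products, years = 1_000, np.arange(2007, 2017)
df = pd.DataFrame({
    "Product": np.repeat(np.arange(n_products), len(years)),
    "Yr": np.tile(years, n_products),
    "Value": rng.integers(0, 100, size=n_products * len(years)),
})

def loop_cummax(frame):
    """Per-product loop in the style of the question."""
    parts = []
    for product in frame["Product"].unique():
        # The boolean mask scans every row of the frame on each iteration.
        part = frame[frame["Product"] == product].sort_values("Yr").copy()
        part["Value"] = part["Value"].cummax()
        parts.append(part)
    return pd.concat(parts)

def groupby_cummax(frame):
    """Single grouped pass, as in the accepted answer."""
    out = frame.copy()
    out["Value"] = frame.sort_values("Yr").groupby("Product")["Value"].cummax()
    return out

start = time.perf_counter()
slow = loop_cummax(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = groupby_cummax(df)
groupby_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f}s, groupby: {groupby_seconds:.3f}s")
```

Timings depend on the machine and pandas version, so none are quoted here; the point is that the loop's cost grows with rows times products, while the grouped version makes one pass.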
