How to optimize python Pandas iteration?

Question

I have a dataframe that looks like this:

''' df: 
        index, sales_fraction, Selected, T_value, A_value, D_value
        1       0.33            t          0.3343   0.33434   0.33434 
        2       0.45            a          0.3434   0.23232   0.33434 
        3       0.56            d          0.3434   0.33434   0.6767
        4       0.545           t          0.3434   0.33434   0.3346
        5       0.343           d          0.2323   0.96342   0.2323
'''

I have a function like this:

def aggregation(df):
    df['sales_fraction'] = df['volume'] / df['volume'].sum()
    res = 0
    for ix, row in df.iterrows():
        if row['Selected'] == 't':
            res += row['sales_fraction'] * row['T_value']
        elif row['Selected'] == 'a':
            res += row['sales_fraction'] * row['A_value']
        elif row['Selected'] == 'd':
            res += row['sales_fraction'] * row['D_value']
    return res

It runs very slowly because I need to call the aggregation function millions of times from within another function. Any suggestions on how I can optimize my code? I would greatly appreciate your help. Thank you!

Use vectorized operations. This should run really quickly, with no need for for loops. Take a look at np.select and np.sum.
– rafaelc, Commented May 6, 2019 at 14:18

jose_bacoy · Accepted Answer · 2019-05-06 15:06:10Z

1

This function uses lookup and sum:

def aggregation(df):  
    return sum(df.lookup(df.index, df['Selected'].str.upper() +'_value')*df['sales_fraction'])
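Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the one-liner above only runs on older versions. A minimal sketch of the same idea with plain NumPy indexing follows; the sample frame is rebuilt from the question, and the value_cols list and the ValueError check are my own assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def aggregation(df):
    value_cols = ['T_value', 'A_value', 'D_value']
    # Name of the column each row should read from, e.g. 't' -> 'T_value'
    wanted = df['Selected'].str.upper() + '_value'
    # Position of that column within value_cols (-1 if not found)
    col_idx = pd.Index(value_cols).get_indexer(wanted)
    if (col_idx < 0).any():
        # Assumption: unknown codes are treated as an error here
        raise ValueError("unexpected value in 'Selected'")
    values = df[value_cols].to_numpy()
    # Pick one value per row: row i, column col_idx[i]
    picked = values[np.arange(len(df)), col_idx]
    return float((picked * df['sales_fraction'].to_numpy()).sum())

# Sample data taken from the question
df = pd.DataFrame({
    'sales_fraction': [0.33, 0.45, 0.56, 0.545, 0.343],
    'Selected': ['t', 'a', 'd', 't', 'd'],
    'T_value': [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    'A_value': [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    'D_value': [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})
result = aggregation(df)
```

On the sample data this should agree with the 0.8606469 reported in the other answers, up to floating-point rounding.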


1 Comment

Elsa Li Over a year ago

Thanks! It is a really elegant one-liner. @âńōŋŷXmoůŜ

Mohit Motwani · Accepted Answer · 2019-05-06 14:27:32Z

1

You can use np.select and np.sum:

import numpy as np

# One boolean mask per possible value of 'Selected'
cond1 = df['Selected'] == 't'
cond2 = df['Selected'] == 'a'
cond3 = df['Selected'] == 'd'
# The weighted value to use when the matching mask is True
val1 = df['sales_fraction'] * df['T_value']
val2 = df['sales_fraction'] * df['A_value']
val3 = df['sales_fraction'] * df['D_value']
conditions = [cond1, cond2, cond3]
values = [val1, val2, val3]

res = np.sum(np.select(conditions, values))

np.select accepts a list of conditions and a list of corresponding values. For each row it returns the value paired with the first condition that is true, and a default of 0 where none is true. np.sum then adds up the selected values.
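To make the behaviour for unmatched rows concrete, here is a small sketch on made-up data (the three-row frame and the 'x' code are hypothetical, chosen only for illustration). A row whose Selected value matches no condition falls back to the default and contributes nothing, which mirrors the original loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has a code that matches no condition
df = pd.DataFrame({
    'sales_fraction': [0.5, 0.3, 0.2],
    'Selected': ['t', 'a', 'x'],
    'T_value': [1.0, 1.0, 1.0],
    'A_value': [2.0, 2.0, 2.0],
    'D_value': [3.0, 3.0, 3.0],
})

conditions = [
    df['Selected'] == 't',
    df['Selected'] == 'a',
    df['Selected'] == 'd',
]
values = [
    df['sales_fraction'] * df['T_value'],
    df['sales_fraction'] * df['A_value'],
    df['sales_fraction'] * df['D_value'],
]

# default=0.0 is what unmatched rows receive
per_row = np.select(conditions, values, default=0.0)
total = per_row.sum()
```

Here per_row holds one weighted value per row, with 0.0 in the last position, so total only counts the first two rows.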


BENY · Accepted Answer · 2019-05-06 14:29:35Z

1

I am using lookup

s=df.loc[:,'T_value':]
s.columns=s.columns.str.split('_').str[0]
np.sum(df.sales_fraction*s.lookup(s.index,df.Selected.str.upper()))
Out[1421]: 0.8606469


1 Comment

Elsa Li Over a year ago

Thank you for the help! @WeNYoBen

Quang Hoang · Accepted Answer · 2019-05-06 14:33:30Z

1

Try pd.get_dummies():

weights = pd.get_dummies(df.Selected)[['t','a', 'd']]
selected = (df[['T_value', 'A_value', 'D_value']].values * weights.values).sum(1)
(selected * df['sales_fraction']).sum()

# 0.8606469


2 Comments

Elsa Li Over a year ago

Thanks @Quang Hoang! It seems that if any of ['a', 't', 'd'] does not appear in the column df['Selected'], the first line of code raises an error like this: KeyError: "['a' 'd'] not in index"

Quang Hoang Over a year ago

@ElsaLi that's true, I did not think of that.
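One possible way around that KeyError, offered as a sketch rather than as the answerer's own fix, is to reindex the dummy columns so that all three codes are always present, filling the missing ones with zeros. The two-row frame below is hypothetical and deliberately contains only 't'.

```python
import pandas as pd

# Hypothetical data where 'a' and 'd' never occur in 'Selected'
df = pd.DataFrame({
    'sales_fraction': [0.6, 0.4],
    'Selected': ['t', 't'],
    'T_value': [0.5, 0.25],
    'A_value': [0.1, 0.1],
    'D_value': [0.9, 0.9],
})

# reindex adds any missing indicator column as all zeros
weights = (
    pd.get_dummies(df['Selected'], dtype=float)
    .reindex(columns=['t', 'a', 'd'], fill_value=0.0)
)
# Column order of weights must match the order of the value columns
selected = (
    df[['T_value', 'A_value', 'D_value']].to_numpy() * weights.to_numpy()
).sum(axis=1)
result = (selected * df['sales_fraction']).sum()
```

Passing dtype=float keeps every indicator column numeric, so the element-wise product stays a plain float array.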

atgbadr · Accepted Answer · 2019-05-06 15:11:01Z

1

If I understood your calculation correctly, may I suggest trying this line of code and comparing its result with your function's (everything is inline):

(
    (df.loc[df["Selected"] == 't', "T_value"] * df.loc[df["Selected"] == 't', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'a', "A_value"] * df.loc[df["Selected"] == 'a', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'd', "D_value"] * df.loc[df["Selected"] == 'd', "sales_fraction"]).sum()
)


1 Comment

Elsa Li Over a year ago

Thanks for the help! @A.Eddine
