
I have a DataFrame that looks like this:

df:
        index   sales_fraction  Selected   T_value  A_value   D_value
        1       0.33            t          0.3343   0.33434   0.33434
        2       0.45            a          0.3434   0.23232   0.33434
        3       0.56            d          0.3434   0.33434   0.6767
        4       0.545           t          0.3434   0.33434   0.3346
        5       0.343           d          0.2323   0.96342   0.2323

I have a function like this:

def aggregation(df):
    df['sales_fraction'] = df['volume'] / df['volume'].sum()
    res = 0
    for ix, row in df.iterrows():
        if row['Selected'] == 't':
            res += row['sales_fraction'] * row['T_value']
        elif row['Selected'] == 'a':
            res += row['sales_fraction'] * row['A_value']
        elif row['Selected'] == 'd':
            res += row['sales_fraction'] * row['D_value']

    return res

It runs very slowly, because I need to call the aggregation function millions of times from within another function. Any suggestions on how I can optimize my code? I would greatly appreciate your help. Thank you!

Use vectorized operations. This should run really quickly, with no need for for loops. Take a look at np.select and np.sum. Commented May 6, 2019 at 14:18

5 Answers


This function uses DataFrame.lookup and sum:

def aggregation(df):  
    return sum(df.lookup(df.index, df['Selected'].str.upper() +'_value')*df['sales_fraction'])
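Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the one-liner above only runs on older versions. As a rough sketch of the same idea on newer pandas, the per-row pick can be done with NumPy integer indexing. The function name `aggregation_indexing`, the `VALUE_COLS` list, and the sample frame are assumptions made for illustration; the data is copied from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'sales_fraction': [0.33, 0.45, 0.56, 0.545, 0.343],
    'Selected': ['t', 'a', 'd', 't', 'd'],
    'T_value': [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    'A_value': [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    'D_value': [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})

# Hypothetical helper list: the columns that 'Selected' can point at.
VALUE_COLS = ['T_value', 'A_value', 'D_value']

def aggregation_indexing(df):
    # Build the column label chosen in each row, e.g. 't' -> 'T_value'.
    labels = df['Selected'].str.upper() + '_value'
    # Translate each label into its position within VALUE_COLS.
    col_pos = pd.Index(VALUE_COLS).get_indexer(labels)
    # Pick exactly one value per row with NumPy fancy indexing.
    values = df[VALUE_COLS].to_numpy()
    picked = values[np.arange(len(df)), col_pos]
    # Weight by sales_fraction and sum.
    return float((picked * df['sales_fraction'].to_numpy()).sum())

result = aggregation_indexing(df)
print(result)
```

This sketch assumes every entry in 'Selected' is one of 't', 'a' or 'd'; get_indexer returns -1 for an unknown label, which would silently select the last column, so validate the input first if that can happen.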

1 Comment

Thanks! It is a really elegant one-liner. @âńōŋŷXmoůŜ

You can use np.select and np.sum:

import numpy as np

# One boolean mask per possible value of 'Selected'
cond1 = df['Selected'] == 't'
cond2 = df['Selected'] == 'a'
cond3 = df['Selected'] == 'd'
# The weighted value to use where the matching mask is True
val1 = df['sales_fraction'] * df['T_value']
val2 = df['sales_fraction'] * df['A_value']
val3 = df['sales_fraction'] * df['D_value']
conditions = [cond1, cond2, cond3]
values = [val1, val2, val3]

res = np.sum(np.select(conditions, values))

np.select accepts a list of conditions and a list of values of the same length. For each row it returns the value that belongs to the first condition that is True (and 0 by default where no condition matches). np.sum then adds up the selected values.
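To check that the vectorized version agrees with the original loop, here is a small self-contained sketch using the sample data from the question. The names `COLUMN_FOR`, `aggregation_loop` and `aggregation_select` are made up for this illustration, and the loop is a condensed form of the question's if/elif chain rather than the exact original.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'sales_fraction': [0.33, 0.45, 0.56, 0.545, 0.343],
    'Selected': ['t', 'a', 'd', 't', 'd'],
    'T_value': [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    'A_value': [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    'D_value': [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})

# Hypothetical mapping from a 'Selected' key to its value column.
COLUMN_FOR = {'t': 'T_value', 'a': 'A_value', 'd': 'D_value'}

def aggregation_loop(df):
    # Row-by-row reference version; rows with an unknown key are skipped.
    res = 0.0
    for _, row in df.iterrows():
        col = COLUMN_FOR.get(row['Selected'])
        if col is not None:
            res += row['sales_fraction'] * row[col]
    return res

def aggregation_select(df):
    # One boolean mask and one candidate array per key.
    conditions = [(df['Selected'] == key).to_numpy() for key in COLUMN_FOR]
    choices = [df[col].to_numpy() for col in COLUMN_FOR.values()]
    # default=0.0 mirrors the loop skipping rows with an unknown key.
    picked = np.select(conditions, choices, default=0.0)
    return float(np.dot(picked, df['sales_fraction'].to_numpy()))

loop_result = aggregation_loop(df)
select_result = aggregation_select(df)
print(loop_result, select_result)
```

Both functions should return the same number up to floating-point rounding. How much faster the vectorized version is depends on the frame size, so it is worth timing both on your own data.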


I am using lookup

s = df.loc[:, 'T_value':]
s.columns = s.columns.str.split('_').str[0]
np.sum(df.sales_fraction * s.lookup(s.index, df.Selected.str.upper()))
Out[1421]: 0.8606469

1 Comment

Thank you for the help! @WeNYoBen

Try pd.get_dummies():

weights = pd.get_dummies(df.Selected)[['t','a', 'd']]
selected = (df[['T_value', 'A_value', 'D_value']].values * weights.values).sum(1)
(selected * df['sales_fraction']).sum()

# 0.8606469

2 Comments

Thanks @Quang Hoang! It seems that if any of ['a', 't', 'd'] does not appear in the column df['Selected'], the first line of code raises an error like this: KeyError: "['a' 'd'] not in index"
@ElsaLi That's true, I did not think of that.
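One possible way around the KeyError raised in the comment above is to reindex the dummy columns so that all three keys are always present, filling the missing ones with zeros. This is a sketch rather than part of the original answer; the function name `aggregation_dummies` and the sample frame are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'sales_fraction': [0.33, 0.45, 0.56, 0.545, 0.343],
    'Selected': ['t', 'a', 'd', 't', 'd'],
    'T_value': [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    'A_value': [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    'D_value': [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})

def aggregation_dummies(df):
    # reindex guarantees the 't', 'a', 'd' columns exist, in this order,
    # even when a key never occurs in 'Selected'.
    weights = pd.get_dummies(df['Selected']).reindex(
        columns=['t', 'a', 'd'], fill_value=0
    )
    values = df[['T_value', 'A_value', 'D_value']].to_numpy()
    # Each row keeps only the value whose indicator is 1.
    selected = (values * weights.to_numpy(dtype=float)).sum(axis=1)
    return float((selected * df['sales_fraction'].to_numpy()).sum())

# A frame where only 't' occurs no longer raises a KeyError.
only_t = df[df['Selected'] == 't']
full_result = aggregation_dummies(df)
only_t_result = aggregation_dummies(only_t)
print(full_result, only_t_result)
```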

If I understood your calculation correctly, I suggest you try the following single expression and compare its result with that of your function:

(
    (df.loc[df["Selected"] == 't', "T_value"] * df.loc[df["Selected"] == 't', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'a', "A_value"] * df.loc[df["Selected"] == 'a', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'd', "D_value"] * df.loc[df["Selected"] == 'd', "sales_fraction"]).sum()
)

1 Comment

Thanks for the help! @A.Eddine
