For loop alternative for multiple columns within a function (pandas)

Question

Imagine a function like the following:

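def func(df, cols, col_ref):
    # `ref` is the lookup dataframe, read from the enclosing scope
    if isinstance(cols, str):
        cols = [cols]  # a bare string would otherwise be iterated character by character
    for c in cols:
        df[c] = df.apply(lambda row: row[c] * ref[ref.SOURCE == row[col_ref]].VALUE.item(), axis=1)
    return df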

The parameters passed to this function are:

a dataframe with multiple columns (df)
one or more columns (cols)
a reference column (col_ref) whose value in the current row selects which row of the other dataframe (ref) is used

I can call the function e.g. like this:

df_new = func(df, ['col1','col2','col3'], 'ref_value')

or like this:

df_new2 = func(df, 'col4', 'ref_value')

Is there an alternative to the for loop? My dataframe is huge and it takes up to an hour to perform this with a for loop.

It is important that the function can still handle a single column as well as multiple columns as the second parameter.

EDIT

A simple example:

df
+-----+------+------+------+------+-----------+
| No  | col1 | col2 | col3 | col4 | ref_value |
+-----+------+------+------+------+-----------+
| 523 |   34 |  593 |  100 |   10 | A1        |
| 523 |  100 |  100 |  100 |   43 | A1        |
| 523 | 1867 |   15 |  632 |   64 | B2        |
| 732 |  100 |  943 |  375 |  325 | B1        |
| 732 | 1000 |  656 |  235 |   63 | B1        |
+-----+------+------+------+------+-----------+

ref
+--------+-------+
| SOURCE | VALUE |
+--------+-------+
| A1     |    10 |
| B1     |  1000 |
| B2     |   100 |
+--------+-------+

output:

df_new
+-----+---------+--------+--------+------+-----------+
| No  |  col1   |  col2  |  col3  | col4 | ref_value |
+-----+---------+--------+--------+------+-----------+
| 523 |     340 |   5930 |   1000 |   10 | A1        |
| 523 |    1000 |   1000 |   1000 |   43 | A1        |
| 523 |  186700 |   1500 |  63200 |   64 | B2        |
| 732 |  100000 | 943000 | 375000 |  325 | B1        |
| 732 | 1000000 | 656000 | 235000 |   63 | B1        |
+-----+---------+--------+--------+------+-----------+
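
For anyone who wants to reproduce this, the two input tables above could be built roughly as follows. This is only a sketch: the values are copied from the tables, and the dtypes are whatever pandas infers by default.

```python
import pandas as pd

# Input frame, values copied from the `df` table in the question
df = pd.DataFrame({
    'No':        [523, 523, 523, 732, 732],
    'col1':      [34, 100, 1867, 100, 1000],
    'col2':      [593, 100, 15, 943, 656],
    'col3':      [100, 100, 632, 375, 235],
    'col4':      [10, 43, 64, 325, 63],
    'ref_value': ['A1', 'A1', 'B2', 'B1', 'B1'],
})

# Lookup frame, values copied from the `ref` table in the question
ref = pd.DataFrame({
    'SOURCE': ['A1', 'B1', 'B2'],
    'VALUE':  [10, 1000, 100],
})

print(df)
print(ref)
```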

I added more code; I thought it might be easier without it. I just perform a simple mathematical operation.
– MaMo, Commented Jul 23, 2018 at 7:53
@MaMo - Could you add some sample data? It seems a join should be possible here.
– jezrael, Commented Jul 23, 2018 at 7:54
I added a really simple example. I want to mention once more that my code already works, but it is inefficient because of the for loop.
– MaMo, Commented Jul 23, 2018 at 8:53

jezrael · Accepted Answer · 2018-07-23 09:10:45Z


I think a vectorized solution is better here: multiply with mul by a Series created with map:

c = ['col1','col2','col3']
df[c] = df[c].mul(df['ref_value'].map(ref.set_index('SOURCE')['VALUE']), axis=0)
print (df)
    No     col1    col2    col3  col4 ref_value
0  523      340    5930    1000    10        A1
1  523     1000    1000    1000    43        A1
2  523   186700    1500   63200    64        B2
3  732   100000  943000  375000   325        B1
4  732  1000000  656000  235000    63        B1

Detail:

print (df['ref_value'].map(ref.set_index('SOURCE')['VALUE']))
0      10
1      10
2     100
3    1000
4    1000
Name: ref_value, dtype: int64
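
One thing the detail output does not show is what happens when the reference column contains a key that is missing from ref: map then returns NaN for that row, so the product would be NaN as well. A small sketch of this, where the key 'C3' is made up for illustration (it is not in the original data) and filling with 1 is just one possible policy:

```python
import pandas as pd

ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})
lookup = ref.set_index('SOURCE')['VALUE']  # Series indexed by SOURCE

# 'C3' is a hypothetical key that does not exist in ref
keys = pd.Series(['A1', 'B2', 'C3'])

# Keys without a match in the lookup Series map to NaN
factors = keys.map(lookup)
print(factors)

# One possible policy (an assumption): treat unmatched rows as "multiply by 1"
factors_filled = factors.fillna(1)
print(factors_filled)
```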

If you need a function:

def func(df, cols, col_ref):
    df[cols] = df[cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)
    return df

df_new = func(df, ['col1','col2','col3'], 'ref_value')
print (df_new)

    No     col1    col2    col3  col4 ref_value
0  523      340    5930    1000    10        A1
1  523     1000    1000    1000    43        A1
2  523   186700    1500   63200    64        B2
3  732   100000  943000  375000   325        B1
4  732  1000000  656000  235000    63        B1
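
A possible variant of the function above, sketched under a few assumptions of mine: the name scale_by_ref is made up, ref is passed in as an argument instead of being read from the enclosing scope, and the input is copied so the caller's dataframe is not modified. It also shows that both a list of columns and a single column name go through the same code path.

```python
import pandas as pd

def scale_by_ref(df, cols, col_ref, ref):
    """Return a copy of df with `cols` multiplied by the VALUE matched via `col_ref`.

    Hypothetical helper: `cols` may be a list of column names or a single name.
    """
    out = df.copy()
    # One factor per row, aligned on the row index of `out`
    factors = out[col_ref].map(ref.set_index('SOURCE')['VALUE'])
    # axis=0 aligns the factors with rows; works for a DataFrame or a Series
    out[cols] = out[cols].mul(factors, axis=0)
    return out

df = pd.DataFrame({
    'No':        [523, 523, 523, 732, 732],
    'col1':      [34, 100, 1867, 100, 1000],
    'col2':      [593, 100, 15, 943, 656],
    'col3':      [100, 100, 632, 375, 235],
    'col4':      [10, 43, 64, 325, 63],
    'ref_value': ['A1', 'A1', 'B2', 'B1', 'B1'],
})
ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})

multi = scale_by_ref(df, ['col1', 'col2', 'col3'], 'ref_value', ref)
single = scale_by_ref(df, 'col4', 'ref_value', ref)
print(multi)
print(single)
```

Copying costs extra memory on a very large dataframe, so dropping the copy and modifying in place, as the original function does, may be preferable there.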

edited Jul 23, 2018 at 9:10

answered Jul 23, 2018 at 9:03


5 Comments

MaMo Over a year ago

perfect, saves me ages! Thank you :) If I need to leave out a specific ref_value, e.g. I want to keep col1,col2,col3 for ref_value == 'B2', can I add this to the function in a nice way? Now, I subset the dataframe before I call the function but then I have to merge the returned dataframe with the rows I left out again.

jezrael Over a year ago

@MaMo - I think boolean indexing is fine: call df = df[df.ref_value == 'B2'] before df[cols] = df[cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)

MaMo Over a year ago

So I split before multiplication and combine the result and the original df with merge afterwards?

jezrael Over a year ago

I think something else is needed if you want to apply the function only where a condition holds: use mask = df.ref_value == 'B2' and then df.loc[mask, cols] = df.loc[mask, cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)

MaMo Over a year ago

I have to use not equal like this mask = df.ref_value != 'B2' to exclude those rows. But besides this, your solution works perfectly and is very fast. Thank you so much!
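
Putting the comments above together, the masked version might look like the sketch below. The function name scale_by_ref_except and the skip parameter are my own assumptions for illustration; only the mask and df.loc pattern comes from the comments. Rows whose reference value is in skip keep their original values, so no split and merge is needed.

```python
import pandas as pd

def scale_by_ref_except(df, cols, col_ref, ref, skip):
    """Multiply `cols` by the matched VALUE, leaving rows whose
    `col_ref` value is in `skip` untouched (hypothetical helper)."""
    out = df.copy()
    # True for rows that should be scaled (the "not equal" mask from the comments)
    mask = ~out[col_ref].isin(skip)
    factors = out.loc[mask, col_ref].map(ref.set_index('SOURCE')['VALUE'])
    out.loc[mask, cols] = out.loc[mask, cols].mul(factors, axis=0)
    return out

df = pd.DataFrame({
    'col1':      [34, 100, 1867, 100, 1000],
    'col2':      [593, 100, 15, 943, 656],
    'ref_value': ['A1', 'A1', 'B2', 'B1', 'B1'],
})
ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})

# Rows with ref_value == 'B2' are left as they are
result = scale_by_ref_except(df, ['col1', 'col2'], 'ref_value', ref, skip=['B2'])
print(result)
```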
