
Imagine a function like the following:

def func(df, cols, col_ref):
    # ref is a second dataframe (columns SOURCE and VALUE) defined outside the function
    for c in cols:
        df[c] = df.apply(lambda row: row[c] * ref[ref.SOURCE == row[col_ref]].VALUE.item(), axis=1)
    return df

The function takes these parameters:

  1. a dataframe with multiple columns (df)
  2. one or more columns (cols)
  3. a reference column where the value of the current row indicates which row of the other dataframe (ref) is used

I can call the function e.g. like this:

df_new = func(df, ['col1','col2','col3'], 'ref_value')

or like this:

df_new2 = func(df, 'col4', 'ref_value')

Is there an alternative to the for loop? My dataframe is huge and it takes up to an hour to perform this with a for loop.

It is important that the function can still handle a single column as well as multiple columns as the second parameter.

EDIT

A simple example:

df
+-----+------+------+------+------+-----------+
| No  | col1 | col2 | col3 | col4 | ref_value |
+-----+------+------+------+------+-----------+
| 523 |   34 |  593 |  100 |   10 | A1        |
| 523 |  100 |  100 |  100 |   43 | A1        |
| 523 | 1867 |   15 |  632 |   64 | B2        |
| 732 |  100 |  943 |  375 |  325 | B1        |
| 732 | 1000 |  656 |  235 |   63 | B1        |
+-----+------+------+------+------+-----------+

ref
+--------+-------+
| SOURCE | VALUE |
+--------+-------+
| A1     |    10 |
| B1     |  1000 |
| B2     |   100 |
+--------+-------+

output:

df_new
+-----+---------+--------+--------+------+-----------+
| No  |  col1   |  col2  |  col3  | col4 | ref_value |
+-----+---------+--------+--------+------+-----------+
| 523 |     340 |   5930 |   1000 |   10 | A1        |
| 523 |    1000 |   1000 |   1000 |   43 | A1        |
| 523 |  186700 |   1500 |  63200 |   64 | B2        |
| 732 |  100000 | 943000 | 375000 |  325 | B1        |
| 732 | 1000000 | 656000 | 235000 |   63 | B1        |
+-----+---------+--------+--------+------+-----------+
  • Firstly, what is the function doing? Commented Jul 23, 2018 at 7:43
  • I added more code. I thought it might be easier without it. I just perform a simple mathematical operation. Commented Jul 23, 2018 at 7:53
  • @MaMo - Is it possible to add some sample data? It seems some join should be possible here. Commented Jul 23, 2018 at 7:54
  • I added a really simple example. I want to mention once more that my code already works, but it is inefficient because of the for loop. Commented Jul 23, 2018 at 8:53

1 Answer

I think a vectorized solution is better here: multiply the columns with mul by a Series created with map:

c = ['col1','col2','col3']
df[c] = df[c].mul(df['ref_value'].map(ref.set_index('SOURCE')['VALUE']), axis=0)
print (df)
    No     col1    col2    col3  col4 ref_value
0  523      340    5930    1000    10        A1
1  523     1000    1000    1000    43        A1
2  523   186700    1500   63200    64        B2
3  732   100000  943000  375000   325        B1
4  732  1000000  656000  235000    63        B1

Detail:

print (df['ref_value'].map(ref.set_index('SOURCE')['VALUE']))
0      10
1      10
2     100
3    1000
4    1000
Name: ref_value, dtype: int64
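
One thing that may be worth checking before multiplying: map returns NaN for any ref_value that has no match in ref.SOURCE, whereas the .item() call in the original loop would raise an error for such a row. A minimal sketch of how to spot those keys, assuming the ref table from the question; the key 'C9' and the variable names keys, factor and missing are made up for illustration:

```python
import pandas as pd

ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})

# 'C9' is a hypothetical key that does not exist in ref.SOURCE
keys = pd.Series(['A1', 'B2', 'C9'], name='ref_value')

# Look up the factor for every row; unmatched keys become NaN
factor = keys.map(ref.set_index('SOURCE')['VALUE'])

# Collect the distinct keys that found no match
missing = keys[factor.isna()].unique().tolist()
print(missing)
```

If missing is not empty, the multiplied columns would contain NaN in those rows, so it is probably better to decide up front whether to drop such rows, fill a default factor, or raise.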

If you need a function:

def func(df, cols, col_ref):
    df[cols] = df[cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)
    return df

df_new = func(df, ['col1','col2','col3'], 'ref_value')
print (df_new)

    No     col1    col2    col3  col4 ref_value
0  523      340    5930    1000    10        A1
1  523     1000    1000    1000    43        A1
2  523   186700    1500   63200    64        B2
3  732   100000  943000  375000   325        B1
4  732  1000000  656000  235000    63        B1
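
Since the question asks for one column as well as several, here is a small sketch of a variant that normalizes a single column name to a list and takes ref as an argument instead of reading it from the enclosing scope. The name scale_by_ref, the extra ref parameter and the copy are my assumptions, not part of the answer above:

```python
import pandas as pd

def scale_by_ref(df, cols, col_ref, ref):
    """Multiply cols by the VALUE that ref holds for each row's col_ref."""
    # Accept a single column name as well as a list of names
    if isinstance(cols, str):
        cols = [cols]
    # Build the SOURCE -> VALUE lookup once and map it onto every row
    factor = df[col_ref].map(ref.set_index('SOURCE')['VALUE'])
    out = df.copy()  # leave the caller's dataframe untouched
    out[cols] = out[cols].mul(factor, axis=0)
    return out

# Sample data from the question
df = pd.DataFrame({
    'No': [523, 523, 523, 732, 732],
    'col1': [34, 100, 1867, 100, 1000],
    'col2': [593, 100, 15, 943, 656],
    'col3': [100, 100, 632, 375, 235],
    'col4': [10, 43, 64, 325, 63],
    'ref_value': ['A1', 'A1', 'B2', 'B1', 'B1'],
})
ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})

df_new = scale_by_ref(df, ['col1', 'col2', 'col3'], 'ref_value', ref)
df_new2 = scale_by_ref(df, 'col4', 'ref_value', ref)
```

Working on a copy avoids modifying df in place, which the function above does; drop the copy if in-place modification is what you want.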

5 Comments

perfect, saves me ages! Thank you :) If I need to leave out a specific ref_value, e.g. I want to keep col1,col2,col3 for ref_value == 'B2', can I add this to the function in a nice way? Now, I subset the dataframe before I call the function but then I have to merge the returned dataframe with the rows I left out again.
@MaMo - I think boolean indexing is fine: call df = df[df.ref_value == 'B2'] before df[cols] = df[cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)
So I split before multiplication and combine the result and the original df with merge afterwards?
I think you need something else if you want to apply the function only to rows matching a condition: use mask = df.ref_value == 'B2' and then df.loc[mask, cols] = df.loc[mask, cols].mul(df[col_ref].map(ref.set_index('SOURCE')['VALUE']), axis=0)
I have to use not-equal, like mask = df.ref_value != 'B2', to exclude those rows. Apart from that, your solution works perfectly and is very fast. Thank you so much!
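
The mask idea from the comments above could be folded into the function roughly like this. It is only a sketch: the name scale_by_ref_except and the skip and ref parameters are hypothetical, and the data is the sample from the question:

```python
import pandas as pd

def scale_by_ref_except(df, cols, col_ref, ref, skip):
    """Like the vectorized function, but leave rows where col_ref == skip unchanged."""
    if isinstance(cols, str):
        cols = [cols]
    factor = df[col_ref].map(ref.set_index('SOURCE')['VALUE'])
    # True for the rows that should be multiplied
    mask = df[col_ref] != skip
    out = df.copy()
    # Only the masked rows are read, multiplied and written back
    out.loc[mask, cols] = out.loc[mask, cols].mul(factor[mask], axis=0)
    return out

df = pd.DataFrame({
    'No': [523, 523, 523, 732, 732],
    'col1': [34, 100, 1867, 100, 1000],
    'col2': [593, 100, 15, 943, 656],
    'col3': [100, 100, 632, 375, 235],
    'col4': [10, 43, 64, 325, 63],
    'ref_value': ['A1', 'A1', 'B2', 'B1', 'B1'],
})
ref = pd.DataFrame({'SOURCE': ['A1', 'B1', 'B2'], 'VALUE': [10, 1000, 100]})

df_masked = scale_by_ref_except(df, ['col1', 'col2', 'col3'], 'ref_value', ref, skip='B2')
```

This way there is no need to split the dataframe beforehand and merge the untouched rows back afterwards.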
