Pandas - calculate new column with variable column input

Question

Here's the problem. Imagine the following DataFrame as an example:

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],'col3': [3, 4, 5, 6, 7],'col4': [1, 2, 3, 3, 2]})

Now I would like to add another column, "col5", which is calculated as follows:

If the value of "col4" is 1, take the corresponding value from the column at position 1 (i.e. "col2" in this case); if "col4" is 2, take the corresponding value from the column at position 2 (i.e. "col3" in this case); and so on.

I have tried the line below and variations of it, but I can't seem to get the right result:

df["col5"] = df.apply(lambda x: df.iloc[x,df[df.columns[df["col4"]]]])

Any help is much appreciated!

zipa · Accepted Answer · 2018-11-29 13:05:22Z

2

If 'col4' holds the zero-based position of the column to read from, this will work:

df['col5'] = df.apply(lambda x: x[df.columns[x['col4']]], axis=1)

df

#   col1  col2  col3  col4  col5
#0     1     3     3     1     3
#1     2     4     4     2     4
#2     3     5     5     3     3
#3     4     6     6     3     3
#4     5     7     7     2     7

answered Nov 29, 2018 at 13:05


2 Comments

SanMu Over a year ago

Indeed, thanks. I actually tried something similar, but without the axis argument...
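As a small illustration of why the axis argument matters here: with the default axis=0, apply hands the function one column at a time, whereas axis=1 hands it one row at a time. The sketch below uses a made-up two-column frame (not the one from the question) and simply measures the length of whatever the function receives.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Default (axis=0): the function receives each column as a Series,
# so each call sees 3 values and the result is indexed by column label.
per_column = df.apply(lambda s: len(s))

# axis=1: the function receives each row as a Series,
# so each call sees 2 values and the result is indexed by row label.
per_row = df.apply(lambda s: len(s), axis=1)

print(per_column)
print(per_row)
```

Looking up a per-row value such as x['col4'] only makes sense in the second case, where x is a row.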

jpp Over a year ago

Note this solution involves a Python-level loop. You may find a list comprehension better. See this related answer.
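The related answer is not reproduced here, so the following is only a sketch of what a list comprehension for this problem might look like. It assumes, as in the question, that 'col4' holds valid zero-based column positions and that col5 has not been added yet.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],
                   'col3': [3, 4, 5, 6, 7], 'col4': [1, 2, 3, 3, 2]})

# Pull the data out once as a 2-D NumPy array
# (to_numpy needs pandas 0.24 or later; df.values works on older versions).
values = df.to_numpy()

# For each row, pick the element at the position stored in col4.
df['col5'] = [row[idx] for row, idx in zip(values, df['col4'])]

print(df)
```

This is still a Python-level loop, but it avoids building a Series object for every row the way apply with axis=1 does.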

jpp · Accepted Answer · 2018-11-29 17:21:51Z

1

You can use fancy indexing with NumPy (imported as np) and avoid a Python-level loop altogether:

df['col5'] = df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']]

print(df)

   col1  col2  col3  col4  col5
0     1     3     3     1     3
1     2     4     4     2     4
2     3     5     5     3     3
3     4     6     6     3     3
4     5     7     7     2     7
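For readers who want to run this from scratch, here is the same idea broken into named steps with the imports included. The variable names values, rows and cols are introduced only for illustration. It assumes every entry in 'col4' is a valid zero-based column position; an out-of-range value raises an IndexError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],
                   'col3': [3, 4, 5, 6, 7], 'col4': [1, 2, 3, 3, 2]})

# 2-D array of the four source columns.
values = df[['col1', 'col2', 'col3', 'col4']].to_numpy()

# One row index per row: 0, 1, 2, ...
rows = np.arange(len(df))

# One column index per row, taken from col4.
cols = df['col4'].to_numpy()

# Paired integer arrays select element (rows[i], cols[i]) for each i.
df['col5'] = values[rows, cols]

print(df)
```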

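You should see significant performance benefits for larger DataFrames. On the 50,000-row frame built below, apply took about 2.36 s per loop versus about 1 ms for the NumPy version: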

df = pd.concat([df]*10**4, ignore_index=True)

%timeit df.apply(lambda x: x[df.columns[x['col4']]], axis=1)       # 2.36 s per loop
%timeit df.iloc[:, :4].values[np.arange(df.shape[0]), df['col4']]  # 1.01 ms per loop

edited Nov 29, 2018 at 17:21

answered Nov 29, 2018 at 12:59


3 Comments

SanMu Over a year ago

Amazing, works. Thank you! One more thing: how would I add a condition to this, e.g. only do the lookup if the value in col4 is > 1, and otherwise take 0?

jpp Over a year ago

@SanMu, no problem. I have in fact updated the answer to match what I think you need, i.e. 1 maps to col2. Feel free to accept a solution that helped.

jpp Over a year ago

@SanMu, you can use np.where, e.g. np.where(df['col4'] > 1, ..., 0).
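One way the np.where suggestion could be filled in, as a sketch: compute the looked-up values for every row first, then keep them only where col4 > 1 and use 0 elsewhere. The name picked is hypothetical. Note that np.where receives the full picked array, so col4 must still hold a valid column position on the rows that end up as 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 5, 6, 7],
                   'col3': [3, 4, 5, 6, 7], 'col4': [1, 2, 3, 3, 2]})

# Look up the value at the column position given by col4, for every row.
picked = df.iloc[:, :4].to_numpy()[np.arange(len(df)), df['col4'].to_numpy()]

# Keep the looked-up value only where col4 > 1; otherwise use 0.
df['col5'] = np.where(df['col4'] > 1, picked, 0)

print(df)
```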
