Pandas: Conditional column creation

Question

I'm trying to create column C, based on the values in columns A and B given the following conditions:

if A < 5000: C = A * B
else: C = A

The following gives a syntax error:

df['C'] = df.apply(lambda x (x['A'] * x['B)'] if x['A'] < 5000 else x = x['A']),axis=1)

How far off am I?

jezrael · Accepted Answer · 2019-01-10 08:00:32Z

4

Use vectorized numpy.where:

df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
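As a small self-contained illustration of the line above, here is a sketch on made-up data. The sample values are hypothetical and chosen only so that rows fall on both sides of the 5000 threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are for illustration only.
df = pd.DataFrame({"A": [1000, 4999, 5000, 7000], "B": [2, 3, 4, 5]})

# Where A < 5000 take A * B, otherwise keep A unchanged.
df["C"] = np.where(df["A"] < 5000, df["A"] * df["B"], df["A"])

print(df["C"].tolist())  # [2000, 14997, 5000, 7000]
```

np.where evaluates both branches for the whole column and then picks per row, so both expressions must be valid for every row.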

Performance:

np.random.seed(2019)

N = 1000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

In [56]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
536 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [57]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
30.9 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

N = 100000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

In [59]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
1.29 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [60]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
3.32 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
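Note that the timings above compare np.where with the condition A < 5000 against apply with the condition A > 0.5, and A is drawn from [0, 1), so the first condition is always true. A rough sketch of a like-for-like comparison, using the same condition for both and the standard-library timeit module in place of %timeit, might look like this. The helper names with_where and with_apply are made up for this example, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same data layout as the benchmark: A is uniform in [0, 1), B is 0..N-1.
np.random.seed(2019)
N = 1000
df = pd.DataFrame({"A": np.random.rand(N), "B": np.arange(N, dtype=float)})


def with_where(frame):
    # Vectorized: the condition and both branches are computed column-wise.
    return np.where(frame["A"] > 0.5, frame["A"] * frame["B"], frame["A"])


def with_apply(frame):
    # Row-wise: the lambda is called once per row.
    return frame.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, axis=1)


# Both approaches should produce the same values.
same = np.allclose(with_where(df), with_apply(df).to_numpy())

# Total seconds for a few repetitions of each approach.
t_where = timeit.timeit(lambda: with_where(df), number=5)
t_apply = timeit.timeit(lambda: with_apply(df), number=5)
print(same, t_where, t_apply)
```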

edited Jan 10, 2019 at 8:00

answered Jan 10, 2019 at 7:54

jezrael


6 Comments

Preben Brudvik Olsen Over a year ago

Much appreciated!

Preben Brudvik Olsen Over a year ago

.where is definitely more efficient than applying lambda. Thanks!

Stefan Falk Over a year ago

Oh, that's indeed a big difference in performance. Wasn't aware that it's that slow.

jezrael Over a year ago

@displayname - yes, the problem is that apply is a loop under the hood.

Stefan Falk Over a year ago

@jezrael I see - thanks for the info. Will consider that in the future :) So whenever I can avoid apply I should avoid it I guess.


Stefan Falk · Accepted Answer · 2019-01-10 07:57:41Z

1

I think you'd want something like

df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)

Complete example:

import pandas as pd
import numpy as np

N = 10
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
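For reference, a sketch of the question's own lambda with its syntax repaired and the original A < 5000 condition kept. The sample data is hypothetical. Three things differ from the attempt in the question: a colon after "lambda x", the quote in x['B'] closed before the bracket, and no assignment in the else branch.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [1000, 6000], "B": [3, 4]})

# A conditional expression returns a value, so the else branch is just x["A"].
df["C"] = df.apply(
    lambda x: x["A"] * x["B"] if x["A"] < 5000 else x["A"],
    axis=1,
)

print(df["C"].tolist())  # [3000, 6000]
```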

answered Jan 10, 2019 at 7:57

Stefan Falk


1 Comment

Preben Brudvik Olsen Over a year ago

Thanks! So my error basically boiled down to an equal sign, haha!

Preben Brudvik Olsen · Accepted Answer · 2019-01-10 08:24:34Z

0

I'm sure the solutions provided before this one are better but I solved it a third way. The dataset is rather small so it'll do for now.

multiply = df['A'] * df['B']
df['C'] = multiply.where(df['A'] < 5000, other=df['A'])
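A short sketch of how this behaves, on hypothetical sample data: Series.where keeps the original value wherever the condition is True and takes the value from other wherever it is False.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [1000, 6000], "B": [3, 4]})

# Compute the product for every row first.
multiply = df["A"] * df["B"]

# Keep the product where A < 5000; elsewhere fall back to A.
df["C"] = multiply.where(df["A"] < 5000, other=df["A"])

print(df["C"].tolist())  # [3000, 6000]
```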

answered Jan 10, 2019 at 8:24

Preben Brudvik Olsen
