1

I'm trying to create column C, based on the values in columns A and B given the following conditions:

if A < 5000: C = A * B
else: C = A

The following gives a syntax error:

df['C'] = df.apply(lambda x (x['A'] * x['B)'] if x['A'] < 5000 else x = x['A']),axis=1)

How far off am I?

3 Answers

4

Use vectorized numpy.where:

df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
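A minimal, self-contained sketch of the same call on a small frame. The sample values are made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): two rows below 5000, two at or above it.
df = pd.DataFrame({'A': [100, 4999, 5000, 7000], 'B': [2, 3, 4, 5]})

# Where A < 5000 take A * B, otherwise keep A unchanged.
df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])

print(df['C'].tolist())  # [200, 14997, 5000, 7000]
```

np.where evaluates both branches for the whole column and then picks per row, so no Python-level loop is involved.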

Performance:

import numpy as np
import pandas as pd

np.random.seed(2019)

N = 1000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

In [56]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
536 µs ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [57]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
30.9 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

N = 100000
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

In [59]: %timeit df['C'] = np.where(df['A'] < 5000, df['A'] * df['B'], df['A'])
1.29 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [60]: %timeit df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
3.32 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

6 Comments

Much appreciated!
.where is definitely more efficient than applying lambda. Thanks!
Oh, that's indeed a big difference in performance. Wasn't aware that it's that slow.
@displayname - yes, the problem is that apply is a loop under the hood.
@jezrael I see - thanks for the info. Will consider that in the future :) So I guess I should avoid apply whenever I can.
1

I think you'd want something like

df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)

Complete example:

import pandas as pd
import numpy as np

N = 10
data = np.asarray([np.random.rand(N).tolist(), list(range(N))]).T
df = pd.DataFrame(data, columns=['A', 'B'])

df['C'] = df.apply(lambda x: x.A * x.B if x.A > 0.5 else x.A, 1)
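With the condition from the question (A < 5000, rather than the 0.5 threshold used for the random data above), the corrected lambda might look like the sketch below. The original attempt was missing the colon after `lambda x`, had a misplaced quote in `x['B)']`, and used an assignment (`x = x['A']`) in the else branch, which is not allowed inside an expression. The sample values are made up for illustration:

```python
import pandas as pd

# Illustrative data (an assumption, not from the question).
df = pd.DataFrame({'A': [100, 4999, 5000, 7000], 'B': [2, 3, 4, 5]})

# A lambda needs a colon after its argument, and both branches of the
# conditional expression must be plain values (no assignment).
df['C'] = df.apply(lambda x: x['A'] * x['B'] if x['A'] < 5000 else x['A'], axis=1)

print(df['C'].tolist())  # [200, 14997, 5000, 7000]
```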

1 Comment

Thanks! So my error basically boiled down to an equal sign, haha!
0

I'm sure the solutions provided before this one are better but I solved it a third way. The dataset is rather small so it'll do for now.

multiply = df['A'] * df['B']
df['C'] = multiply.where(df['A'] < 5000, other=df['A'])
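For readers unfamiliar with Series.where: it keeps the existing value where the condition is True and takes the value from `other` where it is False. A small sketch of this approach, with made-up sample values:

```python
import pandas as pd

# Illustrative data (an assumption, not from the question).
df = pd.DataFrame({'A': [100, 4999, 5000, 7000], 'B': [2, 3, 4, 5]})

# Compute the product for every row first ...
multiply = df['A'] * df['B']

# ... then keep it only where A < 5000 and fall back to A elsewhere.
df['C'] = multiply.where(df['A'] < 5000, other=df['A'])

print(df['C'].tolist())  # [200, 14997, 5000, 7000]
```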
