1

I want to make a new column based on two variables. The new column should have the value "Good" if (column 1 >= 0.5 or column 2 < 0.5) and (column 1 < 0.5 or column 2 >= 0.5), and "Bad" otherwise.

I tried using lambda and if.

df["new column"] = df[["column 1", "column 2"]].apply(
    lambda x, y: "Good" if (x >= 0.5 or y < 0.5) and (x < 0.5 or y >= 0.5) else "Bad"
)

Got

TypeError: ("() missing 1 required positional argument: 'y'", 'occurred at index column 1')

4 Answers

5

Use np.where. pandas operations are vectorised and align data on the index, so you don't need apply or a row-by-row loop:

df['new column'] = np.where(((df['x'] >= .5) | (df['y'] < .5)) & ((df['x'] < .5) | (df['y'] >= .5)), 'Good', 'Bad')
df

Using @YunaA.'s setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 0.1, 0.1],
                   'y': [1, 2, 0.7, 0.2],
                   'column 3': [1, 2, 3, 4]})

# Same condition as in the question, with x = column 1 and y = column 2
df['new column'] = np.where(((df['x'] >= .5) | (df['y'] < .5)) & ((df['x'] < .5) | (df['y'] >= .5)), 'Good', 'Bad')
df

Output:

     x    y  column 3 new column
0  1.0  1.0         1       Good
1  2.0  2.0         2       Good
2  0.1  0.7         3        Bad
3  0.1  0.2         4       Good
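
The condition can also be read as "both columns fall on the same side of 0.5". Below is a minimal sketch of that reading, assuming the same x/y setup as above and no NaN values (with NaN the two forms can differ, because every comparison against NaN is False). The names full, same_side and labels are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 0.1, 0.1],
                   'y': [1, 2, 0.7, 0.2]})

# The condition from the question, written out in full.
full = ((df['x'] >= .5) | (df['y'] < .5)) & ((df['x'] < .5) | (df['y'] >= .5))

# Equivalent form for non-NaN data: both values on the same side of 0.5.
same_side = (df['x'] >= .5) == (df['y'] >= .5)

labels = np.where(same_side, 'Good', 'Bad')
print(labels.tolist())  # ['Good', 'Good', 'Bad', 'Good']
```

Whether the shorter form is clearer is a matter of taste; the full form stays closer to how the question states the rule.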

Timings:

import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'x': np.random.random(100) * 2,
                   'y': np.random.random(100) * 1})

def update_column(row):
    # Row-wise version of the condition from the question
    if (row['x'] >= .5 or row['y'] < .5) and (row['x'] < .5 or row['y'] >= .5):
        return "Good"
    return "Bad"

Results

%timeit df['new column'] = np.where(((df['x'] >= .5) | (df['y'] < .5)) & ((df['x'] < .5) | (df['y'] >= .5)), 'Good', 'Bad')

1.45 ms ± 72.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['new_column'] = df.apply(update_column, axis=1)

5.83 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
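
%timeit only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used instead. The helper names vectorised and row_wise and the repeat count n are assumptions made for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'x': np.random.random(100) * 2,
                   'y': np.random.random(100)})

def update_column(row):
    # Row-wise version of the condition from the question.
    if (row['x'] >= .5 or row['y'] < .5) and (row['x'] < .5 or row['y'] >= .5):
        return "Good"
    return "Bad"

def vectorised():
    # Whole-column comparison; no Python-level loop over rows.
    cond = ((df['x'] >= .5) | (df['y'] < .5)) & ((df['x'] < .5) | (df['y'] >= .5))
    return np.where(cond, 'Good', 'Bad')

def row_wise():
    # Calls update_column once per row.
    return df.apply(update_column, axis=1)

n = 20  # number of repetitions per measurement (arbitrary choice)
t_vec = timeit.timeit(vectorised, number=n) / n
t_row = timeit.timeit(row_wise, number=n) / n

print(f"np.where: {t_vec * 1e3:.3f} ms per call")
print(f"apply:    {t_row * 1e3:.3f} ms per call")
```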


2

Try this:

import pandas as pd 

def update_column(row):
    # Assumes the two columns are named 'x' and 'y'
    if (row['x'] >= .5 or row['y'] < .5) and (row['x'] < .5 or row['y'] >= .5):
        return "Good"
    return "Bad"

df['new_column'] = df.apply(update_column, axis=1)

3 Comments

Why loop if there is a vectorised option?
Sure, there are a few different ways to solve this problem.
But looping is generally a lot slower, and apply is hardly faster than a Python loop. Here np.where is faster and just as expressive. In the long run it also pays off to get to know the tools.

2

Pass the row into the lambda instead.

df['new column'] = df[['column 1', 'column 2']].apply(lambda row: "Good" if (row['column 1'] >= .5 or row['column 2'] < .5) and (row['column 1'] < .5 or row['column 2'] >= .5) else "Bad", axis=1)

Example:

import pandas as pd

df = pd.DataFrame({'column 1': [1, 2, 0.1, 0.1], 
                   'column 2': [1, 2, 0.7, 0.2], 
                   'column 3': [1, 2, 3, 4]})
df['new column'] = df[['column 1', 'column 2']].apply(lambda row: "Good" if (row['column 1'] >= .5 or row['column 2'] < .5) and (row['column 1'] < .5 or row['column 2'] >= .5) else "Bad", axis=1)

print(df)

Output:

   column 1  column 2  column 3 new column
0       1.0       1.0         1       Good
1       2.0       2.0         2       Good
2       0.1       0.7         3        Bad
3       0.1       0.2         4       Good


0

You just need to reference the columns by their position in the row that is passed to the lambda expression, like this:

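df["new column"] = df[["column 1", "column 2"]].apply(
    # .iloc makes the positional lookup explicit; plain x[0] on a label-indexed row is deprecated in recent pandas
    lambda x: "Good" if (x.iloc[0] >= 0.5 or x.iloc[1] < 0.5) and (x.iloc[0] < 0.5 or x.iloc[1] >= 0.5) else "Bad", axis=1
)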

NOTE: don't forget to include axis=1
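
To see why axis=1 matters, it may help to look at what apply hands to the function in each case. This is a small sketch with a made-up two-row frame: by default the function receives one whole column at a time (a single argument, which is why the two-argument lambda in the question fails), while with axis=1 it receives one row at a time. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'column 1': [1, 0.1],
                   'column 2': [1, 0.7]})

# Default (axis=0): the function is called once per column,
# and the Series it receives is named after that column.
passed_by_default = df.apply(lambda s: s.name).tolist()

# axis=1: the function is called once per row,
# and the Series it receives is named after the row label.
passed_with_axis1 = df.apply(lambda s: s.name, axis=1).tolist()

# Each row Series is indexed by the column labels, so both values are reachable.
fields_in_row = df.apply(lambda row: ", ".join(row.index), axis=1).tolist()

print(passed_by_default)  # ['column 1', 'column 2']
print(passed_with_axis1)  # [0, 1]
print(fields_in_row)      # ['column 1, column 2', 'column 1, column 2']
```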
