I have the following code, which I think is highly inefficient. Is there a better way to do this type of common recoding in pandas?

df['F'] = 0
df['F'][(df['B'] >=3) & (df['C'] >=4.35)] = 1
df['F'][(df['B'] >=3) & (df['C'] < 4.35)] = 2
df['F'][(df['B'] < 3) & (df['C'] >=4.35)] = 3
df['F'][(df['B'] < 3) & (df['C'] < 4.35)] = 4

2 Answers

Use numpy.select, and store the boolean masks in variables so that each comparison is computed only once:

import numpy as np

m1 = df['B'] >= 3
m2 = df['C'] >= 4.35
m3 = df['C'] < 4.35
m4 = df['B'] < 3

df['F'] = np.select([m1 & m2, m1 & m3, m4 & m2, m4 & m3], [1,2,3,4], default=0)
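For reference, here is a minimal self-contained sketch of the same approach. The sample values are made up for illustration; only the column names B, C, F and the thresholds come from the question. Rows that match none of the conditions (for example, a NaN in B) fall through to the default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has a missing B value.
df = pd.DataFrame({
    "B": [5, 3, 1, 2, np.nan],
    "C": [4.35, 1.0, 9.9, 0.5, 2.0],
})

# Compute each comparison once and reuse it.
m1 = df["B"] >= 3
m2 = df["C"] >= 4.35
m3 = df["C"] < 4.35
m4 = df["B"] < 3

# np.select picks the choice for the first condition that is True per row.
conditions = [m1 & m2, m1 & m3, m4 & m2, m4 & m3]
choices = [1, 2, 3, 4]
df["F"] = np.select(conditions, choices, default=0)

print(df)
```

Because the choices are an ordinary list, the codes do not have to be consecutive integers; any values of a common type should work.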

1 Comment

Good one! I like it

In your specific case, you can make use of the fact that booleans are actually integers (False == 0, True == 1) and use simple arithmetic:

df['F'] = 1 + (df['C'] < 4.35) + 2 * (df['B'] < 3)

Note that this does not treat NaNs in your B and C columns specially. A comparison with NaN evaluates to False, so those rows are assigned as if the value were at or above your limit.
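The NaN behaviour can be seen in a small sketch. The sample values are invented, and resetting incomplete rows to 0 is only one possible choice (it mirrors the default of 0 in the question's code), not something this answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values in both columns.
df = pd.DataFrame({
    "B": [5, 1, np.nan],
    "C": [1.0, np.nan, 9.9],
})

# Booleans act as 0/1, so the two comparisons encode the four categories.
codes = 1 + (df["C"] < 4.35) + 2 * (df["B"] < 3)

# NaN < x is False, so rows with NaN are coded as if they were above the limit.
print(codes.tolist())

# Assumption: rows with any missing input should get 0 instead.
valid = df[["B", "C"]].notna().all(axis=1)
df["F"] = codes.where(valid, 0)

print(df)
```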

2 Comments

Clever, thanks for the solution. I am looking for a generic solution because we do this type of data processing all the time, and sometimes the codes may not be arithmetically aligned as 1, 2, 3, 4.
This answer is in some sense more general, because it is easier to add more columns (by using 4 *, 8 *, etc.) without having to write out all combinations of masks, whose number grows exponentially with the number of columns.
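A rough sketch of how the two comments might be combined: each comparison contributes one bit to an integer index, and a lookup array maps that index to arbitrary codes. The third column D, its threshold 0.5, and the label values are all hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; D is an invented third column.
df = pd.DataFrame({
    "B": [5, 5, 1, 1],
    "C": [9.0, 1.0, 9.0, 1.0],
    "D": [0.2, 0.9, 0.9, 0.2],
})

# One bit per comparison: three columns give an index in the range 0..7.
idx = (df["C"] < 4.35) + 2 * (df["B"] < 3) + 4 * (df["D"] < 0.5)

# Arbitrary codes, one per index value; they need not be consecutive.
labels = np.array([10, 20, 30, 40, 50, 60, 70, 80])
df["F"] = labels[idx.to_numpy()]

print(df)
```

The lookup array still has one entry per combination, but the masks themselves no longer have to be written out by hand.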
