I'm trying to replace values in multiple columns of a dataframe with numpy.where in Python by doing the following:

df['X, Y, Z'] = np.where(df['X, Y, Z'] < 1, 0, df['X, Y, Z'])

However, it gives me the following error: KeyError: 'X, Y, Z'

I have already tried doing the strings separately, like 'X', 'Y', 'Z', but it doesn't work either.

How do I resolve this?

1 Answer

Pass a list of column labels, ['X', 'Y', 'Z'], to your dataframe instead of the single long string 'X, Y, Z'. For example, with this sample data:

import numpy as np
import pandas as pd

data = {'X': np.linspace(0,2,8), 'Y': np.linspace(0,2,8)*2, 'Z': np.linspace(0,2,8)*4}

df = pd.DataFrame.from_dict(data)

which gives:

>>> df
          X         Y         Z
0  0.000000  0.000000  0.000000
1  0.285714  0.571429  1.142857
2  0.571429  1.142857  2.285714
3  0.857143  1.714286  3.428571
4  1.142857  2.285714  4.571429
5  1.428571  2.857143  5.714286
6  1.714286  3.428571  6.857143
7  2.000000  4.000000  8.000000

df[['X', 'Y', 'Z']] = np.where(df[['X', 'Y', 'Z']] < 1, 0, df[['X', 'Y', 'Z']])

and now there is no KeyError:

>>> df
          X         Y         Z
0  0.000000  0.000000  0.000000
1  0.000000  0.000000  1.142857
2  0.000000  1.142857  2.285714
3  0.000000  1.714286  3.428571
4  1.142857  2.285714  4.571429
5  1.428571  2.857143  5.714286
6  1.714286  3.428571  6.857143
7  2.000000  4.000000  8.000000
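
For reference, the steps above can be collected into one runnable sketch. The names base, cols and alt are illustrative and do not appear in the original answer. The DataFrame.mask line is a different technique, included only as a cross-check against np.where, not as the method used above.

```python
import numpy as np
import pandas as pd

# Same sample data as above: 8 evenly spaced values in [0, 2], scaled per column
base = np.linspace(0, 2, 8)
df = pd.DataFrame({'X': base, 'Y': base * 2, 'Z': base * 4})

cols = ['X', 'Y', 'Z']  # illustrative name: a list of column labels

# Cross-check (not the answer's method): mask replaces values where the condition is True
alt = df[cols].mask(df[cols] < 1, 0)

# The answer's approach: np.where returns a 2-D array that is assigned back to the columns
df[cols] = np.where(df[cols] < 1, 0, df[cols])

print(df)
print((df[cols].to_numpy() == alt.to_numpy()).all())
```

Keeping the labels in a single list such as cols avoids repeating ['X', 'Y', 'Z'] three times in the np.where line.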
5 Comments

I'm sorry but this solution seems a bit confusing to me. I find it weird because my method works if I only put 'X' as the column instead of all three, so I don't know why that's happening.
df['X', 'Y', 'Z'] is not the same as df[['X', 'Y', 'Z']], that is why. In the single-column case both df['X'] and df[['X']] work here (the first returns a Series, the second a one-column DataFrame). pandas doesn't split the string 'X, Y, Z' into three distinct keys, and 'X', 'Y', 'Z' without the inner brackets is treated as one tuple key, so both raise a KeyError. To select several columns it needs a list of keys, ['X', 'Y', 'Z']. Hope it's clear now.
I understand why my method doesn't work, but the method you use still confuses me. The weird thing is that I remember my code working last week, when I used this same method with multiple columns on the same dataframe, so I don't know what I did differently now.
Will do. Can you maybe just explain why you chose the values 0, 2, 8 and why you multiply by 2 and 4?
In order to generate small dummy columns of 8 samples that span a range between 0 and 2 where your test condition is meaningful (since you failed to provide a Minimal Reproducible Example). I multiplied by 2 and 4 to get different columns where the condition applied differently. This is arbitrary but shows you that it works for various data :)
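
The difference between the indexing forms discussed in the comments can be checked directly. This is a minimal sketch on a made-up two-row dataframe; the helper lookup_fails and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Made-up data with three ordinary (non-MultiIndex) columns
df = pd.DataFrame({'X': [0.5, 1.5], 'Y': [0.2, 2.0], 'Z': [3.0, 0.1]})

def lookup_fails(key):
    """Return True if df[key] raises KeyError (hypothetical helper)."""
    try:
        df[key]
    except KeyError:
        return True
    return False

# One string: pandas looks for a single column literally named 'X, Y, Z'
single_string_fails = lookup_fails('X, Y, Z')

# df['X', 'Y', 'Z'] passes the tuple ('X', 'Y', 'Z') as one key
tuple_fails = lookup_fails(('X', 'Y', 'Z'))

# A list of labels selects several columns and returns a DataFrame
list_result = df[['X', 'Y', 'Z']]

# Single column: a string gives a Series, a one-element list gives a DataFrame
series = df['X']
frame = df[['X']]

print(single_string_fails, tuple_fails)
print(type(series).__name__, type(frame).__name__)
```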