I have a dataframe (a very large one) that looks as follows:

id  class_number  a_1  a_2  a_3  a_4
 0             1    1    0    0    1
 1             1    1    1    0    1
 2             1    1    1    1    1
 3             1    1    0    2    1
 4             1    1    2    0    3

For the sake of completeness, here is a screenshot containing a larger cutout of this dataframe:

[Screenshot: larger cutout of the dataframe]

How can we replace all ones (all values 1) within the columns a_1 to a_1000 each with a random value other than 0, 1 and 2?

What I have tried so far works, but it does not seem elegant:

import random

cols = ["a_" + str(i) for i in range(1, 1000+1)]

# Replace every 1 with a random integer from 3 to 19, one column at a time
for col in cols:
    df[col] = df[col].apply(lambda x: random.choice(range(3, 20)) if x == 1 else x)
df.head()

I would be grateful for any hint on how to implement this in a more straightforward manner. Note that df[cols].apply(...) does not work, since it raises "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
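For reference, here is a minimal sketch of why the whole-frame call fails. DataFrame.apply hands each column to the function as a Series, so `x == 1` is a boolean Series and the `if` cannot reduce it to a single truth value. The tiny df and the helper name replace_one are made up for illustration. Applying the helper element by element inside each column is one way around the error, though probably not the fastest.

```python
import random

import pandas as pd

# Hypothetical miniature of the dataframe from the question.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "class_number": [1, 1, 1],
    "a_1": [1, 1, 1],
    "a_2": [0, 1, 1],
})
cols = ["a_1", "a_2"]


def replace_one(x):
    """Return a random integer from 3 to 19 if x is 1, otherwise x itself."""
    return random.choice(range(3, 20)) if x == 1 else x


# DataFrame.apply passes a whole column (a Series) to replace_one, so the
# condition `x == 1` is a Series and evaluating it in `if` raises ValueError.
try:
    df[cols].apply(replace_one)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Calling Series.apply inside DataFrame.apply feeds replace_one one scalar
# at a time, so every cell gets its own random draw.
result = df[cols].apply(lambda col: col.apply(replace_one))
print(result)
```

Note that result is a new frame; df is only updated if you assign it back with df[cols] = result.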

5 Comments

  • First include data as text and NOT as images. Commented Jun 19, 2022 at 18:43
  • Sorry - good point. Will fix it immediately. Commented Jun 19, 2022 at 18:45
  • df[cols].apply(lambda x: x.replace({1:random.choice(range(3, 20))}))? Commented Jun 19, 2022 at 18:47
  • df[cols].apply(lambda x: np.where(x==1, random.choice(range(3,20)), x)) Commented Jun 19, 2022 at 18:49
  • Thank you for these hints - do these work in place? Commented Jun 19, 2022 at 18:55
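On the in-place question, a small sketch (using a made-up four-row df) suggests two things about the replace suggestion above. First, apply returns a new frame and leaves df untouched, so the result has to be assigned back. Second, random.choice is evaluated once per column, so all ones within a column end up with the same replacement value, which may or may not be what is wanted.

```python
import random

import pandas as pd

# Hypothetical sample; only the a_* columns should change.
df = pd.DataFrame({
    "class_number": [1, 1, 1, 1],
    "a_1": [1, 1, 0, 1],
    "a_2": [1, 2, 1, 1],
})
cols = ["a_1", "a_2"]

random.seed(0)  # only to make the sketch repeatable
out = df[cols].apply(lambda col: col.replace({1: random.choice(range(3, 20))}))

# apply returned a new frame; df itself still contains the ones.
unchanged = df[cols].eq(1).any().any()

# random.choice ran once per column, so every replaced cell in a column
# received the same value: count distinct values among the replaced cells.
distinct_per_column = out.where(df[cols].eq(1)).nunique()
print(distinct_per_column)

# Assign back explicitly to keep the result.
df[cols] = out
```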

1 Answer

IIUC, you can use:

cols = df.filter(like='a_').columns

df[cols] = df.mask(df[cols].eq(1),
                   np.random.randint(3,1000,(df.shape[0], len(cols))))

reproducible example:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10,10)),
                  columns=[f'a_{i+1}' for i in range(10)])

output:

   a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9  a_10
0    5    0    3    3    7    9    3    5    2     4
1    7    6    8    8  957    6    7    7    8   380
2    5    9    8    9    4    3    0    3    5     0
3    2    3    8  785    3    3    3    7    0    89
4    9    9    0    4    7    3    2    7    2     0
5    0    4    5    5    6    8    4  592    4     9
6    8  773  518    7    9    9    3    6    7     2
7    0    3    5    9    4    4    6    4    4     3
8    4    4    8    4    3    7    5    5    0   846
9    5    9    3    0    5    0   28    2    4     2

5 Comments

Thank you for your code, which is very helpful. There seems to be a minor issue: it raises ValueError: other must be the same shape as self when an ndarray at the line df = . But it is a minor issue that is easy to fix.
weird, it should have the same shape because of (df.shape[0], len(cols))
I committed the notebook showing the error to GitHub. The dataset is also publicly available on Kaggle.
I checked the shapes - they look equal: print(df.shape) and print(df.mask(df[cols].eq(1)).shape) both yield (360591, 1006).
Yes - that is a very good starting point for me to proceed. Your answer is great and efficient. Thank you again. I will figure out the last minor issue myself. The reason might be that I have one extra column, namely class_number, which is not among the columns to be replaced.
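The guess about the extra column looks plausible. As a hedged sketch on a made-up six-row frame: when df has columns outside cols, calling mask on the whole df compares a replacement array of shape (rows, len(cols)) against the full frame's shape, and the shapes differ. Calling mask on df[cols] instead keeps both shapes equal. The column name class_number is taken from the question. The generator rng and the sample sizes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame: three a_* columns plus one column that must stay as is.
df = pd.DataFrame(rng.integers(0, 4, (6, 3)), columns=["a_1", "a_2", "a_3"])
df.insert(0, "class_number", 1)

cols = df.filter(like="a_").columns
replacement = rng.integers(3, 1000, (df.shape[0], len(cols)))

# Masking the whole frame: replacement is (6, 3) but df is (6, 4).
try:
    df.mask(df[cols].eq(1), replacement)
    shape_error = False
except ValueError as err:
    shape_error = True
    print(err)

# Masking only the selected columns keeps the shapes aligned.
df[cols] = df[cols].mask(df[cols].eq(1), replacement)
print(df)
```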
