How to replace a certain value across a list of dataframe columns with random values efficiently?

Question

I have a dataframe (a very large one) that looks as follows:

id	class_number	a_1	a_2	a_3	a_4
0	1	1	0	0	1
1	1	1	1	0	1
2	1	1	1	1	1
3	1	1	0	2	1
4	1	1	2	0	3

For the sake of completeness, here is a screenshot containing a larger cutout of this dataframe:

How can we replace all ones (all values 1) within the columns a_1 to a_1000 each with a random value other than 0, 1 and 2?

What I have tried so far works but is not elegant:

import random

cols = ["a_" + str(i) for i in range(1, 1000 + 1)]

for col in cols:
    df[col] = df[col].apply(lambda x: random.choice(range(3, 20)) if x == 1 else x)
df.head()

I would be grateful for any hint on how to implement this in a more straightforward manner. Note that df[cols].apply(...) does not work; it raises "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
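The error arises because DataFrame.apply hands each column to the function as a whole Series, so `x == 1` is a boolean Series and the `if` cannot reduce it to a single truth value. A minimal sketch on a small made-up frame (the names `demo` and `replace_one` are illustrative and not part of the question):

```python
import random

import pandas as pd

# Small stand-in for the real data (values are made up for illustration).
demo = pd.DataFrame({"a_1": [1, 0, 1], "a_2": [2, 1, 3]})


def replace_one(x):
    # Written for a single scalar cell, like the lambda used in the loop.
    return random.choice(range(3, 20)) if x == 1 else x


# DataFrame.apply passes a whole column (a Series) to the function,
# so `x == 1` is a Series and `if` cannot evaluate it.
try:
    demo.apply(replace_one)
    raised = False
except ValueError:
    raised = True

# Series.apply passes one cell at a time, so the same function works.
fixed = demo.copy()
for col in fixed.columns:
    fixed[col] = fixed[col].apply(replace_one)
```

Here `raised` ends up True, while `fixed` no longer contains any 1 and leaves all other cells unchanged.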

df[cols].apply(lambda x: x.replace({1:random.choice(range(3, 20))}))?
– Onyambu, Commented Jun 19, 2022 at 18:47
df[cols].apply(lambda x: np.where(x==1, random.choice(range(3,20)), x))
– Onyambu, Commented Jun 19, 2022 at 18:49
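
One caveat with both suggestions: `random.choice` is evaluated once per call of the lambda, that is, once per column, so every 1 within a column receives the same replacement value. A small sketch of that behaviour, using the first suggestion on a made-up frame (`demo`, `per_column` and `distinct` are illustrative names):

```python
import random

import pandas as pd

random.seed(0)  # arbitrary seed, only for repeatability
demo = pd.DataFrame({"a_1": [1, 1, 1, 0], "a_2": [1, 2, 1, 1]})

# One random.choice call per column: all 1s in a column share one value.
per_column = demo.apply(lambda x: x.replace({1: random.choice(range(3, 20))}))

# Count the distinct replacement values that ended up in each column.
distinct = {
    col: per_column.loc[demo[col] == 1, col].nunique() for col in demo
}
```

Each entry of `distinct` is 1 here. If every cell should get its own draw, an array with one random value per cell is needed instead.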

mozway · Accepted Answer · 2022-06-19 18:46:57Z

IIUC, you can use:

cols = df.filter(like='a_').columns

df[cols] = df[cols].mask(df[cols].eq(1),
                         np.random.randint(3, 1000, (df.shape[0], len(cols))))

reproducible example:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10,10)),
                  columns=[f'a_{i+1}' for i in range(10)])

output after applying the snippet above:

   a_1  a_2  a_3  a_4  a_5  a_6  a_7  a_8  a_9  a_10
0    5    0    3    3    7    9    3    5    2     4
1    7    6    8    8  957    6    7    7    8   380
2    5    9    8    9    4    3    0    3    5     0
3    2    3    8  785    3    3    3    7    0    89
4    9    9    0    4    7    3    2    7    2     0
5    0    4    5    5    6    8    4  592    4     9
6    8  773  518    7    9    9    3    6    7     2
7    0    3    5    9    4    4    6    4    4     3
8    4    4    8    4    3    7    5    5    0   846
9    5    9    3    0    5    0   28    2    4     2
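
The snippet above draws replacements from 3 to 999, whereas the question's own attempt used range(3, 20). A sketch of the same mask-based idea with that narrower range, using NumPy's Generator API; the seed, the frame contents and the name `random_values` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# Illustrative frame containing only a_* columns.
df = pd.DataFrame({
    "a_1": [1, 1, 1, 0],
    "a_2": [0, 1, 2, 1],
    "a_3": [3, 1, 0, 2],
})

cols = df.filter(like="a_").columns

# One independent draw per cell, from 3 to 19 (the upper bound is exclusive).
random_values = rng.integers(3, 20, size=(len(df), len(cols)))

# Replace only the cells equal to 1; all other cells keep their value.
df[cols] = df[cols].mask(df[cols].eq(1), random_values)
```

Because the random array has one entry per cell, each 1 gets its own draw, and draws that land on cells not equal to 1 are simply discarded.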

5 Comments

Eldar Sultanow Over a year ago

Thank you for your code, which is very helpful. There seems to be a minor issue: it raises "ValueError: other must be the same shape as self when an ndarray" at the line df = ... But it is a minor issue that is easy to fix.

mozway Over a year ago

weird, it should have the same shape because of (df.shape[0], len(cols))…

Eldar Sultanow Over a year ago

I committed a notebook showing the error to GitHub. The dataset is publicly available on Kaggle.

Eldar Sultanow Over a year ago

I checked the shapes and they look equal: print(df.shape) and print(df.mask(df[cols].eq(1)).shape) both yield (360591, 1006).

Eldar Sultanow Over a year ago

Yes, that is a very good starting point for me to proceed. Your answer is great and efficient. Thank you again. I will figure out the last minor issue myself. The reason might be that I have one extra column, namely class_number, which is not among the columns under consideration.
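
The shape error discussed in these comments is consistent with the explanation offered in the last one: if mask is called on the full frame while the replacement array only has len(cols) columns, the shapes differ whenever the frame carries extra columns such as class_number. A sketch with made-up data, assuming that this is indeed the cause (the names `replacement` and `error_message` are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "id": [0, 1, 2],
    "class_number": [1, 1, 1],
    "a_1": [1, 0, 1],
    "a_2": [1, 1, 2],
})
cols = df.filter(like="a_").columns
replacement = np.random.randint(3, 1000, (df.shape[0], len(cols)))  # shape (3, 2)

# Masking the full frame: df is (3, 4) but the replacement is (3, 2).
try:
    df.mask(df[cols].eq(1), replacement)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Masking only the selected columns keeps both shapes at (3, 2).
df[cols] = df[cols].mask(df[cols].eq(1), replacement)
```

After the second call, class_number still holds its 1s, since it is not part of `cols`, and no 1 remains in the a_* columns.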
