Python pandas DataFrame: update values efficiently

Question

What is the fastest way to achieve the following?

I'm using a pandas DataFrame (NxN) and I want to iterate over each row and each element to check whether the element is greater than the row's mean. If it is greater, I want to change the element's value to 1.

I calculate the mean value using:

mean_value = df.loc[elementid].mean(axis=0)

but iterating over each element and checking whether it is >= mean_value with a nested loop is really slow.

You are accessing every element; what makes you think you can do better than O(nm)?
– Natecat, Commented Apr 7, 2016 at 17:59
I'm just hoping there is a function in pandas to set the value to 1 row-wise where the elements are greater than the mean.
– J-H, Commented Apr 7, 2016 at 18:02
That function would do exactly the same thing as doing it by hand. You are changing every element of the array, therefore you have to access every element of the array. You can't do it faster.
– Natecat, Commented Apr 7, 2016 at 18:03
I'm doing the loops in Python, and I thought pandas is partly written in Cython, or based on libraries written in Cython, and would therefore be faster.
– J-H, Commented Apr 7, 2016 at 18:04
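
For reference, the nested loop described in the question might look roughly like the sketch below. The function name `set_ones_loop` and the two-row sample frame are made up for illustration; the asker's actual code was not posted. It visits every element once, as the comments say, but each comparison and assignment runs in the Python interpreter, which is the part a vectorized pandas call can move into compiled NumPy-backed code.

```python
import pandas as pd


def set_ones_loop(df):
    """Return a copy of df where each element >= its row mean is set to 1.

    Hypothetical baseline: one Python-level comparison per element.
    """
    out = df.copy()
    for i in range(df.shape[0]):
        # Mean of the original (unmodified) row.
        row_mean = df.iloc[i].mean()
        for j in range(df.shape[1]):
            if df.iat[i, j] >= row_mean:
                out.iat[i, j] = 1
    return out


# Small made-up sample, not the asker's data.
sample = pd.DataFrame({"a": [0, 1], "b": [1, 0], "c": [2, 1]})
result = set_ones_loop(sample)
print(result)
```

The asymptotic cost is still O(nm); only the constant factor per element changes when the loop is replaced by whole-frame operations.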

jezrael · Accepted Answer · 2016-04-07 18:54:18Z

6

You can first compute the mean of each row, then compare against it with ge to build a boolean mask, and finally use mask to set the matching elements to 1:

print(df)
   a  b  c
0  0  1  2
1  0  1  2
2  1  1  2
3  1  0  1
4  1  1  2
5  0  0  1

mean_value = df.mean(axis=1)
print(mean_value)
0    1.000000
1    1.000000
2    1.333333
3    0.666667
4    1.333333
5    0.333333

mask = df.ge(mean_value, axis=0)
print(mask)
       a      b     c
0  False   True  True
1  False   True  True
2  False  False  True
3   True  False  True
4  False  False  True
5  False  False  True

print(df.mask(mask, 1))
   a  b  c
0  0  1  1
1  0  1  1
2  1  1  1
3  1  0  1
4  1  1  1
5  0  0  1
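
The steps above can be put together into a self-contained script. This is a minimal sketch that rebuilds the sample frame from the printed output; the names `row_means` and `result` are chosen here for illustration and are not from the original answer.

```python
import pandas as pd

# Sample data reconstructed from the printed frame in the answer.
df = pd.DataFrame({
    "a": [0, 0, 1, 1, 1, 0],
    "b": [1, 1, 1, 0, 1, 0],
    "c": [2, 2, 2, 1, 2, 1],
})

# Mean of each row (axis=1 reduces across the columns).
row_means = df.mean(axis=1)

# Compare every element with its own row's mean; axis=0 aligns the
# Series of row means with the row index rather than the column labels.
mask = df.ge(row_means, axis=0)

# Replace elements where the mask is True with 1; leave the rest as-is.
result = df.mask(mask, 1)
print(result)
```

Note that `df.mask` returns a new frame and leaves `df` unchanged, so assign the result back if the update should replace the original.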

edited Apr 7, 2016 at 18:54

answered Apr 7, 2016 at 18:03

jezrael


4 Comments

Zero Over a year ago

That was a neat use of mask and ge!

MaxU - stand with Ukraine Over a year ago

very elegant solution +1

Alexander Over a year ago

Looks good except for final result. Don't you just want df.mask(df.gt(df.mean(axis=1)), 1)?
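
A note on the strict "greater than" variant suggested above: as a hedged sketch, `gt` appears to need the same `axis=0` argument as `ge` in the answer, because the flexible comparison methods align a Series against the column labels by default. The sample frame below is made up to include a row whose elements all equal the row mean, which is the only case where `gt` and `ge` give different results.

```python
import pandas as pd

# Made-up data: the middle row (2, 2, 2) has every element equal to its mean.
df = pd.DataFrame({"a": [0, 2, 1], "b": [1, 2, 0], "c": [2, 2, 1]})
row_means = df.mean(axis=1)

# Strictly greater than the row mean ("greater", as in the question text).
strict = df.mask(df.gt(row_means, axis=0), 1)

# Greater than or equal to the row mean (as in the accepted answer).
inclusive = df.mask(df.ge(row_means, axis=0), 1)

print(strict)
print(inclusive)
```

Which of the two is wanted depends on whether elements exactly equal to the mean should be replaced; the question mentions both "greater" and ">=".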

jezrael Over a year ago

Glad I could help! Good luck!
