numpy.sum behaves differently on numpy.array vs pandas.DataFrame

Question

In short, numpy.sum(a, axis=None) sums all cells of an array, but sums over the rows of a data frame, giving one total per column. I thought that pandas.DataFrame was built on top of numpy.array and should not behave differently. What conversion happens under the hood?

import numpy
import pandas

a1 = numpy.random.random((3,2))
a2 = pandas.DataFrame(a1)
numpy.sum(a1) # Sums all cells
numpy.sum(a2) # Sums over rows

Odd, it looks like axis=None is being overridden and set to axis=0 when it goes through the df
– EdChum, Commented Mar 1, 2015 at 20:31
OK, I've tracked this down to line 3980 of github.com/pydata/pandas/blob/master/pandas/core/generic.py: because axis is None, it is assigned self._stat_axis_number, which is 0, hence the difference in behaviour
– EdChum, Commented Mar 1, 2015 at 20:43
Starting with pandas 2.0, pandas also sums all cells. "For DataFrames, specifying axis=None will apply the aggregation across both axes." pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html
– fantabolous, Commented Jun 12, 2023 at 3:05

EdChum · Accepted Answer · 2015-03-01 20:46:36Z

OK, the following is a dump of my pdb debugging session, which shows how the call ends up in pandas land:

In [*]:

a1 = np.random.random((3,2))
import pdb
a2 = pd.DataFrame(a1)
print(np.sum(a1)) # Sums all cells
pdb.set_trace()
np.sum(a2) # Sums over rows
3.02993889742
--Return--
> <ipython-input-50-92405dd4ed52>(5)<module>()->None
-> pdb.set_trace()
(Pdb) b 6
Breakpoint 2 at <ipython-input-50-92405dd4ed52>:6
(Pdb) c
> <ipython-input-50-92405dd4ed52>(6)<module>()->None
-> np.sum(a2) # Sums over rows
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1623)sum()
-> def sum(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) print(axis)
None
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1700)sum()
-> if isinstance(a, _gentype):
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1706)sum()
-> elif type(a) is not mu.ndarray:
(Pdb) sssssss
*** NameError: name 'sssssss' is not defined
(Pdb) ss
*** NameError: name 'ss' is not defined
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1707)sum()
-> try:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1708)sum()
-> sum = a.sum
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1713)sum()
-> return sum(axis=axis, dtype=dtype, out=out)
(Pdb) print(axis)
None
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3973)stat_func()
-> @Substitution(outname=name, desc=desc)
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3977)stat_func()
-> if skipna is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3978)stat_func()
-> skipna = True
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3979)stat_func()
-> if axis is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3980)stat_func()
-> axis = self._stat_axis_number
(Pdb) print(self._stat_axis_number)
0
(Pdb)

So once the call ends up in pandas land there are some integrity checks, one of which is that if axis is None it is assigned the value of self._stat_axis_number, which is 0, hence the difference in behaviour. I'm not a pandas dev, so they may be able to shed more light on this, but it explains the difference in output.
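
The delegation seen in the trace can be reproduced without pandas. This is a minimal sketch, not pandas' actual code: AxisDefaultingTable is a hypothetical class whose sum method replaces axis=None with 0, the way stat_func does in the trace. It assumes that np.sum still hands objects that are not plain ndarrays to their own sum method, as fromnumeric.py does above.

```python
import numpy as np


class AxisDefaultingTable:
    """Hypothetical stand-in for a DataFrame-like container (not a pandas class)."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def sum(self, axis=None, out=None, **kwargs):
        # Mimic the check from the trace: a missing axis becomes axis 0.
        # Extra keyword arguments that np.sum may forward (e.g. dtype) are ignored.
        if axis is None:
            axis = 0
        return self.values.sum(axis=axis)


data = np.arange(6.0).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
table = AxisDefaultingTable(data)

total = np.sum(data)        # plain ndarray: reduces over every cell
delegated = np.sum(table)   # np.sum calls table.sum(axis=None, ...)

print(total)
print(delegated)
```

The wrapped object decides what axis=None means, so the same np.sum call yields a scalar for the array and one value per column for the container.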

To get the same output as for the array, you have to call sum twice:

In [6]:

a2.sum(axis=0).sum()
Out[6]:
3.9180334059883006

Or

In [7]:

np.sum(np.sum(a2))
Out[7]:
3.9180334059883006
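
What np.sum(a2) returns depends on how the installed pandas treats axis=None (see the comment about pandas 2.0 above). A sketch that should not depend on the version is to reduce over the underlying array, or to name the axis explicitly. The fixed values below are illustrative, chosen so the totals are easy to check by hand.

```python
import numpy as np
import pandas as pd

a1 = np.arange(6.0).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
a2 = pd.DataFrame(a1)

# Grand total: drop down to the ndarray, where sum() reduces over all cells.
total_via_numpy = a2.to_numpy().sum()

# Grand total: reduce over rows explicitly, then sum the per-column totals.
total_via_two_sums = a2.sum(axis=0).sum()

# Per-column totals, requested explicitly rather than via axis=None.
column_sums = a2.sum(axis=0)

print(total_via_numpy, total_via_two_sums)
print(column_sums.tolist())
```

Passing the axis explicitly keeps the intent clear whichever default is in effect.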

Thanks for tracing the source! What should my mental model of pandas.DataFrame be? Until now I've thought of it as a numpy.array, but given this, that's not always appropriate.
In my opinion, you should think of it as built using numpy arrays. Also, don't expect all numpy functions to behave the same; there will always be corner cases like this.
