In short, numpy.sum(a, axis=None) sums all cells of an array, but for a data frame it sums down the rows and returns one total per column. I thought pandas.DataFrame was built on top of numpy.ndarray and should not behave differently. What conversion happens under the hood?

import numpy
import pandas

a1 = numpy.random.random((3,2))
a2 = pandas.DataFrame(a1)
numpy.sum(a1) # Sums all cells
numpy.sum(a2) # Sums over rows (one total per column)
  • Odd, it looks like axis=None is being overridden and set to axis=0 when it goes through the df. Commented Mar 1, 2015 at 20:31
  • OK, I've tracked this down to line 3980 of github.com/pydata/pandas/blob/master/pandas/core/generic.py: because axis is None, it is assigned self._stat_axis_number, which is 0, hence the difference in behaviour. Commented Mar 1, 2015 at 20:43
  • Why not use numpy.sum(a2.values)? Commented Mar 2, 2015 at 7:07
  • Starting with pandas 2.0, pandas also does all cells. "For DataFrames, specifying axis=None will apply the aggregation across both axes." pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html Commented Jun 12, 2023 at 3:05

1 Answer

OK, the following is a dump of my pdb debugging session, which shows how the call ends up in pandas land:

In [*]:

a1 = np.random.random((3,2))
import pdb
a2 = pd.DataFrame(a1)
print(np.sum(a1)) # Sums all cells
pdb.set_trace()
np.sum(a2) # Sums over rows
3.02993889742
--Return--
> <ipython-input-50-92405dd4ed52>(5)<module>()->None
-> pdb.set_trace()
(Pdb) b 6
Breakpoint 2 at <ipython-input-50-92405dd4ed52>:6
(Pdb) c
> <ipython-input-50-92405dd4ed52>(6)<module>()->None
-> np.sum(a2) # Sums over rows
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1623)sum()
-> def sum(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) print(axis)
None
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1700)sum()
-> if isinstance(a, _gentype):
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1706)sum()
-> elif type(a) is not mu.ndarray:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1707)sum()
-> try:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1708)sum()
-> sum = a.sum
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1713)sum()
-> return sum(axis=axis, dtype=dtype, out=out)
(Pdb) print(axis)
None
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3973)stat_func()
-> @Substitution(outname=name, desc=desc)
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3977)stat_func()
-> if skipna is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3978)stat_func()
-> skipna = True
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3979)stat_func()
-> if axis is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3980)stat_func()
-> axis = self._stat_axis_number
(Pdb) print(self._stat_axis_number)
0
(Pdb) 

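So basically numpy.sum sees that a2 is not a plain ndarray and hands the call to a2.sum, passing axis=None along unchanged. Once it ends up in pandas land the arguments are normalised, and one of those steps replaces axis=None with self._stat_axis_number, which is 0 for this DataFrame. The sum therefore runs down the rows and returns one total per column instead of a single total over all cells, hence the difference in behaviour. I'm not a pandas dev, so they may be able to shed more light on this, but it explains the difference in output.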
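The hand-off seen in the trace can be reproduced without pandas. The following is only a minimal sketch: ColumnSummer is a hypothetical stand-in class (it is not part of numpy or pandas) whose sum method treats axis=None as axis 0, the way the traced pandas code does. It assumes numpy still delegates to an object's own sum method for non-ndarray inputs, as the trace shows; numpy internals may differ between versions.

```python
import numpy as np


class ColumnSummer:
    """Hypothetical stand-in for a DataFrame; not a pandas or numpy class."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.seen_axis = "not called"

    def sum(self, axis=None, dtype=None, out=None, **kwargs):
        # Record what numpy passed in, then mimic the traced pandas
        # behaviour: axis=None falls back to axis 0.
        self.seen_axis = axis
        if axis is None:
            axis = 0
        return self.data.sum(axis=axis)


values = np.arange(6.0).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
wrapped = ColumnSummer(values)

total = np.sum(values)        # plain ndarray: every cell is summed
per_column = np.sum(wrapped)  # delegated to ColumnSummer.sum

print(total)              # 15.0
print(per_column)         # [6. 9.]
print(wrapped.seen_axis)  # None
```

numpy passes axis=None through untouched; it is the receiving object that decides what None means.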

To get the same single total as for the array, you have to call sum twice:

In [6]:

a2.sum(axis=0).sum()
Out[6]:
3.9180334059883006

Or

In [7]:

np.sum(np.sum(a2))
Out[7]:
3.9180334059883006
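If only the grand total is needed, another option is to take the underlying array first, as suggested in the comments on the question. This is a sketch with a small fixed frame (the data is made up for illustration) and it assumes a pandas version that provides DataFrame.to_numpy; on older versions a2.values plays the same role. It avoids relying on how pandas interprets axis=None.

```python
import numpy as np
import pandas as pd

# Illustrative data, chosen so the totals are easy to check by hand.
a1 = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
a2 = pd.DataFrame(a1)

total_array = np.sum(a1)                    # ndarray: all cells
total_via_numpy = np.sum(a2.to_numpy())     # convert first, then sum all cells
total_via_twice = a2.sum(axis=0).sum()      # per-column totals, then their sum

print(total_array, total_via_numpy, total_via_twice)  # 21.0 21.0 21.0
```

Passing axis explicitly in the pandas call keeps the result independent of the default discussed above.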

2 Comments

Thanks for tracing the source! What should my mental model of pandas.DataFrame be? Until now I've thought of it as a numpy.array, but given this, that's not always appropriate.
In my opinion you should think of it as something built using numpy arrays. Also, don't expect all numpy functions to behave the same on it; there will always be corner cases like this.
