numpy.diff() use with pandas DataFrame error

Question

I have a pandas DataFrame with float64 values in the 'mass' column. I use np.diff() to compute the first difference of this data.

The problem: the size of the result changes depending on whether I pass data.mass or data.mass.values. Note that this 'bug' also shows up in the min, max, and mean, which are not the same between the two calls.

import pandas as pd
import numpy as np

data = pd.DataFrame({'time': np.arange(1,101), 'mass': np.random.randn(100)})
dm = np.diff(data.mass, n=1)
dmv = np.diff(data.mass.values, n=1)

print 'data.mass: \t\t', dm.shape
print 'min: ', dm.min(), ' max: ', dm.max(), ' mean: ', dm.mean()

print ''
print 'now using data.mass.values in the calculations \n'
print 'data.mass.values: \t', dmv.shape
print 'min: ', dmv.min(), ' max: ', dmv.max(), ' mean: ', dmv.mean()

The output of which is:

data.mass:      (100,)
min:  0.0  max:  0.0  mean:  0.0

now using data.mass.values in the calculations 

data.mass.values:   (99,)
min:  -3.49992599537  max:  2.52901842461  mean:  -0.00718375066572

Is this the expected behavior? Why would I need to use .values, given that I understood pandas DataFrames to be numpy arrays under the hood anyway?

Use data.diff(). DataFrames hold numpy arrays under the hood, and for the most part you can use numpy functions on them, but np.diff is not a well-behaved function (in fact it violates the numpy guarantees): it runs, but it does not return the correct object to the caller. This only shows up in a case like this one. It is 'fixed' in pandas 0.13 (coming very soon), which deals with this problem.
– Jeff, Commented Oct 27, 2013 at 0:42
@jeff Thanks for the clarification! I'll have to read up on what the numpy guarantees entail.
– not link, Commented Oct 27, 2013 at 0:55
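
A likely mechanism behind the all-zero, full-length result is pandas' label alignment. This is an assumption about what happened inside np.diff on pandas 0.12, not something confirmed in the thread. When two slices of a Series are subtracted, pandas matches values by index label rather than by position, so each element is subtracted from itself. A minimal sketch with a small hypothetical Series (written with Python 3 print calls, unlike the Python 2 listings in this thread):

```python
import numpy as np
import pandas as pd

# Small hypothetical Series, used only for illustration.
s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Label-aligned subtraction: labels 1 and 2 appear in both slices and are
# matched with themselves (0.0); labels 0 and 3 have no partner (NaN).
aligned = s.iloc[1:] - s.iloc[:-1]

# Positional subtraction on the raw arrays: the actual first difference.
positional = s.values[1:] - s.values[:-1]

print(aligned.tolist())     # [nan, 0.0, 0.0, nan]
print(positional.tolist())  # [3.0, 5.0, 7.0]
```

The aligned result has the same length as the input and contains only zeros and NaN, which is consistent with the (100,) shape and the 0.0 min, max, and mean reported in the question.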

not link · Accepted Answer · 2013-10-27 01:04:08Z

Based on @jeff's comment, using the pandas .diff() method (here data.mass.diff()) gives the correct results, as shown below. So this is just a bad interaction between a numpy function and the current version of pandas (numpy 1.7.1 for Python 2.7 and pandas 0.12.0).

import pandas as pd
import numpy as np

data = pd.DataFrame({'time': np.arange(1,101), 'mass': np.random.randn(100)})
dm = np.diff(data.mass, n=1)
dmv = np.diff(data.mass.values, n=1)

print 'data.mass: \t\t', dm.shape
print 'min: ', dm.min(), ' max: ', dm.max(), ' mean: ', dm.mean()

print ''
print 'now using data.mass.values in the calculations \n'
print 'data.mass.values: \t', dmv.shape
print 'min: ', dmv.min(), ' max: ', dmv.max(), ' mean: ', dmv.mean()

print ''
dm_p = data.mass.diff()
print 'now based on what @jeff said: '
print 'using .diff() : \t', dm_p.shape
print 'min: ', dm_p.min(), ' max: ', dm_p.max(), ' mean: ', dm_p.mean()

This outputs:

data.mass:      (100,)
min:  0.0  max:  0.0  mean:  0.0

now using data.mass.values in the calculations 

data.mass.values:   (99,)
min:  -3.54980400026  max:  3.33045231942  mean:  0.0326969806441

now based on what @jeff said: 
using .diff() :     (100,)
min:  -3.54980400026  max:  3.33045231942  mean:  0.0326969806441

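The min, max, and mean from .diff() match those from np.diff on data.mass.values, as expected. The shape stays (100,) because .diff() keeps the original length and puts NaN in the first position.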
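
To make the relationship between the two correct approaches explicit, here is a minimal sketch with a small hypothetical 'mass' column (fixed values instead of random ones, and Python 3 print calls). It shows that .diff() and np.diff on .values agree once the leading NaN is dropped, and that pandas reductions skip NaN by default, which is why the summary statistics match even though the shapes differ.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
data = pd.DataFrame({'time': np.arange(1, 6),
                     'mass': [2.0, 3.5, 3.0, 5.0, 4.0]})

dm_p = data['mass'].diff()          # Series: same length, first entry is NaN
dmv = np.diff(data['mass'].values)  # ndarray: one element shorter

print(dm_p.shape, dmv.shape)        # (5,) (4,)

# Dropping the leading NaN leaves exactly the values np.diff produced.
same_values = np.allclose(dm_p.dropna().values, dmv)

# Series.mean() skips NaN by default, so the means agree as well.
same_mean = np.isclose(dm_p.mean(), dmv.mean())

print(same_values, same_mean)
```

If a plain array of length n-1 is what you need, np.diff on .values is fine. If you want the result to stay aligned with the DataFrame's index, .diff() is the more natural choice.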
