I have a pandas DataFrame with float64 values in the 'mass' column. I use np.diff() to compute the first difference of this data.
The problem: the size of the result changes depending on whether I pass data.mass or data.mass.values. This 'bug' also shows up in the min, max, and mean, which differ between the two:
import pandas as pd
import numpy as np
data = pd.DataFrame({'time': np.arange(1,101), 'mass': np.random.randn(100)})
dm = np.diff(data.mass, n=1)
dmv = np.diff(data.mass.values, n=1)
print 'data.mass: \t\t', dm.shape
print 'min: ', dm.min(), ' max: ', dm.max(), ' mean: ', dm.mean()
print ''
print 'now using data.mass.values in the calculations \n'
print 'data.mass.values: \t', dmv.shape
print 'min: ', dmv.min(), ' max: ', dmv.max(), ' mean: ', dmv.mean()
The output of which is:
data.mass: (100,)
min: 0.0 max: 0.0 mean: 0.0
now using data.mass.values in the calculations
data.mass.values: (99,)
min: -3.49992599537 max: 2.52901842461 mean: -0.00718375066572
Is this the expected behavior? Why would I need to use .values, given that I understood pandas DataFrames to be numpy arrays under the hood anyway?
Use data.diff() instead. DataFrames hold numpy arrays under the hood, and for the most part you can use numpy functions on them, BUT np.diff is not a well-behaved function (in fact it violates the numpy guarantees): it runs, but it doesn't return the correct object to the caller. This only shows up in an example just like this one. This is 'fixed' in pandas 0.13 (coming very soon), which deals with this problem.
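
As a minimal sketch of the two approaches side by side (assuming a reasonably recent pandas and NumPy, with a seeded generator added here only so the run is repeatable), the pandas method keeps the original length and index and puts NaN in the first slot, while np.diff on the raw array returns one element fewer:

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for repeatability, not part of the question)
rng = np.random.RandomState(0)
data = pd.DataFrame({'time': np.arange(1, 101), 'mass': rng.randn(100)})

# pandas-native first difference: same length as the input, first entry is NaN
dm = data['mass'].diff()

# NumPy first difference on the underlying array: one element shorter
dmv = np.diff(data['mass'].values, n=1)

print(dm.shape, dmv.shape)

# After dropping the leading NaN, both approaches give the same numbers
same = np.allclose(dm.dropna().values, dmv)
print(same)
```

If a plain array without the leading NaN is needed, dm.dropna().values gives the same 99 values as the np.diff call above.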