
I have a pandas DataFrame with float64 values in the 'mass' column. I use np.diff() to compute the first difference of this data.

The problem: the size of the result changes depending on whether I pass data.mass or data.mass.values. This 'bug' also shows up in the min, max, and mean, which are not the same:

import pandas as pd
import numpy as np

data = pd.DataFrame({'time': np.arange(1,101), 'mass': np.random.randn(100)})
dm = np.diff(data.mass, n=1)
dmv = np.diff(data.mass.values, n=1)

print 'data.mass: \t\t', dm.shape
print 'min: ', dm.min(), ' max: ', dm.max(), ' mean: ', dm.mean()

print ''
print 'now using data.mass.values in the calculations \n'
print 'data.mass.values: \t', dmv.shape
print 'min: ', dmv.min(), ' max: ', dmv.max(), ' mean: ', dmv.mean()

The output of which is:

data.mass:      (100,)
min:  0.0  max:  0.0  mean:  0.0

now using data.mass.values in the calculations 

data.mass.values:   (99,)
min:  -3.49992599537  max:  2.52901842461  mean:  -0.00718375066572

Is this the expected behavior? Why would I need to use .values? I understood pandas DataFrames to be numpy arrays under the hood anyway.

  • Use data.diff(). DataFrames hold numpy arrays under the hood, and for the most part you can use numpy functions on them, BUT np.diff is not a well-behaved function (in fact it violates the numpy guarantees): it runs, but it doesn't return the correct objects to the caller. This only shows up in an example just like this one. It is 'fixed' in pandas 0.13 (coming very soon). Commented Oct 27, 2013 at 0:42
  • @jeff Thanks for the clarification! I'll have to read up on what the numpy guarantees entail. Commented Oct 27, 2013 at 0:55

1 Answer

Based on @jeff's comments, using the pandas .diff() method does give the correct results, as the code and output here show. So this is clearly just a bad interaction between a numpy function and the current version of pandas (numpy 1.7.1 for Python 2.7 and pandas 0.12.0).

import pandas as pd
import numpy as np

data = pd.DataFrame({'time': np.arange(1,101), 'mass': np.random.randn(100)})
dm = np.diff(data.mass, n=1)
dmv = np.diff(data.mass.values, n=1)

print 'data.mass: \t\t', dm.shape
print 'min: ', dm.min(), ' max: ', dm.max(), ' mean: ', dm.mean()

print ''
print 'now using data.mass.values in the calculations \n'
print 'data.mass.values: \t', dmv.shape
print 'min: ', dmv.min(), ' max: ', dmv.max(), ' mean: ', dmv.mean()

print ''
dm_p = data.mass.diff()
print 'now based on what @jeff said: '
print 'using .diff() : \t', dm_p.shape
print 'min: ', dm_p.min(), ' max: ', dm_p.max(), ' mean: ', dm_p.mean()

This outputs:

data.mass:      (100,)
min:  0.0  max:  0.0  mean:  0.0

now using data.mass.values in the calculations 

data.mass.values:   (99,)
min:  -3.54980400026  max:  3.33045231942  mean:  0.0326969806441

now based on what @jeff said: 
using .diff() :     (100,)
min:  -3.54980400026  max:  3.33045231942  mean:  0.0326969806441

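as expected. Note that .diff() keeps the original length of 100: the first element is NaN because it has no predecessor, and min(), max() and mean() skip it, which is why the statistics match the 99-element result from data.mass.values.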
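To make the relationship between the two correct results explicit, here is a minimal sketch that compares np.diff on the raw array with the pandas .diff() method. The fixed seed (RandomState(0)) and the single-argument print calls are my own choices, not part of the original post; the latter just lets the snippet run on both Python 2 and Python 3. It was not tested against pandas 0.12 specifically.

```python
import numpy as np
import pandas as pd

# Fixed seed: an assumption added here so the run is reproducible.
rng = np.random.RandomState(0)
data = pd.DataFrame({'time': np.arange(1, 101), 'mass': rng.randn(100)})

# NumPy on the raw array: n - 1 differences, nothing stands in for the first row.
dmv = np.diff(data['mass'].values, n=1)

# pandas method: same length as the input, index preserved, first entry is NaN.
dm_p = data['mass'].diff()

print(dmv.shape)                                   # (99,)
print(dm_p.shape)                                  # (100,)
print(np.isnan(dm_p.iloc[0]))                      # True
print(np.allclose(dm_p.iloc[1:].values, dmv))      # True
```

Apart from the leading NaN, the two results hold the same numbers, so which one to use mostly depends on whether you want the differences to stay aligned with the DataFrame's index.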
