Pandas vs. Numpy Dataframes

Question

Look at these few lines of code:

df2 = df.copy()
df2[1:] = df[1:] / df[:-1].values - 1
df2.iloc[0, :] = 0

Our instructor said we need to use the .values attribute to access the underlying NumPy array; otherwise, our code wouldn't work.

I understand that a pandas DataFrame has an underlying representation as a NumPy array, but I don't understand why we cannot operate directly on the pandas DataFrame using just slicing.

Could you explain why?

user2285236 · Accepted Answer · 2017-05-07 15:25:15Z

8

pandas is built around labeled, tabular data structures. When it performs arithmetic (addition, subtraction, etc.) between two objects, it matches values by their labels, not by their positions.

Consider the following DataFrame:

df = pd.DataFrame(np.random.randn(5, 3), index=list('abcde'), columns=list('xyz'))

Here, df[1:] is:

df[1:]
Out: 
          x         y         z
b  1.003035  0.172960  1.160033
c  0.117608 -1.114294 -0.557413
d -1.312315  1.171520 -1.034012
e -0.380719 -0.422896  1.073535

And df[:-1] is:

df[:-1]
Out: 
          x         y         z
a  1.367916  1.087607 -0.625777
b  1.003035  0.172960  1.160033
c  0.117608 -1.114294 -0.557413
d -1.312315  1.171520 -1.034012

If you compute df[1:] / df[:-1], pandas divides row b by row b, row c by row c, and row d by row d. Rows a and e each appear in only one of the two DataFrames, so pandas cannot find a matching row for them and returns NaN:

df[1:] / df[:-1]
Out: 
     x    y    z
a  NaN  NaN  NaN
b  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0
e  NaN  NaN  NaN
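
The alignment behavior above can be reproduced with fixed numbers instead of random ones. This is a minimal sketch: the three-row frame and the names top, bottom, and aligned are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Small frame with known values so the result is easy to check by hand.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 4.0], "y": [10.0, 20.0, 40.0]},
    index=list("abc"),
)

top = df[:-1]    # rows a, b
bottom = df[1:]  # rows b, c

# pandas aligns both operands on their row labels before dividing,
# so the result covers every label that appears in either operand.
aligned = bottom / top

print(aligned)
# Row b is the only label present in both operands, so it is the only
# row with numbers (b / b = 1.0); rows a and c are NaN.
```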

If you just want element-wise division that ignores the labels, accessing the underlying NumPy array with .values on one of the frames tells pandas to skip label alignment. Since NumPy arrays don't have labels, pandas simply operates element by element, by position:

df[1:]/df[:-1].values
Out: 
           x         y         z
b   0.733258  0.159028 -1.853749
c   0.117252 -6.442482 -0.480515
d -11.158359 -1.051357  1.855018
e   0.290112 -0.360981 -1.038223
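
For comparison, here is a sketch of the same positional division next to two label-aware ways of getting the same numbers. It assumes pandas 0.24 or later, where .to_numpy() is available as an alternative to .values; the frame and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 2.0, 4.0], "y": [10.0, 20.0, 40.0]},
    index=list("abc"),
)

# Positional division: the right-hand side is a plain ndarray,
# so there are no labels to align on.
ratio = df[1:] / df[:-1].to_numpy()

# Label-aware alternative: shift(1) moves every row down one position
# while keeping the index, so the labels already line up.
ratio_shift = (df / df.shift(1))[1:]

# pct_change() computes current / previous - 1 for each row,
# which is the quantity the snippet in the question builds by hand.
returns = df.pct_change()

print(ratio)
print(returns)
```

Which form to prefer is a judgment call: the .values or .to_numpy() version mirrors the original code, while shift and pct_change keep the whole computation inside pandas.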



2 Comments

MadHatter Over a year ago

Now I understand that the final result would have been the same, but I wonder whether it would have been more formally correct to use a NumPy array for the numerator, too.

user2285236 Over a year ago

In that case, the whole operation happens in NumPy, so it returns an array without labels. Note that the output of df[1:]/df[:-1].values is a DataFrame. Which one to use depends on your needs.
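
The difference in return types can be checked directly. A small sketch with a made-up one-column frame; the names mixed and pure_numpy are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 4.0]}, index=list("abc"))

mixed = df[1:] / df[:-1].values              # DataFrame / ndarray
pure_numpy = df[1:].values / df[:-1].values  # ndarray / ndarray

print(type(mixed).__name__)       # DataFrame
print(type(pure_numpy).__name__)  # ndarray
print(list(mixed.index))          # ['b', 'c'], labels come from the numerator
```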
