I recently noticed that numpy.var() and pandas.DataFrame.var() (or pandas.Series.var()) give different values. I want to know whether there is any difference between them.
Here is my dataset.
   Country    GDP   Area      Continent
0  India     2.79  3.287           Asia
1  USA      20.54  9.840  North America
2  China    13.61  9.590           Asia
Here is my code:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
catDf.iloc[:,1:-1] = ss.fit_transform(catDf.iloc[:,1:-1])
Now checking the pandas variance:
# Pandas Variance
print(catDf.var())
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())
The output is
GDP 1.5
Area 1.5
dtype: float64
GDP 1.5
Area 1.5
dtype: float64
1.5000000000000002
1.5000000000000002
Whereas it should be 1, since I have applied StandardScaler to it.
And for the numpy variance:
print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())
The output is
1.0000000000000002
1.0000000000000002
1.0000000000000002
Which seems correct.
pandas uses a ddof of 1 (sample variance) by default, whereas numpy has it at 0 (population variance). Try catDf.iloc[:,1:-1].var(ddof=0). With n = 3 rows, the sample variance is the population variance scaled by n/(n-1) = 3/2, which is exactly the factor of 1.5 you are seeing.
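A minimal sketch of the difference, using the GDP column from the question and standardizing by hand with the population std (so sklearn isn't needed; StandardScaler does the same ddof=0 scaling internally):

```python
import numpy as np
import pandas as pd

# GDP values from the question's dataset
data = np.array([2.79, 20.54, 13.61])

# Standardize with the population std (ddof=0), as StandardScaler does
scaled = (data - data.mean()) / data.std()

s = pd.Series(scaled)
print(s.var())          # pandas default ddof=1 -> ≈ 1.5 (sample variance)
print(s.var(ddof=0))    # ddof=0 -> ≈ 1.0, matches numpy
print(np.var(scaled))   # numpy default ddof=0 -> ≈ 1.0
```

Passing `ddof=0` to the pandas call (or `ddof=1` to numpy) makes the two agree; neither result is wrong, they just estimate different quantities.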