
I was working through a Dataquest exercise and noticed that the variance I get differs between the two packages.

For example, for [1, 2, 3, 4]:

from statistics import variance
import numpy as np

print(np.var([1, 2, 3, 4]))    # 1.25
print(variance([1, 2, 3, 4]))  # 1.6666666666666667

The exercise's expected answer is calculated with np.var().

Edit: I guess this is because the latter is the sample variance rather than the population variance. Could anyone explain the difference?
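
For reference, both numbers can be reproduced by hand; a minimal sketch in plain Python, using the same list:

data = [1, 2, 3, 4]
n = len(data)
mean = sum(data) / n                     # 2.5
ss = sum((x - mean) ** 2 for x in data)  # 5.0, sum of squared deviations

print(ss / n)        # 1.25               (divide by N: np.var's default)
print(ss / (n - 1))  # 1.6666666666666667 (divide by N - 1: statistics.variance)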

2 Comments
  • Cross-site duplicate: stats.stackexchange.com/q/17890. Commented Dec 18, 2016 at 0:35
  • Try help(np.var), which will show you the options available for sample and population statistics: np.var([1,2,3,4], ddof=0) => 1.25, and np.var([1,2,3,4], ddof=1) => 1.6666666666666667. Commented Dec 18, 2016 at 0:37

2 Answers


Use this:

print(np.var([1,2,3,4],ddof=1))

1.66666666667

Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.

The variance is the average of the squared deviations from the mean; that average is normally calculated as the sum divided by N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.

In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
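
As a quick check of the unbiasedness claim, a simulation sketch (assuming normally distributed data; the seed and sample sizes are arbitrary):

import numpy as np

# Draw many small samples from a population with known variance and
# average each estimator: ddof=1 should be unbiased, ddof=0 biased low.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 2.0, size=(100_000, 4))  # 100k samples of size N=4, true variance 4.0

print(np.var(samples, axis=1, ddof=0).mean())  # ~3.0  (biased: (N-1)/N * 4.0)
print(np.var(samples, axis=1, ddof=1).mean())  # ~4.0  (unbiased)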

Statistical libraries like numpy use the divisor n for what they call var or variance, and the divisor n - 1 for the standard deviation.

For more information, see the numpy documentation: numpy doc
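
To see the defaults side by side (a sketch; note that the standard library also provides pvariance and pstdev as the population-statistics counterparts):

import numpy as np
from statistics import variance, pvariance, stdev, pstdev

data = [1, 2, 3, 4]

print(np.var(data), np.std(data))     # 1.25 1.118...      (numpy defaults to ddof=0 for both)
print(pvariance(data), pstdev(data))  # 1.25 1.118...      (population: divisor N)
print(variance(data), stdev(data))    # 1.666... 1.290...  (sample: divisor N - 1)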


1 Comment

The last bit is incorrect. numpy uses ddof=0 for both var() and std() by default. Other libraries choose ddof=1 by default for both. I know of no library that uses one convention for variance and another for standard deviation.

It is correct that dividing by N-1 gives an unbiased estimate for the mean, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum variance estimate for the mean, which is likely to be closer to the true mean than the unbiased estimate, as well as being somewhat simpler.
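
Reading the claim as being about estimators of the variance, a simulation sketch of the bias/MSE trade-off (assuming normally distributed data; parameters are arbitrary):

import numpy as np

# Compare mean squared error of the divisor-N estimator (ddof=0) and the
# divisor-(N-1) estimator (ddof=1) against the known true variance.
rng = np.random.default_rng(0)
true_var = 1.0
samples = rng.normal(0.0, 1.0, size=(200_000, 5))  # 200k samples of size N=5

mse_n   = np.mean((np.var(samples, axis=1, ddof=0) - true_var) ** 2)
mse_nm1 = np.mean((np.var(samples, axis=1, ddof=1) - true_var) ** 2)
print(mse_n, mse_nm1)  # ~0.36 vs ~0.50: for normal data, dividing by N gives lower MSE here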

2 Comments

Source, please. I have been looking at examples where the unbiased sample variance is much more accurate over large sample sizes.
What do you mean "dividing by N-1 gives an unbiased estimate for the mean"? The mean of the samples is obviously an unbiased estimate of the mean, so dividing the sum of samples by N-1 gives us a biased estimate for the mean.
