
I was working through a Dataquest exercise and noticed that the variance I get differs between the two packages.

For example, for [1, 2, 3, 4]:

from statistics import variance
import numpy as np

print(np.var([1, 2, 3, 4]))    # 1.25
print(variance([1, 2, 3, 4]))  # 1.6666666666666667

The exercise's expected answer is calculated with np.var().

Edit: I guess this is because the latter is the sample variance rather than the population variance. Could anyone explain the difference?
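
For reference, both numbers can be reproduced by hand; a minimal sketch in plain Python, using the same list:

data = [1, 2, 3, 4]
n = len(data)
mean = sum(data) / n                     # 2.5
ss = sum((x - mean) ** 2 for x in data)  # 5.0, sum of squared deviations

print(ss / n)        # 1.25               (divide by N: np.var's default)
print(ss / (n - 1))  # 1.6666666666666667 (divide by N - 1: statistics.variance)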

2 Comments
  • Cross-site duplicate: stats.stackexchange.com/q/17890. Commented Dec 18, 2016 at 0:35
  • Try help(np.var), which will show you the options available for sample and population statistics: np.var([1,2,3,4], ddof=0) => 1.25, and np.var([1,2,3,4], ddof=1) => 1.6666666666666667. Commented Dec 18, 2016 at 0:37

2 Answers


Use this:

print(np.var([1,2,3,4],ddof=1))

1.66666666667

Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.

The variance is the average of the squared deviations from the mean; that average is normally calculated as the sum divided by N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.

In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
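
As a quick check of the unbiasedness claim, a simulation sketch (assuming normally distributed data; the seed and sample sizes are arbitrary):

import numpy as np

# Draw many small samples from a population with known variance and
# average each estimator: ddof=1 should be unbiased, ddof=0 biased low.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 2.0, size=(100_000, 4))  # 100k samples of size N=4, true variance 4.0

print(np.var(samples, axis=1, ddof=0).mean())  # ~3.0  (biased: (N-1)/N * 4.0)
print(np.var(samples, axis=1, ddof=1).mean())  # ~4.0  (unbiased)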

Statistical libraries like numpy use the divisor n for what they call var or variance, and the divisor n - 1 for the standard deviation.

For more information, see the numpy documentation: numpy doc
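
To see the defaults side by side (a sketch; note that the standard library also provides pvariance and pstdev as the population-statistics counterparts):

import numpy as np
from statistics import variance, pvariance, stdev, pstdev

data = [1, 2, 3, 4]

print(np.var(data), np.std(data))     # 1.25 1.118...      (numpy defaults to ddof=0 for both)
print(pvariance(data), pstdev(data))  # 1.25 1.118...      (population: divisor N)
print(variance(data), stdev(data))    # 1.666... 1.290...  (sample: divisor N - 1)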


1 Comment

The last bit is incorrect. numpy uses ddof=0 for both var() and std() by default. Other libraries choose ddof=1 by default for both. I know of no library that uses one convention for variance and another for standard deviation.

It is correct that dividing by N-1 gives an unbiased estimate for the mean, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum variance estimate for the mean, which is likely to be closer to the true mean than the unbiased estimate, as well as being somewhat simpler.
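
Reading the claim as being about estimators of the variance, a simulation sketch of the bias/MSE trade-off (assuming normally distributed data; parameters are arbitrary):

import numpy as np

# Compare mean squared error of the divisor-N estimator (ddof=0) and the
# divisor-(N-1) estimator (ddof=1) against the known true variance.
rng = np.random.default_rng(0)
true_var = 1.0
samples = rng.normal(0.0, 1.0, size=(200_000, 5))  # 200k samples of size N=5

mse_n   = np.mean((np.var(samples, axis=1, ddof=0) - true_var) ** 2)
mse_nm1 = np.mean((np.var(samples, axis=1, ddof=1) - true_var) ** 2)
print(mse_n, mse_nm1)  # ~0.36 vs ~0.50: for normal data, dividing by N gives lower MSE here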

2 Comments

Source, please. I have been looking at examples where the unbiased sample variance is much more accurate over large sample sizes.
What do you mean "dividing by N-1 gives an unbiased estimate for the mean"? The mean of the samples is obviously an unbiased estimate of the mean, so dividing the sum of samples by N-1 gives us a biased estimate for the mean.
