
Let's say I have a data set and used matplotlib to draw a histogram of said data set.

n, bins, patches = plt.hist(data, normed=1)

How do I calculate the standard deviation, using the n and bins values that hist() returns? I'm currently doing this to calculate the mean:

s = 0
for i in range(len(n)):
    s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / np.sum(n)

which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:

t = 0
for i in range(len(n)):
    t += (bins[i] - mean)**2
std = np.sqrt(t / np.sum(n))

my results are way off from what numpy.std(data) returns. Replacing the left bin edges with the midpoint of each bin doesn't change this either. I suspect the problem is that the n and bins values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on explicitly requires me to use them to calculate the standard deviation.

  • I have access to it, but the assignment explicitly states that I'm not supposed to use the original data. I think the whole wording ("These values are very useful for computing the mean, variance or other attributes of your distribution.") is what confused me, since it didn't mention anything about the results being only approximations. :)

2 Answers


You haven't weighted the contribution of each bin with n[i]. Change the increment of t to

    t += n[i]*(bins[i] - mean)**2
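With that change, the full corrected loop from the question would look like this (just a sketch; it uses the bin midpoints, as in the mean calculation, and the n, bins and mean variables from the question):

    t = 0
    for i in range(len(n)):
        mid = (bins[i] + bins[i+1]) / 2   # bin midpoint
        t += n[i] * (mid - mean)**2       # weight each bin by its count
    std = np.sqrt(t / np.sum(n))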

By the way, you can simplify (and speed up) your calculation by using numpy.average with the weights argument.

Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.

In [54]: x = np.random.normal(loc=10, scale=2, size=1000)

In [55]: x.mean()
Out[55]: 9.9760798903061847

In [56]: x.var()
Out[56]: 3.7673459904902025

In [57]: x.std()
Out[57]: 1.9409652213499866

I'll use numpy.histogram to compute the histogram:

In [58]: n, bins = np.histogram(x)

mids is the midpoints of the bins; it has the same length as n:

In [59]: mids = 0.5*(bins[1:] + bins[:-1])

The estimate of the mean is the weighted average of mids:

In [60]: mean = np.average(mids, weights=n)

In [61]: mean
Out[61]: 9.9763028267760312

In this case, it is pretty close to the mean of the original data.

The estimated variance is the weighted average of the squared difference from the mean:

In [62]: var = np.average((mids - mean)**2, weights=n)

In [63]: var
Out[63]: 3.8715035807387328

In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677

That estimate is within 2% of the actual sample standard deviation.
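If you want a reusable version of these steps, a small helper along these lines would do (a sketch only; the name hist_mean_std is my own):

    def hist_mean_std(n, bins):
        # Midpoints of the bins; same length as the counts n.
        mids = 0.5*(bins[1:] + bins[:-1])
        mean = np.average(mids, weights=n)
        var = np.average((mids - mean)**2, weights=n)
        return mean, np.sqrt(var)

Calling hist_mean_std(n, bins) then returns the same estimates computed above.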


2 Comments

Thanks, totally forgot that! However, my results are still a bit inaccurate (something like 0.19 vs 0.17 with numpy). Am I right to assume that you can only get an approximate value for the standard deviation from a histogram, or is there something else I'm missing?
That's right, you can't expect the values computed using the histogram to match the values computed using the full data set. The histogram loses information.
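(If you want to see that effect, a rough sketch: recompute the estimate with more bins, and it will usually move closer to np.std(data), though it still won't match exactly, since the positions of the points within each bin are discarded.)

    n, bins = np.histogram(data, bins=100)   # finer binning than the default 10
    mids = 0.5*(bins[1:] + bins[:-1])
    mean = np.average(mids, weights=n)
    std = np.sqrt(np.average((mids - mean)**2, weights=n))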

The following answer is equivalent to Warren Weckesser's, but maybe more familiar to those who prefer to write the mean as an expected value:

counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)

mean = np.sum(probs * mids)  
sd = np.sqrt(np.sum(probs * (mids - mean)**2))

Do take note that in certain contexts you may want the unbiased sample variance, where the sum of squared deviations is normalized by N - 1 rather than N.
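For instance, a sketch of that correction applied to the estimate above (with N the total number of binned points):

    N = np.sum(counts)
    # Bessel's correction: divide by N - 1 instead of N for the unbiased variance.
    var_unbiased = np.sum(counts * (mids - mean)**2) / (N - 1)
    sd_unbiased = np.sqrt(var_unbiased)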

