NumPy way to determine the value(s) in an array that cause a high variance

Question

Is there a NumPy way to determine which value(s) in an array are causing a high variance?

Consider the set of numbers:

array([164, 202, 164, 164, 164, 166], dtype=uint16)

A quick scan reveals that 202 is the value causing the high variance:

>>> np.var(np.array([164, 202, 164, 164, 164, 166]))
196.88888888888886

Removing 202 from the list reduces the variance considerably:

>>> np.var(np.array([164, 164, 164, 164, 166]))
0.64000000000000012

But how do I determine the offending value?

unutbu · Accepted Answer · 2013-08-17 10:51:09Z

Suppose this is your data:

In [19]: import numpy as np
In [167]: x = np.array([164, 202, 164, 164, 164, 166], dtype=np.uint16)

Here is a boolean array indicating which values in x are more than 1 standard deviation away from the mean:

In [170]: abs(x-x.mean()) > x.std()
Out[170]: array([False,  True, False, False, False, False], dtype=bool)
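
To see why only the second element is flagged, it can help to look at the two quantities being compared. The following is a small sketch on the same data; the names `deviation` and `sigma` are introduced here purely for illustration.

```python
import numpy as np

x = np.array([164, 202, 164, 164, 164, 166], dtype=np.uint16)

# Absolute distance of every element from the mean.
# x.mean() is a float, so the subtraction is carried out in float64
# and the unsigned dtype cannot wrap around.
deviation = np.abs(x - x.mean())

# Population standard deviation (ddof=0, the NumPy default).
sigma = x.std()

print(np.round(deviation, 2))  # 202 is about 31.33 from the mean, the rest under 7
print(round(float(sigma), 2))  # about 14.03
```

Only the deviation of 202 exceeds the standard deviation, which is why the mask has a single True entry.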

We can use the boolean array as a mask (a form of so-called "fancy indexing") to retrieve the values that are more than 1 standard deviation away from the mean:

In [171]: x[abs(x-x.mean()) > x.std()]
Out[171]: array([202], dtype=uint16)

Or, reverse the inequality to get the data with the "outliers" removed:

In [172]: x[abs(x-x.mean()) <= x.std()]
Out[172]: array([164, 164, 164, 164, 166], dtype=uint16)
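
The two indexing expressions can also be wrapped into one helper with an adjustable threshold. This is only a sketch: the function name `split_outliers` and the parameter `k` (the number of standard deviations) are assumptions made for illustration, not part of NumPy, and the 1-standard-deviation cutoff is a rule of thumb that happens to suit this small sample rather than a general outlier test.

```python
import numpy as np

def split_outliers(x, k=1.0):
    """Return (inliers, outliers) using a k-standard-deviation cutoff.

    Hypothetical helper; `k` is an illustrative parameter.
    """
    x = np.asarray(x)
    # Boolean mask: True where a value lies more than k standard
    # deviations away from the mean.
    mask = np.abs(x - x.mean()) > k * x.std()
    return x[~mask], x[mask]

x = np.array([164, 202, 164, 164, 164, 166], dtype=np.uint16)
inliers, outliers = split_outliers(x)
print(outliers)         # [202]
print(inliers)          # [164 164 164 164 166]
print(np.var(inliers))  # far smaller than np.var(x)
```

With a stricter cutoff such as `k=3`, nothing in this sample is flagged, since 202 lies only about 2.2 standard deviations from the mean.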

answered Aug 17, 2013 at 10:45
