
Is there a NumPy way to determine the value(s) in an array that cause a high variance?

Consider the set of numbers

array([164, 202, 164, 164, 164, 166], dtype=uint16)

A quick scan reveals that 202 is the value responsible for the high variance:

>>> np.var(np.array([164, 202, 164, 164, 164, 166]))
196.88888888888886

Removing 202 from the list reduces the variance considerably:

>>> np.var(np.array([164, 164, 164, 164, 166]))
0.64000000000000012

But how can I determine the offending value programmatically?

1 Answer


Suppose this is your data:

In [19]: import numpy as np
In [167]: x = np.array([164, 202, 164, 164, 164, 166], dtype=np.uint16)

Here is a boolean array indicating which values in x are more than 1 standard deviation away from the mean:

In [170]: abs(x-x.mean()) > x.std()
Out[170]: array([False,  True, False, False, False, False], dtype=bool)

We can use the boolean array as a mask (boolean array indexing, one form of so-called "fancy" indexing) to retrieve the values which are more than 1 standard deviation away from the mean:

In [171]: x[abs(x-x.mean()) > x.std()]
Out[171]: array([202], dtype=uint16)

Or, reverse the inequality to get the data with the "outliers" removed:

In [172]: x[abs(x-x.mean()) <= x.std()]
Out[172]: array([164, 164, 164, 164, 166], dtype=uint16)
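
The same idea can be wrapped up for reuse. What follows is only a rough sketch, not a definitive method: the function names `sigma_mask` and `leave_one_out_var` and the threshold parameter `k` are my own inventions for illustration. The first function generalizes the 1-standard-deviation test above to `k` standard deviations. The second takes the question more literally and computes the variance that remains after dropping each element in turn, so the element whose removal leaves the smallest variance is the likeliest culprit. Both cast to float first, on the assumption that you want to stay clear of unsigned integer wraparound.

```python
import numpy as np

def sigma_mask(values, k=1.0):
    """True where a value lies more than k standard deviations from the mean.

    `k` is an illustrative parameter; k=1.0 reproduces the test used above.
    """
    values = np.asarray(values, dtype=float)  # float avoids uint wraparound
    return np.abs(values - values.mean()) > k * values.std()

def leave_one_out_var(values):
    """Variance of the data with each element removed in turn."""
    values = np.asarray(values, dtype=float)
    return np.array([np.var(np.delete(values, i)) for i in range(values.size)])

x = np.array([164, 202, 164, 164, 164, 166], dtype=np.uint16)

mask = sigma_mask(x, k=1.0)
outliers = x[mask]   # values flagged as far from the mean
kept = x[~mask]      # data with the flagged values removed

loo = leave_one_out_var(x)
worst = int(np.argmin(loo))  # index whose removal lowers the variance most
print(outliers, kept, x[worst], loo[worst])
```

For this data both approaches point at 202. Keep in mind that a large outlier inflates the mean and standard deviation it is being compared against, so the right `k` depends on your data, and the leave-one-out loop is only practical for small arrays.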