Remove outliers from numpy array, column wise

Question

I have a large dataset (over 10k columns) whose values fall pretty much within the same range except for some outliers. I need to remove these outliers. Consider the following example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = np.array([(1,18,1,1,1,1),
                 (1,18,2,3,2,1),
                 (1,22,1,2,2,2),
                 (2,22,3,1,3,1),
                 (1,19,1,10,10,3),
                 (1,22,3,2,1,3),
                 (10,20,3,1,3,10),
                 (2,20,1,3,2,1)])

If I create a per-column boxplot, I can clearly see the outliers.

df = pd.DataFrame(data, columns=['a','b','c','d','e','f'])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()

The goal is to iterate through the array column-wise and remove a row every time one of its values is flagged as an outlier for that variable (column). This would remove rows 4 and 6 (zero-indexed, out of rows 0 to 7). I've been trying to make the following work:

for i in range(data.shape[1]):
    mean = np.mean(data[:,i])
    print(mean)
    standard_deviation = np.std(data[:,i])
    print(standard_deviation)
    distance_from_mean = abs(data[:,i] - mean)
    max_deviations = 2
    not_outlier = distance_from_mean < max_deviations * standard_deviation
    data[:,i] = data[:,i][not_outlier]

This produces the following error: "ValueError: could not broadcast input array from shape (7) into shape (8)"

I believe my lack of understanding of array indexing is at fault here. Or maybe there is a better way to achieve this?

Thanks in advance!

Nico Schlömer · Accepted Answer · 2020-05-12 10:16:08Z

First use numpy.any to find the rows that contain outliers, then discard them.

import numpy as np

data = np.array(
    [
        [1, 1, 1, 1, 1, 1],
        [2, 1, 2, 1, 2, 3],
        [1, 3, 1, 2, 2, 2],
        [2, 2, 3, 1, 3, 1],
        [1, 1, 1, 10, 10, 3],
        [1, 2, 3, 2, 1, 3],
        [10, 2, 3, 1, 3, 10],
        [2, 2, 1, 3, 2, 1],
    ]
)

threshold = 5
has_outlier = np.any(data > threshold, axis=1)
data = data[~has_outlier]
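
If outliers should be judged against each column's own statistics rather than a single global threshold, the same numpy.any pattern can be applied to a per-column mask. The following is a minimal sketch, not a definitive implementation: it assumes the rule from the question (a value is an outlier when it lies more than 2 standard deviations from its column mean) and uses the question's example data. The names is_outlier and cleaned are chosen here for illustration. It also avoids the ValueError from the question, which occurs because a boolean-filtered column (7 values) cannot be assigned back into a full column (8 values); here whole rows are dropped in one step instead.

```python
import numpy as np

data = np.array([(1, 18, 1, 1, 1, 1),
                 (1, 18, 2, 3, 2, 1),
                 (1, 22, 1, 2, 2, 2),
                 (2, 22, 3, 1, 3, 1),
                 (1, 19, 1, 10, 10, 3),
                 (1, 22, 3, 2, 1, 3),
                 (10, 20, 3, 1, 3, 10),
                 (2, 20, 1, 3, 2, 1)])

max_deviations = 2  # assumed cutoff, taken from the question

# Per-column statistics: axis=0 reduces over rows, giving one value per column.
mean = data.mean(axis=0)
std = data.std(axis=0)

# Broadcasting compares every value against its own column's mean and std.
is_outlier = np.abs(data - mean) > max_deviations * std

# A row is dropped if any of its columns is flagged.
has_outlier = np.any(is_outlier, axis=1)
cleaned = data[~has_outlier]

print(np.flatnonzero(has_outlier))  # indices of the removed rows
print(cleaned.shape)
```

Note that the mean and standard deviation are computed once on the full array, so removing a row because of one column does not change the statistics used for the other columns. Column b (values 18 to 22) is not flagged here, even though its values would exceed a global threshold such as 5.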

1 Comment

jcf Over a year ago

Thanks! Your answer made me realize the problem I have is a bit more specific. I've updated the example to reflect a key issue. Now the row with index 1 has values that are outliers for the entire array but are not outliers in that specific column. This is why I started with a for-loop iterating over each column, to check its mean and standard deviation and flag rows that contain values beyond that threshold.
