I have a large dataset (over 10k columns) whose values mostly fall within the same range, apart from some outliers. I need to remove these outliers. Consider the following example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = np.array([(1,18,1,1,1,1),
                 (1,18,2,3,2,1),
                 (1,22,1,2,2,2),
                 (2,22,3,1,3,1),
                 (1,19,1,10,10,3),
                 (1,22,3,2,1,3),
                 (10,20,3,1,3,10),
                 (2,20,1,3,2,1)])

If I create a per-column boxplot, I can clearly see the outliers.

df = pd.DataFrame(data, columns=['a','b','c','d','e','f'])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()

The goal is to iterate through the array column-wise and remove a row every time one of its values is flagged as an outlier for that variable (column). In this example, that would remove the rows with index 4 and 6 (out of 0 to 7). I've been trying to make the following work:

for i in range(data.shape[1]):
    mean = np.mean(data[:,i])
    print(mean)
    standard_deviation = np.std(data[:,i])
    print(standard_deviation)
    distance_from_mean = abs(data[:,i] - mean)
    max_deviations = 2
    not_outlier = distance_from_mean < max_deviations * standard_deviation
    data[:,i] = data[:,i][not_outlier]

This produces the following error: "ValueError: could not broadcast input array from shape (7) into shape (8)"

I believe my lack of understanding of array indexing is at fault here. Or is there a better way to achieve this?

Thanks in advance!

1 Answer

First use numpy.any to find the rows that contain outliers, then discard them.

import numpy as np

data = np.array(
    [
        [1, 1, 1, 1, 1, 1],
        [2, 1, 2, 1, 2, 3],
        [1, 3, 1, 2, 2, 2],
        [2, 2, 3, 1, 3, 1],
        [1, 1, 1, 10, 10, 3],
        [1, 2, 3, 2, 1, 3],
        [10, 2, 3, 1, 3, 10],
        [2, 2, 1, 3, 2, 1],
    ]
)

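# Flag every row that has at least one value above a fixed, global threshold.
threshold = 5
has_outlier = np.any(data > threshold, axis=1)
# Keep only the rows without outliers.
data = data[~has_outlier]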
1 Comment

Thanks! Your answer made me realize that my problem is a bit more specific. I've updated the example to reflect a key issue: the row with index 1 now has values that are outliers relative to the entire array but not within their own column. This is why I started with a for-loop over the columns, to compute the mean and standard deviation of each one and flag the rows that contain values beyond that threshold.
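
Following up on the comment above, here is a minimal sketch of a per-column variant. It is not the answerer's method. It assumes the updated example data from the question and the same rule as the original loop (a value counts as an outlier when it lies 2 or more standard deviations from its own column's mean). The names col_mean, col_std, keep_row and filtered are only illustrative.

```python
import numpy as np

# Updated example data from the question.
data = np.array([(1, 18, 1, 1, 1, 1),
                 (1, 18, 2, 3, 2, 1),
                 (1, 22, 1, 2, 2, 2),
                 (2, 22, 3, 1, 3, 1),
                 (1, 19, 1, 10, 10, 3),
                 (1, 22, 3, 2, 1, 3),
                 (10, 20, 3, 1, 3, 10),
                 (2, 20, 1, 3, 2, 1)])

max_deviations = 2

# One mean and one standard deviation per column (axis=0),
# computed once on the full array.
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Boolean array with the same shape as data: True where a value is
# within max_deviations standard deviations of its own column's mean.
not_outlier = np.abs(data - col_mean) < max_deviations * col_std

# Keep a row only if none of its columns flagged it.
keep_row = not_outlier.all(axis=1)
filtered = data[keep_row]

print(np.flatnonzero(~keep_row))  # indices of the dropped rows
print(filtered.shape)
```

The original loop fails because data[:,i][not_outlier] is shorter than the column it is assigned to. A 2-D array cannot have columns of different lengths, so the sketch builds a row mask and indexes the whole array once. Two things to keep in mind: the statistics are not recomputed after rows are dropped, and a constant column has a standard deviation of 0, so with the strict < comparison every row would be flagged for it.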
