How to iterate through numpy array and remove anomalies?

Question

I am a beginner with Python and programming in general. I am trying to write a program that iterates through a specific numpy array and detects anomalies within the dataset (an anomaly is any point that lies more than 3 standard deviations from the mean calculated WITHOUT that data point). I need to recalculate the mean and standard deviation each time an anomalous data point is removed.

I have written the below code, but noticed a couple of issues. After the loop is iterated through once, it states that the value of 160 is removed, but when I print new_array, I still see 160 in the array.

Also, how could I recalculate the new mean for each time a data point is removed? I feel like something is just positioned incorrectly within the for loop. And finally is my use of continue correct or should it be placed elsewhere?

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        new_array = np.delete(data_array, cell)
        continue

The problem with your current code (even if you fix the errors) is that values checked earlier may fall outside the new margin once an entry has been deleted, so your code may not always return correct results. I would advise doing this with a while loop. As long as there is an outlier outside the 3*std margin:
1. Find the outlier that is farthest from the mean and delete it.
2. Calculate the new mean and std.
– markuscosinus, Commented Feb 25, 2019 at 12:53
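
The while loop described in this comment could be sketched roughly as follows. This is only one possible reading of it: the function name remove_outliers_iteratively, the n_sd parameter and the size guard are assumptions made for illustration, not part of the original comment.

```python
import numpy as np

def remove_outliers_iteratively(values, n_sd=3.0):
    """Drop the point farthest from the mean, one at a time, for as long as
    that point lies more than n_sd standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    while values.size > 2:  # assumption: stop before the array gets trivially small
        mean = values.mean()
        sd = values.std()
        distances = np.abs(values - mean)
        worst = np.argmax(distances)        # index of the farthest point
        if distances[worst] <= n_sd * sd:
            break                           # nothing left outside the margin
        print(f"{values[worst]} has been removed.")
        values = np.delete(values, worst)   # np.delete returns a new array
    return values

data_array = np.array([
    99.5697438, 94.47019021, 55., 106.86672855,
    102.78730151, 131.85777845, 88.25376895, 96.94439838,
    83.67782174, 115.57993209, 118.97651966, 94.40479467,
    79.63342207, 77.88602065, 96.59145004, 99.50145353,
    97.25980235, 87.72010069, 101.30597215, 87.3110369,
    110.0687946, 104.71504012, 89.34719772, 160.,
    110.61519268, 112.94716398, 104.41867586])

cleaned = remove_outliers_iteratively(data_array)
```

Because the mean and standard deviation are recomputed after every deletion, each remaining point is always judged against the current margin rather than a stale one.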

damagedCoda · Accepted Answer · 2019-02-25 12:49:33Z

I think you should look at the NumPy documentation for numpy.delete. Its first lines say that, for a one-dimensional array, it returns those entries not returned by arr[obj], which means numpy.delete() works on indices, not values. I suggest you edit your code to find the index of that cell and pass the index to np.delete().

Following is the edited code:

import numpy as np

data_array = np.array([99.5697438, 94.47019021, 55.0, 106.86672855, 102.78730151, 131.85777845, 88.25376895, 96.94439838, 83.67782174, 115.57993209, 118.97651966, 94.40479467, 79.63342207, 77.88602065, 96.59145004, 99.50145353, 97.25980235, 87.72010069, 101.30597215, 87.3110369, 110.0687946, 104.71504012, 89.34719772, 160.0, 110.61519268, 112.94716398, 104.41867586])
print(data_array)
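for cell in data_array:
    # The bounds are computed from data_array, which this loop never modifies.
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # np.delete expects indices, so look up where the value occurs.
        index = np.where(data_array == cell)[0]
        # np.delete returns a new array; data_array itself is unchanged.
        new_array = np.delete(data_array, index)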
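
To see the index-based behaviour in isolation, here is a small sketch on a made-up four-element array (the array and the variable names are illustrative only):

```python
import numpy as np

arr = np.array([10.0, 20.0, 30.0, 40.0])

# np.delete takes positions, so this removes the element at index 2 (30.0).
without_third = np.delete(arr, 2)

# To remove by value, turn the value into indices first.
idx = np.flatnonzero(arr == 40.0)
without_forty = np.delete(arr, idx)

# np.delete never modifies its input; arr still has all four elements.
print(arr)
print(without_third)
print(without_forty)
```

As for the continue in the question's loop: it is the last statement of the loop body, so it has no effect and can simply be dropped.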

score 1 · Accepted Answer · 2019-02-25 12:54:05Z

As @damagedCoda says, your main error is passing the value instead of the index, but you will run into a new problem if you recalculate lower_anomaly_point and upper_anomaly_point inside the loop. So I recommend using np.where to solve your task:

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)

data_array = data_array[
    np.where(
        (upper_anomaly_point > data_array) & (data_array > lower_anomaly_point)
    )]

The result is:

array([ 99.5697438 ,  94.47019021,  55.        , 106.86672855,
       102.78730151, 131.85777845,  88.25376895,  96.94439838,
        83.67782174, 115.57993209, 118.97651966,  94.40479467,
        79.63342207,  77.88602065,  96.59145004,  99.50145353,
        97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
       110.0687946 , 104.71504012,  89.34719772, 110.61519268,
       112.94716398, 104.41867586])
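
Note that the bounds above are computed from the full array, including the point being tested. The question defines an anomaly relative to the mean and standard deviation without that point. A rough leave-one-out sketch of that reading might look like the following. The function name leave_one_out_mask and the sample values are made up for illustration, and this is a single pass, not the repeated removal the question also asks for.

```python
import numpy as np

def leave_one_out_mask(values, n_sd=3.0):
    """Flag each point that lies more than n_sd standard deviations
    from the mean of all the other points."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(values.shape, dtype=bool)
    for i in range(values.size):
        rest = np.delete(values, i)   # every point except values[i]
        flags[i] = abs(values[i] - rest.mean()) > n_sd * rest.std()
    return flags

# Hypothetical sample: seven values near 10 and one obvious outlier.
sample = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 100.0])
mask = leave_one_out_mask(sample)
cleaned = sample[~mask]   # keep only the points that were not flagged
print(cleaned)
```

Excluding the tested point makes the check stricter, because a large outlier no longer inflates the standard deviation it is measured against.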

Michał Piotr Stankiewicz · Accepted Answer · 2019-02-25 12:45:54Z

That code fails for me. data_array does not change: np.delete returns a new array and leaves the old one untouched. You do not use new_array anywhere in the code; you probably wanted to calculate the mean from new_array. The second argument to delete should be an index (it "indicates which subarray to remove"), so you cannot pass cell.

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
new_array = data_array.copy()
k = 0

for i, cell in enumerate(data_array):
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # i indexes data_array; subtract k because new_array has already lost k earlier elements.
        new_array = np.delete(new_array, i - k)
        k += 1

new_array is now data_array without 160, as you wanted.
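
The running offset k is needed only because elements are deleted one at a time. A possible alternative, sketched here with the same fixed bounds, is to collect all outlier positions first and delete them in a single call. The names is_outlier and outlier_indices are illustrative.

```python
import numpy as np

data_array = np.array([
    99.5697438, 94.47019021, 55., 106.86672855,
    102.78730151, 131.85777845, 88.25376895, 96.94439838,
    83.67782174, 115.57993209, 118.97651966, 94.40479467,
    79.63342207, 77.88602065, 96.59145004, 99.50145353,
    97.25980235, 87.72010069, 101.30597215, 87.3110369,
    110.0687946, 104.71504012, 89.34719772, 160.,
    110.61519268, 112.94716398, 104.41867586])

mean = data_array.mean()
sd = data_array.std()

# True wherever a point lies more than 3 standard deviations from the mean.
is_outlier = np.abs(data_array - mean) > 3 * sd

# Positions of all outliers at once, so no offset bookkeeping is required.
outlier_indices = np.flatnonzero(is_outlier)
for value in data_array[outlier_indices]:
    print(f"{value} has been removed.")

# np.delete accepts an array of indices and returns a new array.
new_array = np.delete(data_array, outlier_indices)
```

Like the loop above, this computes the bounds once from the full array, so it does not recalculate the mean after a removal.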
