import pandas as pd
data = {
    'side': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
    'price': [10400, 10400, 10400, 10380, 1041, 10400, 1041, 10400, 10399, 10399],
    'b_d100_v': [1, 1, 1, 0.3, 10, 10, 10, 10, 9, 9],
    'b_d100_p': [10390, 10391, 10390, 10390, 10390.5, 10385, 10385, 10386, 10387, 10387],
    'a_d052_v': [11, 11, 11, 9.3, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],
    'a_d052_p': [10399, 10403, 10400, 10401, 1041, 1041, 10400, 10400, 10402, 10404],
}
df = pd.DataFrame(data, index=[101, 102, 102, 104, 105, 106, 107, 107, 107, 107])
print(df)
    side  price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400       1.0   10390.0      11.0     10399
102    a  10400       1.0   10391.0      11.0     10403
102    a  10400       1.0   10390.0      11.0     10400
104    b  10380       0.3   10390.0       9.3     10401
105    a   1041      10.0   10390.5       0.1      1041  # Row to delete (outlier 1041 in the 'price' and 'a_d052_p' columns)
106    a  10400      10.0   10385.0       0.1      1041  # Row to delete (outlier 1041 in the 'a_d052_p' column)
107    a   1041      10.0   10385.0       0.1     10400  # Row to delete (outlier 1041 in the 'price' column)
107    b  10400      10.0   10386.0       0.1     10400
107    b  10399       9.0   10387.0       0.2     10402
107    a  10399       9.0   10387.0       0.3     10404
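I want to delete the rows that contain an outlier, checking only the 'price', 'b_d100_p' and 'a_d052_p' columns. To do this, I chose a condition based on the standard deviation: a value counts as an outlier when it lies more than a given number of standard deviations from its column mean. Here is the code I tried: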
row_with_potential_outliers = ['price', 'b_d100_p', 'a_d052_p']
df = df[abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5] # The value '1.5' here is arbitrary, and does not matter too much
print(df)
    side    price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101  NaN  10400.0       NaN   10390.0       NaN   10399.0
102  NaN  10400.0       NaN   10391.0       NaN   10403.0
102  NaN  10400.0       NaN   10390.0       NaN   10400.0
104  NaN  10380.0       NaN   10390.0       NaN   10401.0
105  NaN      NaN       NaN   10390.5       NaN       NaN  # Row to delete (outlier 1041 in the 'price' and 'a_d052_p' columns)
106  NaN  10400.0       NaN   10385.0       NaN       NaN  # Row to delete (outlier 1041 in the 'a_d052_p' column)
107  NaN      NaN       NaN   10385.0       NaN   10400.0  # Row to delete (outlier 1041 in the 'price' column)
107  NaN  10400.0       NaN   10386.0       NaN   10400.0
107  NaN  10399.0       NaN   10387.0       NaN   10402.0
107  NaN  10399.0       NaN   10387.0       NaN   10404.0
The outliers are correctly replaced by NaN, but so is every value in the columns I did not test. How can I keep the original values of the 'side', 'b_d100_v' and 'a_d052_v' columns? That would let me apply 'dropna()' afterwards to remove the flagged rows. Or maybe there is a better solution? I work with DataFrames that have hundreds of columns and thousands of rows, so I want to avoid iteration as much as possible for performance reasons. Thanks in advance for your help.
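
For reference, here is a minimal sketch of one vectorized way this might be done. It is not necessarily the best approach. Indexing a DataFrame with a boolean DataFrame behaves like 'where()': cells that are False, or that belong to columns missing from the mask, become NaN, which is why the untested columns are lost. Reducing the mask to one boolean per row with 'all(axis=1)' selects whole rows instead, so no 'dropna()' is needed. The names 'cols', 'z', 'keep' and 'filtered' are hypothetical, and the 1.5 threshold is the same arbitrary value as in the attempt above.

```python
import pandas as pd

data = {
    'side': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
    'price': [10400, 10400, 10400, 10380, 1041, 10400, 1041, 10400, 10399, 10399],
    'b_d100_v': [1, 1, 1, 0.3, 10, 10, 10, 10, 9, 9],
    'b_d100_p': [10390, 10391, 10390, 10390, 10390.5, 10385, 10385, 10386, 10387, 10387],
    'a_d052_v': [11, 11, 11, 9.3, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],
    'a_d052_p': [10399, 10403, 10400, 10401, 1041, 1041, 10400, 10400, 10402, 10404],
}
df = pd.DataFrame(data, index=[101, 102, 102, 104, 105, 106, 107, 107, 107, 107])

# Columns to test for outliers (hypothetical name)
cols = ['price', 'b_d100_p', 'a_d052_p']

# Column-wise z-scores, computed only on the tested columns
z = (df[cols] - df[cols].mean()) / df[cols].std()

# One boolean per row: True only if every tested column is within the threshold
keep = (z.abs() < 1.5).all(axis=1)

# A boolean Series selects whole rows, so the other columns keep their values
filtered = df[keep]
print(filtered)
```

Because 'keep' is a positional boolean mask with the same length as the DataFrame, the duplicated index labels (102 and 107) should not cause a problem here. Everything is computed column-wise by pandas, with no explicit Python loop over the rows.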