Filtering pandas dataframe rows based on a multiple column condition

Question

import pandas as pd

data = {'side':['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
        'price':[10400, 10400, 10400, 10380, 1041, 10400, 1041, 10400, 10399, 10399],
        'b_d100_v':[1, 1, 1, 0.3, 10, 10, 10, 10, 9, 9],
        'b_d100_p':[10390, 10391, 10390, 10390, 10390.5, 10385, 10385, 10386, 10387, 10387],
        'a_d052_v':[11, 11, 11, 9.3, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3],
        'a_d052_p':[10399, 10403, 10400, 10401, 1041, 1041, 10400, 10400, 10402, 10404]
        }
df = pd.DataFrame(data, index=[101, 102, 102, 104, 105, 106, 107, 107, 107, 107])

print(df)
    side  price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400       1.0   10390.0      11.0     10399
102    a  10400       1.0   10391.0      11.0     10403
102    a  10400       1.0   10390.0      11.0     10400
104    b  10380       0.3   10390.0       9.3     10401
105    a   1041      10.0   10390.5       0.1      1041  # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106    a  10400      10.0   10385.0       0.1      1041  # Row to delete (outlier 1041 on 'a_d052_p' column)
107    a   1041      10.0   10385.0       0.1     10400  # Row to delete (outlier 1041 on 'price' column)
107    b  10400      10.0   10386.0       0.1     10400
107    b  10399       9.0   10387.0       0.2     10402
107    a  10399       9.0   10387.0       0.3     10404

I want to delete the rows containing outliers only on the 'price', 'b_d100_p' and 'a_d052_p' columns. To do this, I chose to use a condition based on the standard deviation. Here is the code I tried.

row_with_potential_outliers = ['price', 'b_d100_p', 'a_d052_p']
df = df[abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5]  # The value '1.5' here is arbitrary, and does not matter too much

print(df)
    side    price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101  NaN  10400.0       NaN   10390.0       NaN   10399.0
102  NaN  10400.0       NaN   10391.0       NaN   10403.0
102  NaN  10400.0       NaN   10390.0       NaN   10400.0
104  NaN  10380.0       NaN   10390.0       NaN   10401.0
105  NaN      NaN       NaN   10390.5       NaN       NaN  # Row to delete (outlier 1041 on 'price' and 'a_d052_p' columns)
106  NaN  10400.0       NaN   10385.0       NaN       NaN  # Row to delete (outlier 1041 on 'a_d052_p' column)
107  NaN      NaN       NaN   10385.0       NaN   10400.0  # Row to delete (outlier 1041 on 'price' column)
107  NaN  10400.0       NaN   10386.0       NaN   10400.0
107  NaN  10399.0       NaN   10387.0       NaN   10402.0
107  NaN  10399.0       NaN   10387.0       NaN   10404.0

How to keep the original values of the columns 'side', 'b_d100_v' and 'a_d052_v'? This will allow me to then apply a 'dropna()' to achieve my ends... Or maybe there is a better solution ? I work with dataframes with hundreds of columns and thousands of rows, so I want to avoid iterations as much as possible for performance reasons. Thanks in advance for your help.

Valdi_Bo · Accepted Answer · 2020-09-26 14:25:01Z

First, to use proper names, set the name of your list with columns to check just as cols_to_check:

cols_to_check = ['price', 'b_d100_p', 'a_d052_p']

Note that rows with outliers are not found yet. You specify only which columns to check for outliers.

Then define a function to look for outliers in particular column, with the given threshold:

def isOutlier(col, thr):
    return ((col - col.mean()) / col.std()).abs() > thr

And to drop rows with outliers, run:

df = df[~df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)]

The result is:

    side  price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400       1.0   10390.0      11.0     10399
102    a  10400       1.0   10391.0      11.0     10403
102    a  10400       1.0   10390.0      11.0     10400
104    b  10380       0.3   10390.0       9.3     10401
107    b  10400      10.0   10386.0       0.1     10400
107    b  10399       9.0   10387.0       0.2     10402
107    a  10399       9.0   10387.0       0.3     10404

To fully comprehend how this code works:

To see the full result of check whether any element is an outlier, within its column, run:

 df[cols_to_check].apply(isOutlier, thr=1.5)

You apply isOutlier function to each column from cols_to_check list, passing the chosen value of the threshold. The result is:

      price  b_d100_p  a_d052_p
 101  False     False     False
 102  False     False     False
 102  False     False     False
 104  False     False     False
 105   True     False      True
 106  False     False      True
 107   True     False     False
 107  False     False     False
 107  False     False     False
 107  False     False     False

To get the cumulative result for each row (whether any element in a row is True), run:

 df[cols_to_check].apply(isOutlier, thr=1.5).any(axis=1)

The result is:

 101    False
 102    False
 102    False
 104    False
 105     True
 106     True
 107     True
 107    False
 107    False
 107    False
 dtype: bool

So if you run df[~...] you will get rows where the above condition is False and just this result should be saved back in df.

Note that you want to delete the rows containing al least one outlier, not change the outlier value to NaN (as it was the result of your code).

Can Cakiroglu · Accepted Answer · 2020-09-26 13:48:42Z

1

To make your method give the desired output, you just need to apply the outlier removal to the specified columns of the dataframe df

df[row_with_potential_outliers] = df[row_with_potential_outliers][abs((df[row_with_potential_outliers] - df[row_with_potential_outliers].mean()) / df[row_with_potential_outliers].std()) < 1.5]

    side    price  b_d100_v  b_d100_p  a_d052_v  a_d052_p
101    a  10400.0       1.0   10390.0      11.0   10399.0
102    a  10400.0       1.0   10391.0      11.0   10403.0
102    a  10400.0       1.0   10390.0      11.0   10400.0
104    b  10380.0       0.3   10390.0       9.3   10401.0
105    a      NaN      10.0   10390.5       0.1       NaN
106    a  10400.0      10.0   10385.0       0.1       NaN
107    a      NaN      10.0   10385.0       0.1   10400.0
107    b  10400.0      10.0   10386.0       0.1   10400.0
107    b  10399.0       9.0   10387.0       0.2   10402.0
107    a  10399.0       9.0   10387.0       0.3   10404.0

answered Sep 26, 2020 at 13:48

Can Cakiroglu

563 bronze badges

1 Comment

David Over a year ago

This solution works, but I realize that it has a limit. It should not be that the original dataframe already has 'NaN' values. For this reason, I prefer Valdi_Bo's solution.

Collectives™ on Stack Overflow

Filtering pandas dataframe rows based on a multiple column condition

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related