Filter pandas dataframe based on values in multiple columns

Question

I have a dataframe which has 472 columns. Of those 99 columns are dxpoa1, dxpoa2,...,dxpoa99. I want to filter out row(s) of dataframe in which dxpoa columns' values are either 7 or N or BLANK only. dxpoa's can have many values like Y, W,E,1, 7, N etc. Or they remain BLANK. Only those rows in which dxpoa's have either only 7 or N should be filtered out from data frame. Dataset is huge having many hundred thousands rows. Therefore an efficient method will be appreciated.

    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X       
1   Z  W  2      7      7             
2   7  W  N      W      W      1      Z
3   1  7  E      N      N      N      N
4   Y     0      W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
9   N  A  Z      N      N      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

I want row number 1, 3, 9 removed from dataframe.

I have tried many ways like:

df_col = [list of dxpoa column names]
df1 = df[df_col].isin(["Y", "W", "1", "E"]).values

It does not filter out.

Do you want to remove rows for which any of the dxpoa columns contain '7' or 'N' or do you want to remove only those rows for which all of those columns contain '7' or 'N'? — unutbu
– unutbu, Commented May 5, 2016 at 17:27
I want to remove only those rows for which some or all of those columns may contain '7' or 'N'. If Rows have a mix of 7,N, E, W etc or other than 7, N then I want to keep those rows. In the first line I wrote 'some' since it is not necessary that all dxpoa columns will have values. Many columns are blanks also. These columns are placeholders for values. In some rows only 1 or 2 will have values whereas in some all dxpoa will have values. — Sanoj
– Sanoj, Commented May 5, 2016 at 17:34
It is better to think of blanks as empty strings, '' -- a value just like '7' or 'N'. Then you can express the problem as one of removing rows for which all the values (in the dxpoa columns) are in ['7', 'N', '']. — unutbu
– unutbu, Commented May 5, 2016 at 17:41
I updated dataframe supplied by you with exact scenario. I want row 1, 3, 9 removed from dataframe. — Sanoj
– Sanoj, Commented May 5, 2016 at 17:45

MaxU - stand with Ukraine · Accepted Answer · 2017-05-06 18:53:48Z

3

UPDATE:

you can replace empty strings with NaN, 7 or N and then use isin:

In [196]: df[~df[cols].replace('',np.nan).isin(['7','N', np.nan]).all(axis=1)]
Out[196]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X
2   7  W  N      W      W      1      Z
4   Y  0  W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

OLD answer:

show those containing 7 or N

In [197]: df.loc[df[cols].isin(['7','N']).any(axis=1)]
Out[197]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X
1   Z  W  2      7      7
3   1  7  E      N      N      N      N
4   Y  0  W      N      X      1
5   N  X  1      E      1      Z      7
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
9   N  A  Z      N      N      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1

remove those containing 7 or N

In [198]: df.loc[~df[cols].isin(['7','N']).any(axis=1)]
Out[198]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
2   7  W  N      W      W      1      Z
6   1  X  7      0      A      W      A
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

replace any to all if you want to have/exclude those where all columns should contain either 7 or N

setup:

rows = 15

s = [''] + list('YWE17N0AZX')
df = pd.DataFrame(np.random.choice(s, size=(rows, 7)), columns=list('abc') + ['dxpoa1', 'dxpoa2', 'dxpoa3', 'dxpoa4'])

cols = df.filter(like='dxpoa').columns

edited May 6, 2017 at 18:53

answered May 5, 2016 at 17:31

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Sanoj Over a year ago

It removes all rows where 7 or N occurs along with other values. I want only those rows where only 7 or N occurs to be removed

MaxU - stand with Ukraine Over a year ago

@Sanoj, always glad to help

unutbu · Accepted Answer · 2016-05-05 19:38:50Z

You could use df.filter(regex=r'^dxpoa') to select columns whose name starts with 'dxpoa', and
use .isin(['7','N','']).all(axis=1) to create a boolean mask (for the rows) which is True when all the values in the row are either '7', 'N', or the empty string, '':

For example,

import pandas as pd

df = pd.DataFrame(
    {'a': ['0','Z','7','1','Y','N','1','X','7','N','A','E','E','N','Y'],
     'b': ['A','W','W','7','','X','X','X','1','A','N','N','A','W','1'],
     'c': ['X','2','N','E','0','1','7','Z','A','Z','Z','A','1','Z','A'],
     'dxpoa1': ['W','7','W','N','W','E','0','X','N','N','7','Z','Z','E','W'],
     'dxpoa2': ['N','7','W','N','N','1','A','N','X','N','0','N','E','X','A'],
     'dxpoa3': ['X','','1','N','X','Z','W','A','Z','N','A','N','E','A','E'],
     'dxpoa4': ['','','Z','N','1','7','A','1','N','','E','1','W','0','X']})
mask = df.filter(regex=r'^dxpoa').isin(['7','N','']).all(axis=1)
print(df.loc[~mask])

yields

    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X       
2   7  W  N      W      W      1      Z
4   Y     0      W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

This is a much more flexible solution if you have an undefined number of columns with a common name (i.e. Area_1, Area_2, Area_3,etc.). Cheers!

lowtech · Accepted Answer · 2016-05-05 18:06:28Z

0

Use apply. If applied function returns boolean it can be used to filter rows like in example below. Note that I didn't try to reproduce your filtering logic.

def analyze_row(r):
   # do whatever you want with column values here
   # return boolean: True - row stays, False - row gone
   ret = False
   if r['dpxoa1'] == 'W':
      ret = True
   return ret

filtered_df = df.ix[df.apply(analyze_row, axis = 1), :]

answered May 5, 2016 at 18:06

lowtech

2,6123 gold badges25 silver badges34 bronze badges

Collectives™ on Stack Overflow

Filter pandas dataframe based on values in multiple columns

3 Answers 3

2 Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related