3

I have a dataframe with 472 columns, 99 of which are dxpoa1, dxpoa2, ..., dxpoa99. I want to remove the rows in which the dxpoa columns contain only 7, N, or blank values. The dxpoa columns can hold many values such as Y, W, E, 1, 7, or N, or they can be blank. Only rows whose dxpoa values consist solely of 7, N, or blanks should be removed from the dataframe. The dataset is huge, with many hundreds of thousands of rows, so an efficient method would be appreciated.

    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X       
1   Z  W  2      7      7             
2   7  W  N      W      W      1      Z
3   1  7  E      N      N      N      N
4   Y     0      W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
9   N  A  Z      N      N      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

I want rows 1, 3, and 9 removed from the dataframe.

I have tried many approaches, such as:

df_col = [c for c in df.columns if c.startswith('dxpoa')]  # list of dxpoa column names
df1 = df[df_col].isin(["Y", "W", "1", "E"]).values

It does not filter out the rows.

  • Do you want to remove rows for which any of the dxpoa columns contain '7' or 'N' or do you want to remove only those rows for which all of those columns contain '7' or 'N'? Commented May 5, 2016 at 17:27
  • I want to remove only those rows in which some or all of those columns contain '7' or 'N' and nothing else. If a row has a mix of 7, N, E, W, etc., or values other than 7 and N, I want to keep it. I wrote 'some' in the first sentence because not every dxpoa column necessarily has a value; many are blank. These columns are placeholders for values: in some rows only 1 or 2 have values, whereas in others all dxpoa columns do. Commented May 5, 2016 at 17:34
  • It is better to think of blanks as empty strings, '' -- a value just like '7' or 'N'. Then you can express the problem as one of removing rows for which all the values (in the dxpoa columns) are in ['7', 'N', '']. Commented May 5, 2016 at 17:41
  • I updated the dataframe you supplied with the exact scenario. I want rows 1, 3, and 9 removed from the dataframe. Commented May 5, 2016 at 17:45

3 Answers

3

UPDATE:

You can replace empty strings with NaN and then use isin to test for '7', 'N', or NaN:

In [196]: df[~df[cols].replace('',np.nan).isin(['7','N', np.nan]).all(axis=1)]
Out[196]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X
2   7  W  N      W      W      1      Z
4   Y  0  W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

OLD answer:

Show the rows containing 7 or N in any dxpoa column:

In [197]: df.loc[df[cols].isin(['7','N']).any(axis=1)]
Out[197]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X
1   Z  W  2      7      7
3   1  7  E      N      N      N      N
4   Y  0  W      N      X      1
5   N  X  1      E      1      Z      7
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
9   N  A  Z      N      N      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1

Remove the rows containing 7 or N in any dxpoa column:

In [198]: df.loc[~df[cols].isin(['7','N']).any(axis=1)]
Out[198]:
    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
2   7  W  N      W      W      1      Z
6   1  X  7      0      A      W      A
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X

Replace any with all if you want to select or exclude only the rows in which every dxpoa column contains either 7 or N.
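To make the difference between any and all concrete, here is a minimal sketch on a tiny hypothetical frame (the data and the names `hits`, `kept_any`, and `kept_all` are made up for illustration and are not part of the original answer):

```python
import pandas as pd

# Tiny hypothetical frame: two dxpoa columns, three rows.
df = pd.DataFrame({
    'a':      ['x', 'y', 'z'],
    'dxpoa1': ['7', 'W', 'E'],
    'dxpoa2': ['N', 'N', 'W'],
})
cols = df.filter(like='dxpoa').columns

hits = df[cols].isin(['7', 'N'])
any_mask = hits.any(axis=1)   # True if at least one dxpoa value is '7' or 'N'
all_mask = hits.all(axis=1)   # True only if every dxpoa value is '7' or 'N'

kept_any = df.loc[~any_mask]  # drops rows 0 and 1, keeps row 2
kept_all = df.loc[~all_mask]  # drops row 0 only, keeps rows 1 and 2
```

Row 1 mixes 'W' with 'N', so it is dropped under any but kept under all, which is the behaviour the question asks for.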

Setup:

import numpy as np
import pandas as pd

rows = 15

# pool of possible cell values; '' stands for a blank
s = [''] + list('YWE17N0AZX')
df = pd.DataFrame(np.random.choice(s, size=(rows, 7)), columns=list('abc') + ['dxpoa1', 'dxpoa2', 'dxpoa3', 'dxpoa4'])

cols = df.filter(like='dxpoa').columns

2 Comments

It removes all rows where 7 or N occurs along with other values. I want only those rows where only 7 or N occurs to be removed
@Sanoj, always glad to help
2
  • You could use df.filter(regex=r'^dxpoa') to select columns whose name starts with 'dxpoa', and
  • use .isin(['7','N','']).all(axis=1) to create a boolean mask (for the rows) which is True when all the values in the row are either '7', 'N', or the empty string, '':

For example,

import pandas as pd

df = pd.DataFrame(
    {'a': ['0','Z','7','1','Y','N','1','X','7','N','A','E','E','N','Y'],
     'b': ['A','W','W','7','','X','X','X','1','A','N','N','A','W','1'],
     'c': ['X','2','N','E','0','1','7','Z','A','Z','Z','A','1','Z','A'],
     'dxpoa1': ['W','7','W','N','W','E','0','X','N','N','7','Z','Z','E','W'],
     'dxpoa2': ['N','7','W','N','N','1','A','N','X','N','0','N','E','X','A'],
     'dxpoa3': ['X','','1','N','X','Z','W','A','Z','N','A','N','E','A','E'],
     'dxpoa4': ['','','Z','N','1','7','A','1','N','','E','1','W','0','X']})
mask = df.filter(regex=r'^dxpoa').isin(['7','N','']).all(axis=1)
print(df.loc[~mask])

yields

    a  b  c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0   0  A  X      W      N      X       
2   7  W  N      W      W      1      Z
4   Y     0      W      N      X      1
5   N  X  1      E      1      Z      7
6   1  X  7      0      A      W      A
7   X  X  Z      X      N      A      1
8   7  1  A      N      X      Z      N
10  A  N  Z      7      0      A      E
11  E  N  A      Z      N      N      1
12  E  A  1      Z      E      E      W
13  N  W  Z      E      X      A      0
14  Y  1  A      W      A      E      X
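The mask above assumes blanks are stored as empty strings. If the data is read from a file, blanks may instead arrive as NaN or as whitespace-only strings. The following is a minimal sketch under that assumption (the sample data is hypothetical), normalizing the dxpoa columns before building the same mask:

```python
import numpy as np
import pandas as pd

# Hypothetical data with blanks stored as NaN, None, or whitespace.
df = pd.DataFrame({
    'a':      ['0', 'Z', '1', 'N'],
    'dxpoa1': ['W', '7', 'N', ' N'],
    'dxpoa2': ['N', np.nan, 'N ', 'E'],
    'dxpoa3': ['', '  ', None, '7'],
})

dx = df.filter(regex=r'^dxpoa')
# Treat missing values as empty strings, then trim surrounding whitespace.
dx = dx.fillna('').apply(lambda col: col.str.strip())

# True for rows whose dxpoa values are all '7', 'N', or blank.
mask = dx.isin(['7', 'N', '']).all(axis=1)
result = df.loc[~mask]
```

Only the temporary `dx` frame is normalized; the rows kept in `result` retain their original values.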

1 Comment

This is a much more flexible solution if you have an undefined number of columns with a common name (i.e. Area_1, Area_2, Area_3,etc.). Cheers!
0

Use apply. If the applied function returns a boolean, the result can be used to filter rows, as in the example below. Note that I didn't try to reproduce your filtering logic.

def analyze_row(r):
    # do whatever you want with column values here
    # return boolean: True - row stays, False - row gone
    ret = False
    if r['dxpoa1'] == 'W':
        ret = True
    return ret

# .loc is used here because .ix has been removed from pandas
filtered_df = df.loc[df.apply(analyze_row, axis=1), :]
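For completeness, here is one way the row function could express the filtering logic from the question. This is only a sketch: the names `keep_row`, `dx_cols`, and `DROP_VALUES` are hypothetical, and the sample data is the first four rows of the question's dxpoa1 to dxpoa3 columns.

```python
import pandas as pd

df = pd.DataFrame({
    'a':      ['0', 'Z', '7', '1'],
    'dxpoa1': ['W', '7', 'W', 'N'],
    'dxpoa2': ['N', '7', 'W', 'N'],
    'dxpoa3': ['X', '', '1', 'N'],
})
dx_cols = [c for c in df.columns if c.startswith('dxpoa')]
DROP_VALUES = {'7', 'N', ''}

def keep_row(r):
    # True (row stays) if any dxpoa value falls outside DROP_VALUES
    return not set(r[dx_cols]).issubset(DROP_VALUES)

filtered_df = df.loc[df.apply(keep_row, axis=1)]
```

Because apply with axis=1 calls a Python function once per row, this is likely to be noticeably slower than a vectorized isin(...).all(axis=1) mask on hundreds of thousands of rows.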
