42

I am trying to filter a df using several Boolean variables that are a part of the df, but have been unable to do so.

Sample data:

A | B | C | D
John Doe | 45 | True | False
Jane Smith | 32 | False | False
Alan Holmes | 55 | False | True
Eric Lamar | 29 | True | True

The dtype for columns C and D is Boolean. I want to create a new df (df1) with only the rows where either C or D is True. It should look like this:

A | B | C | D
John Doe | 45 | True | False
Alan Holmes | 55 | False | True
Eric Lamar | 29 | True | True

I've tried something like this, which faces issues because it cant handle the Boolean type:

df1 = df[(df['C']=='True') or (df['D']=='True')]

Any ideas?

1
  • 4
    the bool type should be referenced unquoted unless it's stored as a string Commented Sep 13, 2017 at 22:08

5 Answers 5

82
In [82]: d
Out[82]:
             A   B      C      D
0     John Doe  45   True  False
1   Jane Smith  32  False  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 1:

In [83]: d.loc[d.C | d.D]
Out[83]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 2:

In [94]: d[d[['C','D']].any(1)]
Out[94]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 3:

In [95]: d.query("C or D")
Out[95]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

PS If you change your solution to:

df[(df['C']==True) | (df['D']==True)]

it'll work too

Pandas docs - boolean indexing


The following example shows why we should NOT use "PEP complaint" df["col_name"] is True instead of df["col_name"] == True (issue mentioned in comments).

In [11]: df = pd.DataFrame({"col":[True, True, True]})

In [12]: df
Out[12]:
    col
0  True
1  True
2  True

In [13]: df["col"] is True
Out[13]: False               # <----- oops, that's not exactly what we wanted
Sign up to request clarification or add additional context in comments.

9 Comments

Man, I think most of questioners were spoiled by us ...:) Nice solution
Wow so much simpler than I had expected, thank you! I ended up using solution 3 because I actually had 4 boolean variables in my actual dataset and that one was the neatest - worked like a charm! I didn't realize that bools worked like that, i.e. that I didn't to define the content of the bool (1/0, True/False) and that it automatically assumes True. Thanks again!
df[(df.var == True)].count() fires E712 (use is True instead of ==) for me
@kev, thank you for your advise, but (df['C'] is True) will always return False, independently from the data. I don't think this will improve my answer ;)
@kev, i wouldn't intentionally break the code (df["col_name"] is True - produces a wrong result) just to be PEP complaint. I have added an example in the answer. BTW there are three options in my answer and all of them are PEP complaint ;)
|
15

Hooray! More options!

np.where

df[np.where(df.C | df.D, True, False)]

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True  

pd.Series.where on df.index

df.loc[df.index.where(df.C | df.D).dropna()]

               A   B      C      D
0.0     John Doe  45   True  False
2.0  Alan Holmes  55  False   True
3.0   Eric Lamar  29   True   True

df.select_dtypes

df[df.select_dtypes([bool]).any(1)]   

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Abusing np.select

df.iloc[np.select([df.C | df.D], [df.index])].drop_duplicates()

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

2 Comments

@Wen Most of these solutions suck, but I had fun finding them. I think np.where is the best one.
Cool! I'd never think of "np.select" +1 :)
5

Or

d[d.eval('C or D')]

Out[1065]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

1 Comment

Eval, this was my first option! (had to look for another one after this)
1

So, the easiest way to do this:

students = [ ('jack1', 'Apples1' , 341) ,
             ('Riti1', 'Mangos1'  , 311) ,
             ('Aadi1', 'Grapes1' , 301) ,
             ('Sonia1', 'Apples1', 321) ,
             ('Lucy1', 'Mangos1'  , 331) ,
             ('Mike1', 'Apples1' , 351),
              ('Mik', 'Apples1' , np.nan)
              ]
#Create a DataFrame object
df = pd.DataFrame(students, columns = ['Name1' , 'Product1', 'Sale1']) 
print(df)


    Name1 Product1  Sale1
0   jack1  Apples1    341
1   Riti1  Mangos1    311
2   Aadi1  Grapes1    301
3  Sonia1  Apples1    321
4   Lucy1  Mangos1    331
5   Mike1  Apples1    351
6     Mik  Apples1    NaN

# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’,
subset = df[df['Product1'] == 'Apples1']
print(subset)

 Name1 Product1  Sale1
0   jack1  Apples1    341
3  Sonia1  Apples1    321
5   Mike1  Apples1    351
6     Mik  Apples1    NA

# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND notnull value in Sale

subsetx= df[(df['Product1'] == "Apples1")  & (df['Sale1'].notnull())]
print(subsetx)
    Name1   Product1    Sale1
0   jack1   Apples1      341
3   Sonia1  Apples1      321
5   Mike1   Apples1      351

# Select rows in above DataFrame for which ‘Product’ column contains the value ‘Apples’, AND Sale = 351

subsetx= df[(df['Product1'] == "Apples1")  & (df['Sale1'] == 351)]
print(subsetx)

   Name1 Product1  Sale1
5  Mike1  Apples1    351

# Another example
subsetData = df[df['Product1'].isin(['Mangos1', 'Grapes1']) ]
print(subsetData)

Name1 Product1  Sale1
1  Riti1  Mangos1    311
2  Aadi1  Grapes1    301
4  Lucy1  Mangos1    331

Here is the source of this code: https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/
I added minor changes to it.

Comments

-2

you could try this easily:

df1 = df[(df['C']=='True') | (df['D']=='True')]

Note:

  1. The or logical operator needs to be replaced by the bitwise | operator.
  2. Ensure that () are used to enclose each of the operands.

1 Comment

This is similar to answers already given half a year before and doesn't work. Please try out on the example data the next time.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.