How to filter a pandas DataFrame by string?

Question

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv(/...csv)

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

Can you add a data sample? It should work fine.
– jezrael, Commented Dec 29, 2017 at 9:30
In future I'd recommend using the name df instead of data when referring to dataframes. That notation is the common convention on SO.
– Anton vBR, Commented Dec 29, 2017 at 9:56

jezrael · Accepted Answer · 2017-12-29 09:57:36Z

13

It seems you need the flags parameter of str.contains:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution, thanks to Anton vBR, is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL
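Two further options of str.contains may be worth knowing here: case=False ignores case without importing re, and a regex word boundary (\b) restricts the match to the whole word rather than any substring. The following is only a sketch; the fourth row is a hypothetical value added to show the difference, not part of the original data.

```python
import pandas as pd

# Sample rows from the example above, plus one hypothetical row
# ('accidentally listed') that contains 'dental' only as a substring
df = pd.DataFrame({'BusinessDescription': ['dental fluss', 'DENTAL',
                                           'Dentist', 'accidentally listed']})

# case=False ignores case, so neither re nor .str.lower() is needed
substring_match = df[df['BusinessDescription'].str.contains('dental', case=False)]

# \b anchors the pattern to word boundaries, so 'accidentally' no longer matches
whole_word_match = df[df['BusinessDescription'].str.contains(r'\bdental\b', case=False)]

print(substring_match)
print(whole_word_match)
```

Which of the two is appropriate depends on whether "contains the word" is meant literally or as "contains the text".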

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
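Since the timings depend on the data, it may be more useful to measure on your own DataFrame than to rely on the numbers above. Here is a minimal sketch using the standard timeit module outside IPython; the helper names with_flag and with_lower are made up for illustration, and the sample data is the same repeated three-row frame used in the timings.

```python
import re
import timeit

import pandas as pd

d = dict(BusinessDescription=['dental fluss', 'DENTAL', 'Dentist'])
data = pd.concat([pd.DataFrame(d)] * 10000).reset_index(drop=True)


def with_flag():
    # Case-insensitive match through the regex flag
    return data[data['BusinessDescription'].str.contains('dental', flags=re.IGNORECASE)]


def with_lower():
    # Lowercase the column first, then do a plain match
    return data[data['BusinessDescription'].str.lower().str.contains('dental')]


# Check that both approaches select the same rows before comparing speed
same_rows = with_flag().equals(with_lower())

for fn in (with_flag, with_lower):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{fn.__name__}: {seconds * 1000:.1f} ms per call')
```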

edited Dec 29, 2017 at 9:57

answered Dec 29, 2017 at 9:47

jezrael


3 Comments

Anton vBR Over a year ago

If case is the issue, we could add .lower() after .str, right? No need for the re module?

eh2699 Over a year ago

yes, I had done the import re before, but the ignorecase flag worked perfectly, thank you!

jezrael Over a year ago

I was adding it together with timings, but you were faster ;)

Nephilim · Answer · 2017-12-29 09:32:28Z

6

Keep the string enclosed in quotes.

df[df['col'].str.contains('test')]

Thanks
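To illustrate the point about quotes: df['col'] looks up the column literally named col, while df[col] only works if col is a variable that holds a column label. A small sketch with made-up sample values:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({'col': ['test one', 'other', 'a test']})

# Quoted: 'col' is the literal column label
by_literal = df[df['col'].str.contains('test')]

# Unquoted: col must be a variable holding the label,
# otherwise Python raises NameError before pandas is even involved
col = 'col'
by_variable = df[df[col].str.contains('test')]

print(by_literal)
```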

answered Dec 29, 2017 at 9:32

Nephilim


Jimmys · Answer · 2017-12-29 09:34:49Z

-2

It also works if you add an explicit condition:

df[df['col'].str.contains('test') == True]

answered Dec 29, 2017 at 9:34

Jimmys


2 Comments

eh2699 Over a year ago

Yes, I tried this as well, but it still returns an empty dataframe with only the column headings. I wonder if it has something to do with the data type?

Anton vBR Over a year ago

Downvoted because conditions in Python are already evaluated for truth, which makes == True redundant.
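On the data-type question raised in the comment above: the original CSV is not shown, so this is only a guess at something worth ruling out. If the column holds missing values, str.contains can return missing entries in the mask, and passing na=False treats those rows as non-matches so the mask is purely boolean. A sketch with hypothetical sample rows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the asker's real data is not available
df = pd.DataFrame({'BusinessDescription': ['Dental clinic', np.nan,
                                           'Bakery', 'dental lab']})

# na=False counts missing values as "no match";
# case=False makes the comparison case-insensitive
mask = df['BusinessDescription'].str.contains('dental', case=False, na=False)
filtered = df[mask]

print(filtered)
```

Checking df['BusinessDescription'].dtype and df['BusinessDescription'].isna().sum() first should show whether this applies to a given dataset.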
