5

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv('/...csv')

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

3
  • Can you add a data sample? Because it should work fine. Commented Dec 29, 2017 at 9:30
  • I just edited the original post to include more detail! Commented Dec 29, 2017 at 9:44
  • For future code I'd recommend using the name df instead of data when referring to DataFrames. It is the common convention on SO. Commented Dec 29, 2017 at 9:56

3 Answers

13

It seems you need the flags parameter in contains, because the match is case sensitive by default:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution (thanks, Anton vBR) is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:
For future code I'd recommend using the name df instead of data when referring to DataFrames. It is the common convention on SO.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL
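
As a hedged alternative sketch (not taken from the answers here): str.contains also accepts a case parameter, so case=False should give the same case-insensitive match without importing re, and na=False treats missing descriptions as non-matches. The sample rows, including the None entry, are made up for illustration and are not the asker's CSV.

```python
import pandas as pd

# Illustrative data (an assumption); None stands in for a missing description.
df = pd.DataFrame({'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist', None]})

# case=False ignores case; na=False counts missing values as "no match",
# so the mask is a plain boolean Series that is safe to index with.
mask = df['BusinessDescription'].str.contains('dental', case=False, na=False)
filtered = df[mask]
print(filtered)
```

If the real column contains missing values, passing na=False is one way to keep the boolean mask usable for indexing.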

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
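
To check this on your own data outside IPython, here is a minimal sketch using the standard timeit module. The helper names with_flag and with_lower are hypothetical, and the frame is the same synthetic one used in the timings; absolute numbers will differ by machine and pandas version.

```python
import re
import timeit

import pandas as pd

# Same synthetic frame as in the timings: 3 rows repeated 10000 times.
d = dict(BusinessDescription=['dental fluss', 'DENTAL', 'Dentist'])
data = pd.concat([pd.DataFrame(d)] * 10000).reset_index(drop=True)


def with_flag():
    # Case-insensitive regex match via re.IGNORECASE.
    return data[data['BusinessDescription'].str.contains('dental', flags=re.IGNORECASE)]


def with_lower():
    # Lowercase the column first, then do a case-sensitive match.
    return data[data['BusinessDescription'].str.lower().str.contains('dental')]


# Both approaches should select the same rows before comparing speed.
assert with_flag().equals(with_lower())

for func in (with_flag, with_lower):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1000:.1f} ms per call')
```

Swapping in your own DataFrame for data shows whether the difference matters for your case.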

3 Comments

if case is the issue we could add .lower() after str. right? No need for module?
yes, I had done the import re before, but the ignorecase flag worked perfectly, thank you!
I was doing it with timings, but you were faster ;)
6

Keep the string enclosed in quotes.

df[df['col'].str.contains('test')]

Thanks

-2

It also works if you add an explicit condition:

df[df['col'].str.contains('test') == True]

2 Comments

yes i tried this as well, but still returns an empty dataframe with col headings... i wonder if it has something to do with the data type?
Downvotes because conditions in Python are evaluated against == True already making this redundant.
