Search for String in all Pandas DataFrame columns and filter

Question

Thought this would be straightforward, but I had some trouble tracking down an elegant way to search all columns in a DataFrame at the same time for a partial string match. Basically, how would I apply df['col1'].str.contains('^') to an entire DataFrame at once and filter down to the rows that contain the match in any column?

You want to search an entire dataframe rather than just a specific column?
– EdChum, Commented Oct 29, 2014 at 20:45
The str.contains method is only valid for Series, so you'd have to do something like for col in df: df[col].str.contains('^')
– EdChum, Commented Oct 29, 2014 at 20:48

unutbu · Accepted Answer · 2014-10-29 23:38:39Z

92

The Series.str.contains method expects a regex pattern (by default), not a literal string. Therefore str.contains("^") matches the beginning of any string. Since every string has a beginning, everything matches. Instead use str.contains("\^") to match the literal ^ character.
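To see the difference on a tiny Series (the sample values below are made up for illustration), a minimal sketch could look like this. It also shows re.escape from the standard library, which builds the escaped pattern for you:

```python
import re

import pandas as pd

# Made-up sample data, not from the question.
s = pd.Series(["a^b", "plain", "x"])

# "^" is the start-of-string anchor, so every string matches.
matches_everything = s.str.contains("^")

# Escaping the caret matches the literal character only.
matches_caret = s.str.contains(r"\^")

# re.escape produces an equivalent escaped pattern.
matches_escaped = s.str.contains(re.escape("^"))

print(matches_everything.tolist())
print(matches_caret.tolist())
```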

To check every column, you could use for col in df to iterate through the column names, and then call str.contains on each column:

import numpy as np

mask = np.column_stack([df[col].str.contains(r"\^", na=False) for col in df])
df.loc[mask.any(axis=1)]
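As a rough end-to-end sketch of the snippet above, here it is run on a made-up two-column frame (the data and column names are assumptions, not from the question). Each column yields one boolean array, column_stack glues them into a 2-D mask, and any(axis=1) keeps rows with at least one hit:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame with a missing value.
df = pd.DataFrame({
    "col1": ["a^b", "plain", None],
    "col2": ["x", "y^", "z"],
})

# One boolean column per DataFrame column; na=False turns missing values into False.
mask = np.column_stack([df[col].str.contains(r"\^", na=False) for col in df])

# Keep rows where any column contained a literal caret.
result = df.loc[mask.any(axis=1)]
print(result)
```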

Alternatively, you could pass regex=False to str.contains to make the test use the Python in operator; but (in general) using regex is faster.
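A small sketch of the regex=False route, again on made-up values: the pattern is taken as a plain substring, so characters such as ^ or + need no escaping. No timing claim is intended here.

```python
import pandas as pd

# Made-up sample data.
s = pd.Series(["a^b", "plain", "1+1"])

# With regex=False the argument is a literal substring, not a pattern.
literal_caret = s.str.contains("^", regex=False)
literal_plus = s.str.contains("+", regex=False)

print(literal_caret.tolist())
print(literal_plus.tolist())
```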

edited Oct 29, 2014 at 23:38

answered Oct 29, 2014 at 21:35

unutbu

10 Comments

propjk007 Over a year ago

Hey @unutbu, question for you. Why do you use np.column_stack when you could use pd.DataFrame(...).transpose()?

unutbu Over a year ago

When mask is a boolean NumPy array, df.loc[mask] selects rows where the mask is True. If mask is a DataFrame, however, then df.loc[mask] selects rows from df whose index value matches the index value in mask which corresponds to a True value. This alignment of indices is wonderful when you need it, but slows down performance when you don't. So in short, if you don't need the index, use a NumPy array instead of a DataFrame. Also, creating the DataFrame is significantly slower than creating the NumPy array, so there is no advantage to using pd.DataFrame([...]).T here.

Zero Over a year ago

@unutbu what do you think about mask = df.apply(lambda x: x.str.contains(r'\^', na=False)) instead of np.column_stack?

unutbu Over a year ago

@Zero: That works fine too. On the plus side, it is a bit shorter to write. On the minus side, it returns a DataFrame instead of a NumPy array. Since we are using mask for indexing, only the array values matter, not any ancillary labels. To make sure that Pandas does not do any unneeded index alignment, I tend to prefer using boolean NumPy arrays over Series for boolean indexing (though really, Pandas does the right thing, so it does not matter). In the end, I think which you use boils down to personal taste.

Owlright Over a year ago

If your df has columns of varying dtypes you need to cast df[col].astype('str') for it to work.
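To illustrate the mixed-dtype point in the last comment, here is a sketch on a hypothetical frame with one integer column. The .str accessor raises AttributeError on the numeric column, while casting with astype(str) first lets the same mask be built:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text column, one integer column.
df = pd.DataFrame({"name": ["a^b", "c"], "count": [1, 2]})

# The .str accessor is only available on string-like columns.
try:
    df["count"].str.contains(r"\^")
    raised = False
except AttributeError:
    raised = True

# Casting every column to str first avoids the error.
mask = np.column_stack(
    [df[col].astype(str).str.contains(r"\^", na=False) for col in df]
)
result = df.loc[mask.any(axis=1)]
print(raised, result.index.tolist())
```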

Kaushik NP · Accepted Answer · 2017-10-30 08:18:06Z

53

Try:

df.apply(lambda row: row.astype(str).str.contains('TEST').any(), axis=1)
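Note that the expression above returns a boolean Series with one entry per row, not the filtered frame. A minimal sketch of using it as a filter, assuming a made-up frame with mixed dtypes:

```python
import pandas as pd

# Made-up frame with text and integer columns.
df = pd.DataFrame({
    "a": ["TEST one", "x", "y"],
    "b": [1, 2, 3],
    "c": ["p", "q", "a TEST"],
})

# One True/False per row: does any cell, cast to str, contain 'TEST'?
row_mask = df.apply(lambda row: row.astype(str).str.contains("TEST").any(), axis=1)

# Index the frame with the mask to get the matching rows.
filtered = df[row_mask]
print(filtered)
```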

edited Oct 30, 2017 at 8:18

Kaushik NP

answered Oct 30, 2017 at 7:36

Puneet Sinha

2 Comments

Brad123 Over a year ago

To make your search case-independent: df.apply(lambda row: row.astype(str).str.contains('TEST'.lower(), case=False).any(), axis=1)

restrepo Over a year ago

Very slow compared with the best answer: stackoverflow.com/a/26641085/2268280

rachwa · Accepted Answer · 2022-05-14 16:23:20Z

10

Alternatively, if you need cells that equal the string exactly (rather than contain it), you can use eq and any:

df[df.eq('^').any(axis=1)]
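A short sketch contrasting eq with a substring search, on made-up data. eq compares whole cell values, so a cell like 'a^b' is not selected by it:

```python
import pandas as pd

# Made-up sample data.
df = pd.DataFrame({"a": ["^", "a^b", "x"], "b": ["y", "z", "w"]})

# eq: rows where some cell is exactly '^'.
exact = df[df.eq("^").any(axis=1)]

# Substring search for comparison (literal match, column by column).
partial = df[df.apply(lambda col: col.str.contains("^", regex=False)).any(axis=1)]

print(exact.index.tolist(), partial.index.tolist())
```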

answered May 14, 2022 at 16:23

rachwa

thorbjornwolf · Accepted Answer · 2020-09-02 11:15:57Z

9

Here's a function that does a text search across all columns of a dataframe df:

def search(regex: str, df, case=False):
    """Search all the text columns of `df`, return rows with any matches."""
    textlikes = df.select_dtypes(include=[object, "string"])
    return df[
        textlikes.apply(
            lambda column: column.str.contains(regex, regex=True, case=case, na=False)
        ).any(axis=1)
    ]

It differs from the existing answers by both staying in the pandas API and embracing that pandas is more efficient in column processing than row processing. Also, this is packed as a pure function :-)
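To show what the select_dtypes step does, here is a sketch on a hypothetical frame (names and values are made up). Only the text columns are searched, so a pattern that happens to match a number in a numeric column finds nothing:

```python
import pandas as pd

# Hypothetical frame: two text columns and one integer column.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [30, 41],
    "city": ["Oslo", "Bergen"],
})

# Same column selection as in the function above.
textlikes = df.select_dtypes(include=[object, "string"])

# Case-insensitive search for 'os': matches 'Oslo' in the first row only.
mask = textlikes.apply(
    lambda column: column.str.contains("os", regex=True, case=False, na=False)
).any(axis=1)

# '30' appears only in the numeric column, which is not searched.
num_mask = textlikes.apply(
    lambda column: column.str.contains("30", regex=True, case=False, na=False)
).any(axis=1)

print(list(textlikes.columns), mask.tolist(), num_mask.tolist())
```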

edited Sep 2, 2020 at 11:15

answered Sep 1, 2020 at 8:30

thorbjornwolf

Ciro · Accepted Answer · 2019-06-11 12:58:28Z

3

Posting my findings in case someone needs them.

I had a DataFrame (360,000 rows) and needed to search across the whole thing to find the few rows that contained the word 'TOTAL' (in any variation, e.g. 'TOTAL PRICE', 'TOTAL STEMS', etc.) and delete those rows.

I finally processed the dataframe in two steps:

FIND COLUMNS THAT CONTAIN THE WORD:

for col in df.columns:
    # cast to str so non-string columns work; print each column with a cell starting with 'TOTAL'
    if df[col].astype('str').str.startswith('TOTAL').any():
        print(col)

DELETE THE ROWS:

df[df['LENGTH/ CMS'].str.contains('TOTAL') != True]
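For comparison, the two steps can probably be collapsed into one mask over all columns. This is only a sketch on made-up data (the 'LENGTH/ CMS' name is taken from the answer, the other columns and values are assumptions), not what the answer's author ran:

```python
import pandas as pd

# Made-up frame; only the 'LENGTH/ CMS' column name comes from the answer.
df = pd.DataFrame({
    "LENGTH/ CMS": ["40", "TOTAL STEMS", "50"],
    "PRICE": [1.5, None, 2.0],
    "NOTE": ["ok", "x", "TOTAL PRICE"],
})

# Cast everything to str, then flag rows where any cell contains 'TOTAL'.
has_total = (
    df.astype(str)
    .apply(lambda col: col.str.contains("TOTAL", regex=False))
    .any(axis=1)
)

# Keep only the rows without a match.
cleaned = df[~has_total]
print(cleaned)
```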

answered Jun 11, 2019 at 12:58

Ciro

n8henrie · Accepted Answer · 2021-09-22 14:45:07Z

2

Yet another solution. This selects columns of type object, which is the dtype pandas uses for strings. Other solutions that coerce to str with .astype(str) could give false positives if you're searching for a number and want to search only string columns, excluding numeric ones (but if you do want to search numeric columns too, coercing may be the better approach).

As an added benefit, filtering the columns in this way seems to have a performance benefit; on my dataframe of shape (15807, 35), with only 17 of those 35 being strings, I see 4.74 s ± 108 ms per loop as compared to 5.72 s ± 155 ms.

df[
    df.select_dtypes(object)
    .apply(lambda row: row.str.contains("with"), axis=1)
    .any(axis=1)
]
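The lambda above runs once per row because of axis=1. As a sketch on made-up data, dropping axis=1 applies it once per column instead and gives the same mask. The include list below also names "string" so the example works with pandas' dedicated string dtype; that is my assumption, not part of the answer, and no timing is claimed:

```python
import pandas as pd

# Made-up frame with a missing value and an integer column.
df = pd.DataFrame({
    "text": ["with milk", "black", None],
    "qty": [1, 2, 3],
    "note": ["a", "tea with lemon", "b"],
})

# Assumption: also include the "string" dtype alongside object.
strings = df.select_dtypes(include=[object, "string"])

# Row-wise, as in the answer (na=False added so missing values count as no match).
row_wise = strings.apply(
    lambda row: row.str.contains("with", na=False), axis=1
).any(axis=1)

# Column-wise variant: same lambda without axis=1.
col_wise = strings.apply(lambda col: col.str.contains("with", na=False)).any(axis=1)

print(row_wise.tolist(), col_wise.tolist())
```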

answered Sep 22, 2021 at 14:45

n8henrie

Aziz Alto · Accepted Answer · 2022-05-03 22:33:02Z

Building on top of @unutbu's answer https://stackoverflow.com/a/26641085/2839786

I use something like this:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> def search(df: pd.DataFrame, substring: str, case: bool = False) -> pd.DataFrame:
...     mask = np.column_stack([df[col].astype(str).str.contains(substring, case=case, na=False) for col in df])
...     return df.loc[mask.any(axis=1)]
>>>
>>> # test
>>> df = pd.DataFrame({'col1':['hello', 'world', 'Sun'], 'col2': ['today', 'sunny', 'foo'], 'col3': ['WORLD', 'NEWS', 'bar']})
>>> df
    col1   col2   col3
0  hello  today  WORLD
1  world  sunny   NEWS
2    Sun    foo    bar
>>>
>>> search(df, 'sun')
    col1   col2  col3
1  world  sunny  NEWS
2    Sun    foo   bar

Gage Miller · Accepted Answer · 2021-08-08 19:48:52Z

1

Here is an example using applymap. I found that other answers didn't work for me, since they assumed that all data in a column would be strings, which caused AttributeError exceptions. It is also surprisingly fast.

def search(dataFrame, item):
    # applymap is deprecated since pandas 2.1 in favor of DataFrame.map
    mask = dataFrame.applymap(lambda x: isinstance(x, str) and item in x).any(axis=1)
    return dataFrame[mask]

You can easily change the lambda to use regex if needed.
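One way the regex variant might look, as a sketch: search_regex and cell_matches are hypothetical names, and the sample data is made up. It uses DataFrame.map where available (pandas 2.1+) and falls back to applymap on older versions:

```python
import re

import pandas as pd


def search_regex(data_frame, pattern):
    """Return rows where any string cell matches `pattern` (hypothetical helper)."""
    compiled = re.compile(pattern)

    def cell_matches(x):
        # Non-string cells never match, which avoids AttributeError on mixed columns.
        return isinstance(x, str) and compiled.search(x) is not None

    # DataFrame.map exists from pandas 2.1; older versions only have applymap.
    elementwise = getattr(data_frame, "map", None) or data_frame.applymap
    mask = elementwise(cell_matches).any(axis=1)
    return data_frame[mask]


# Made-up frame with mixed types inside each column.
df = pd.DataFrame({"a": ["foo1", 2, "bar"], "b": [1.0, "x9", None]})

# Rows where some string cell ends in a digit.
result = search_regex(df, r"\d$")
print(result)
```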

answered Aug 8, 2021 at 19:48

Gage Miller
