Pandas Dataframe check if a value exists using regex

Question

I have a large DataFrame and I want to check whether any cell contains the string admin.

   col1                   col2 ... coln
0   323           roster_admin ... rota_user
1   542  assignment_rule_admin ... application_admin
2   123           contact_user ... configuration_manager
3   235         admin_incident ... incident_user
... ...  ...                   ... ...

I tried to use df.isin(['*admin*']).any() but it seems that isin doesn't support regex. How can I search through all columns using regex?

I have avoided loops because the DataFrame contains over 10 million rows and many columns, so efficiency matters to me.

df.isin(vals) checks whether the DataFrame/Series values are in vals, where vals must be a set or list-like. I don't think df.isin(vals) is the natural way to check whether a value is contained in a DataFrame column.
– YaOzI, commented Jul 4, 2018 at 10:40
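
To illustrate the comment above, here is a minimal sketch on a made-up two-row frame (the data is only an assumption for illustration). isin compares whole cell values, so a wildcard pattern is treated as a literal string, whereas Series.str.contains does substring matching:

```python
import pandas as pd

# Small illustrative frame; the values are assumptions, not the asker's data.
df = pd.DataFrame({'col1': [323, 542],
                   'col2': ['roster_admin', 'contact_user']})

# isin tests for equality with the listed values, so '*admin*' is just a literal.
wildcard_hit = bool(df.isin(['*admin*']).any().any())

# An exact cell value does match.
exact_hit = bool(df.isin(['roster_admin']).any().any())

# Substring (regex) matching needs the .str accessor on a string column.
substring_hit = bool(df['col2'].str.contains('admin').any())

print(wildcard_hit, exact_hit, substring_hit)  # False True True
```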

YaOzI · Accepted Answer · 2018-07-04 14:16:22Z

There are two solutions:

The df.col.apply method is more straightforward but also a little slower:

In [1]: import pandas as pd

In [2]: import re

In [3]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

In [4]: df
Out[4]: 
   col1       col2
0     1      admin
1     2         aa
2     3         bb
3     4  c_admin_d
4     5   ee_admin

In [5]: r = re.compile(r'.*(admin).*')

In [6]: df.col2.apply(lambda x: bool(r.match(x)))
Out[6]: 
0     True
1    False
2    False
3     True
4     True
Name: col2, dtype: bool

In [7]: %timeit -n 100000 df.col2.apply(lambda x: bool(r.match(x)))
167 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
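
As a side note to the apply approach above, pandas also has a built-in Series.str.contains, which takes a regex directly and does not need a compiled pattern or a lambda. This is only a sketch on the same toy column, and I have not timed it against the asker's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

# regex=True is the default; na=False makes missing values count as "no match".
mask = df['col2'].str.contains(r'admin', regex=True, na=False)

print(mask.tolist())  # [True, False, False, True, True]
```

Because str.contains searches anywhere in the string, the pattern does not need the leading and trailing `.*` that `re.match` requires.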

The np.vectorize method requires importing numpy, but it's more efficient (about 4 times faster in my timeit test).

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import re

In [4]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

In [5]: df
Out[5]: 
   col1       col2
0     1      admin
1     2         aa
2     3         bb
3     4  c_admin_d
4     5   ee_admin

In [6]: r = re.compile(r'.*(admin).*')

In [7]: regmatch = np.vectorize(lambda x: bool(r.match(x)))

In [8]: regmatch(df.col2.values)
Out[8]: array([ True, False, False,  True,  True])

In [9]: %timeit -n 100000 regmatch(df.col2.values)
43.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
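
A variation on the np.vectorize approach, as a sketch rather than a benchmark: `re.search` finds the substring anywhere, so the pattern can be plain `admin`, and passing `otypes` fixes the output dtype so the function also works on an empty array. Keep in mind that the NumPy docs describe vectorize as essentially a for loop provided for convenience, and that timings on a 5-row frame mostly measure call overhead, so it is worth re-timing on realistic data. The name `regsearch` is just my own choice here.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

r = re.compile(r'admin')

# otypes=[bool] tells vectorize the result dtype up front,
# so it does not have to call the function once to infer it.
regsearch = np.vectorize(lambda x: r.search(x) is not None, otypes=[bool])

result = regsearch(df['col2'].to_numpy())
print(result.tolist())  # [True, False, False, True, True]

# With otypes set, an empty input simply gives an empty boolean array.
empty = regsearch(np.array([], dtype=object))
```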

Since you have changed your question to check any cell, and are also concerned about time efficiency:

# if you want to check all columns no matter what `dtypes` they are
dfs = df.astype(str, copy=True, errors='raise')
regmatch(dfs.values) # This will return a 2-d array of booleans
regmatch(dfs.values).any() # For existence.
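
If the pattern can only ever match text (as with admin), one possible alternative to converting the whole frame is to skip the numeric columns and test the rest column by column. This is a sketch under that assumption; `any_cell_matches` is a hypothetical helper name, and the sample data is adapted from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [323, 542, 123, 235],
    'col2': ['roster_admin', 'assignment_rule_admin', 'contact_user', 'admin_incident'],
    'coln': ['rota_user', 'application_admin', 'configuration_manager', 'incident_user'],
})


def any_cell_matches(frame, pattern):
    """Return True if any non-numeric cell matches the regex pattern.

    Assumption: numeric columns cannot match, so they are skipped.
    """
    candidates = frame.select_dtypes(exclude='number')
    # The generator lets any() stop at the first column that has a match.
    return any(
        col.astype(str).str.contains(pattern, regex=True).any()
        for _, col in candidates.items()
    )


print(any_cell_matches(df, r'admin'))      # True
print(any_cell_matches(df, r'superuser'))  # False
```

Only the columns that are actually scanned get converted with astype(str), which avoids copying numeric data on a frame with many rows.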

You can still use the df.applymap method, but again, it will be slower.

dfs = df.astype(str, copy=True, errors='raise')
r = re.compile(r'.*(admin).*')
dfs.applymap(lambda x: bool(r.match(x))) # This will return a dataframe of booleans.
dfs.applymap(lambda x: bool(r.match(x))).any().any() # For existence.
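
A note for newer pandas: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map. The sketch below picks whichever method the installed version provides; the toy frame is the same one used above, and the version check is just one way to do it:

```python
import re

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

r = re.compile(r'admin')
dfs = df.astype(str)

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
method_name = 'map' if hasattr(pd.DataFrame, 'map') else 'applymap'
elementwise = getattr(dfs, method_name)

mask = elementwise(lambda x: r.search(x) is not None)  # DataFrame of booleans
found = bool(mask.to_numpy().any())                     # for existence

print(found)  # True
```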

What if the value in a column is a list, for example 'col2':[['aa','admin'],['admin_b']..]?
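
For list-valued cells like the ones in this comment, one possible approach (a sketch, not something from the answer itself) is to explode the column so that each list element gets its own row, match on the elements, and then collapse back to the original index. The third row is made up to show a non-match:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [['aa', 'admin'], ['admin_b'], ['cc']]})

# explode repeats the original index once per list element.
exploded = df['col2'].explode().astype(str)

# Match each element, then ask per original row whether any element matched.
mask = exploded.str.contains('admin').groupby(level=0).any()

print(mask.tolist())  # [True, True, False]
```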

min2bro · Accepted Answer · 2018-07-04 10:01:15Z

Try this:

import pandas as pd

df=pd.DataFrame(
    {'col1': [323,542,123,235],
     'col2': ['roster_admin','assignment_rule_admin','contact_user','admin_incident'] ,
    })

df.apply(lambda row: row.astype(str).str.contains('admin').any(), axis=1)

Output:

0     True
1     True
2    False
3     True
dtype: bool
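
The same per-row result can also be built column by column instead of row by row, which means one str.contains call per column rather than one per row. This is a sketch on the same sample data; I have not measured it on a 10-million-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [323, 542, 123, 235],
     'col2': ['roster_admin', 'assignment_rule_admin', 'contact_user', 'admin_incident'],
    })

# apply defaults to axis=0, so the lambda receives whole columns.
cell_mask = df.astype(str).apply(lambda col: col.str.contains('admin'))

row_mask = cell_mask.any(axis=1)          # one boolean per row
found = bool(cell_mask.to_numpy().any())  # single existence check

print(row_mask.tolist())  # [True, True, False, True]
```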

7 Comments

Michael Dz Over a year ago

Unfortunately this solution is unacceptably slow for me. I waited over 5 minutes and eventually gave up. isin takes around 5 seconds.

min2bro Over a year ago

@MichaelDz Do we have to check for the word admin only in col2?

Michael Dz Over a year ago

Sorry I didn't mention in the original post that I want to search through all the columns.

YaOzI Over a year ago

@MichaelDz Even check col1, which only contains numbers and has dtype int64? Or do you have a DataFrame with mixed data types in its columns?

Michael Dz Over a year ago

I have mixed data.
