10

I have a big dataframe and I want to check whether any cell contains the string admin.

   col1                   col2 ... coln
0   323           roster_admin ... rota_user
1   542  assignment_rule_admin ... application_admin
2   123           contact_user ... configuration_manager
3   235         admin_incident ... incident_user
... ...  ...                   ... ...

I tried df.isin(['*admin*']).any(), but isin doesn't seem to support regex. How can I search through all columns using a regex?

I have avoided loops because the dataframe contains over 10 million rows and many columns, and efficiency is important to me.

2
  • check stackoverflow.com/questions/25292838/… Commented Jul 4, 2018 at 10:02
  • df.isin(vals) checks whether the DataFrame/Series values are in vals, where vals must be a set or list-like. I don't think df.isin(vals) is the natural way to check whether a value is contained in a DataFrame column. Commented Jul 4, 2018 at 10:40

2 Answers

14

There are two solutions:

  1. The df.col.apply method is more straightforward but also a little slower:

    In [1]: import pandas as pd
    
    In [2]: import re
    
    In [3]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [4]: df
    Out[4]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [5]: r = re.compile(r'.*(admin).*')
    
    In [6]: df.col2.apply(lambda x: bool(r.match(x)))
    Out[6]: 
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: col2, dtype: bool
    
    In [7]: %timeit -n 100000 df.col2.apply(lambda x: bool(r.match(x)))
    167 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

  2. The np.vectorize method requires importing numpy, but it is more efficient (about 4 times faster in my timeit test):

    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: import re
    
    In [4]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})
    
    In [5]: df
    Out[5]: 
       col1       col2
    0     1      admin
    1     2         aa
    2     3         bb
    3     4  c_admin_d
    4     5   ee_admin
    
    In [6]: r = re.compile(r'.*(admin).*')
    
    In [7]: regmatch = np.vectorize(lambda x: bool(r.match(x)))
    
    In [8]: regmatch(df.col2.values)
    Out[8]: array([ True, False, False,  True,  True])
    
    In [9]: %timeit -n 100000 regmatch(df.col2.values)
    43.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
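
A side note on the pattern used in both methods: `r.match` anchors at the start of the string, which is why the pattern is wrapped in `.*`. A minimal sketch, assuming the same sample frame, shows that `re.search` with the bare pattern gives the same result here, and that the two differ when a cell contains a newline (the `multiline` string is a made-up example, not from the question):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

r_match = re.compile(r'.*(admin).*')   # pattern from the answer, used with match()
r_search = re.compile(r'admin')        # bare pattern, used with search()

values = df['col2'].to_numpy()
by_match = np.vectorize(lambda x: bool(r_match.match(x)))(values)
by_search = np.vectorize(lambda x: bool(r_search.search(x)))(values)

# Hypothetical cell value: '.' does not match '\n' by default, so match()
# with '.*' cannot reach an 'admin' that appears on a later line.
multiline = 'first line\nadmin on second line'
print(bool(r_match.match(multiline)), bool(r_search.search(multiline)))
```

Whether `search` is also faster on real data would need to be timed; this only illustrates the behavioral difference.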
    

Since you have changed your question to check any cell and are also concerned about time efficiency:

# if you want to check all columns no matter what `dtypes` they are
dfs = df.astype(str, copy=True, errors='raise')
regmatch(dfs.values) # This will return a 2-d array of booleans
regmatch(dfs.values).any() # For existence.

You can still use the df.applymap method, but again, it will be slower. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)

dfs = df.astype(str, copy=True, errors='raise')
r = re.compile(r'.*(admin).*')
dfs.applymap(lambda x: bool(r.match(x))) # This will return a dataframe of booleans.
dfs.applymap(lambda x: bool(r.match(x))).any().any() # For existence.
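
Another option, sketched here rather than benchmarked, is to let pandas apply the regex per column with `Series.str.contains` after casting to `str`. This assumes the same small sample frame as above; how it compares on 10 million rows would have to be measured.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['admin', 'aa', 'bb', 'c_admin_d', 'ee_admin']})

# Cast every column to str, then run a regex search on each column as a whole.
mask = df.astype(str).apply(lambda col: col.str.contains(r'admin', regex=True))

# Collapse the 2-d boolean mask to a single existence check.
has_admin = bool(mask.to_numpy().any())
print(mask)
print(has_admin)
```

`mask` has the same shape as `df`, so it can also be used to locate the matching cells rather than only test for existence.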

1 Comment

What if the value in a column is a list, for example 'col2':[['aa','admin'],['admin_b']..]?
8

Try this:

import pandas as pd

df=pd.DataFrame(
    {'col1': [323,542,123,235],
     'col2': ['roster_admin','assignment_rule_admin','contact_user','admin_incident'] ,
    })

df.apply(lambda row: row.astype(str).str.contains('admin').any(), axis=1)

Output:

0     True
1     True
2    False
3     True
dtype: bool
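
With `axis=1`, the lambda runs once per row, which can be costly on very large frames. A hedged sketch of the same row-level result computed one column at a time follows; `rows_containing` is a hypothetical helper name, and no timing claim is made for it.

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [323, 542, 123, 235],
     'col2': ['roster_admin', 'assignment_rule_admin', 'contact_user', 'admin_incident'],
    })

def rows_containing(frame, pattern):
    """Return a boolean Series marking rows where any cell matches `pattern`."""
    result = pd.Series(False, index=frame.index)
    # Loop over columns (few) instead of rows (many), OR-ing the per-column masks.
    for name in frame.columns:
        result |= frame[name].astype(str).str.contains(pattern, regex=True, na=False)
    return result

row_mask = rows_containing(df, 'admin')
print(row_mask)
```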

7 Comments

Unfortunately, this solution is unacceptably slow for me. I waited over 5 minutes and eventually gave up. isin takes around 5 seconds.
@MichaelDz Do we have to check for the word admin only in col2?
Sorry, I didn't mention in the original post that I want to search through all the columns.
@MichaelDz Even col1, which contains only numbers and has dtype int64? Or does your dataframe contain mixed data types in its columns?
I have mixed data.