
Let's say I have a pandas dataframe with string content in its cells.

What's the best way to find the cells that match a specific regex and return a list of tuples with their respective row and column indexes?

For example:

import pandas as pd
mydf = pd.DataFrame({'a':['hello', 'world'], 'b': ['hello', 'folks']})

def findIndex(mydf, regex):
    return regex_indexes

If I do:

regex = r"hello"
findIndex(mydf, regex)  # it'd return [(0, 0), (0, 1)]

If I do:

regex = r"matt"
findIndex(mydf, regex)  # it'd return [(-1, -1)]

If I do:

regex = r"folks"
findIndex(mydf, regex)  # it'd return [(1, 1)]

I could use a double for loop over the pd.DataFrame, but I was wondering whether there is a better approach.

  • A double loop won't be necessary. Wouldn't None be better for no match? Commented Feb 5, 2018 at 19:00
  • @AntonvBR good call, yeah None would also work and probably a better idea Commented Feb 5, 2018 at 19:02

1 Answer

You can use apply, str.match and nonzero:

def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern)).values.nonzero()

findIdx(mydf, r"hello")
(array([0, 0]), array([0, 1]))
  • df.apply(lambda x: x.str.match(pattern)).values returns a boolean array with the same shape as df, where True marks a cell that matches and False one that does not.

  • We then use nonzero to get the row and column indices of the True entries.

It will return the indices that match the pattern in a tuple of arrays. If you need a list of tuples, use list(zip(*findIdx(mydf, r"hello")))

[(0, 0), (0, 1)] 

or np.transpose(findIdx(mydf, r"hello")).
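As a minimal sketch of the two conversions just mentioned, reusing mydf and findIdx from above: list(zip(...)) gives a list of (row, col) tuples, while np.transpose gives a 2-D array with one row per match. The names pairs_list and pairs_arr are only illustrative, and the int() casts are an optional extra to get plain Python integers.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a': ['hello', 'world'], 'b': ['hello', 'folks']})

def findIdx(df, pattern):
    # Boolean mask of matching cells, then (row_indices, col_indices)
    return df.apply(lambda x: x.str.match(pattern)).values.nonzero()

rows, cols = findIdx(mydf, r"hello")

# List of (row, col) tuples; int() turns NumPy integers into plain ints
pairs_list = [(int(r), int(c)) for r, c in zip(rows, cols)]

# 2-D array of shape (n_matches, 2), one [row, col] pair per match
pairs_arr = np.transpose((rows, cols))

print(pairs_list)
print(pairs_arr)
```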


If you need to return None when nothing is found, you can try:

def findIdx(df, pattern):
    ret = df.apply(lambda x: x.str.match(pattern)).values.nonzero()
    return None if len(ret[0]) == 0 else ret

Note: str.match uses re.match under the hood, so in this example function it only matches strings that begin with pattern. If you want to find whether a string contains pattern anywhere as a substring, use str.contains rather than str.match.
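A possible substring-search variant, as a sketch rather than a definitive implementation: it swaps str.match for str.contains and passes na=False so that missing values count as non-matches. The function name find_idx_contains and the choice to return a list of tuples (empty when nothing matches) are assumptions made for illustration.

```python
import pandas as pd

def find_idx_contains(df, pattern):
    # str.contains searches anywhere in the string (re.search semantics);
    # na=False treats missing cells as non-matches
    mask = df.apply(lambda col: col.str.contains(pattern, regex=True, na=False))
    rows, cols = mask.to_numpy().nonzero()
    # Plain-int (row, col) tuples; an empty list means no match
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

df = pd.DataFrame({'a': ['br', 'a hello,'], 'b': ['das', 'hello']})
result = find_idx_contains(df, r"hel")
print(result)
```

Returning an empty list instead of None keeps the return type uniform; wrap the call in `or None` if None is preferred for the no-match case.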


4 Comments

thanks this is def in the right direction. I have a follow up q though, it doesn't seem to behave as a typical regex. If I do findIdx(pd.DataFrame({'a': ['br', 'a hello,'], 'b':['das','hello']}), r"hel"), it will only match the one on row 1, col 1 but not the one on col 0... any suggestions on how to write the pattern part for more general cases?
@Dnaiel I think perhaps you use str.contains if you want to find whether a string contains a substring.
As of version 0.24.0, .to_numpy().nonzero() is the recommended form: to_numpy() was added in that release as the preferred replacement for .values.
FWIW, in a "real" application this doesn't work as expected unless NaN values in the dataframe are treated as False, so I changed it to return df.apply(lambda x: x.str.match(pattern, na=False)).to_numpy().nonzero().
