find column and row index on specific regex match on a pandas dataframe

Question

Let's say I have a pandas dataframe with string content in its cells.

What's the best way to find the strings that match a specific regex and then return a list of tuples with their respective row and column indexes?

I.e.,

import pandas as pd
mydf = pd.DataFrame({'a':['hello', 'world'], 'b': ['hello', 'folks']})

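def findIndex(mydf, regex):
    # ... find the cells of mydf that match regex ...
    return regex_indexes  # desired result: a list of (row, column) index tuples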

If I do:

regex = r"hello"
findIndex(mydf, regex) # it'd return [(0,0), (0,1)],

If I do:

regex = r"matt"
findIndex(mydf, regex) # it'd return [(-1,-1)],

If I do:

regex = r"folks"
findIndex(mydf, regex) # it'd return [(1,1)],

I could do a double for loop on the pd.DataFrame but was wondering if other ideas are better...

A double loop won't be necessary. Wouldn't None be better for no match?
– Anton vBR, Commented Feb 5, 2018 at 19:00
@AntonvBR good call, yeah, None would also work and is probably a better idea.
– Dnaiel, Commented Feb 5, 2018 at 19:02

Tai · Accepted Answer · 2018-02-05 20:06:30Z

4

You can try to use apply, str.match and nonzero.

def findIdx(df, pattern):
    return df.apply(lambda x: x.str.match(pattern)).values.nonzero()

findIdx(mydf, r"hello")
(array([0, 0]), array([0, 1]))

df.apply(lambda x: x.str.match(pattern)).values returns a boolean array with the same shape as df, where True marks a cell that matches and False one that does not.
We then use nonzero to get the row and column positions of the True entries.
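
To make the two steps concrete, here is a small sketch using the mydf from the question. The names mask, rows and cols are only for illustration, and .to_numpy() is used in place of .values (either works here on pandas 0.24 or later).

```python
import pandas as pd

mydf = pd.DataFrame({'a': ['hello', 'world'], 'b': ['hello', 'folks']})

# Step 1: apply str.match column by column to get a boolean mask shaped like mydf.
mask = mydf.apply(lambda col: col.str.match(r"hello")).to_numpy()
print(mask)

# Step 2: nonzero returns one array of row positions and one of column positions.
rows, cols = mask.nonzero()
print(rows.tolist(), cols.tolist())  # [0, 0] [0, 1]
```

The first array holds the row positions and the second the column positions, so the i-th match sits at (rows[i], cols[i]).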

It will return the indices that match the pattern in a tuple of arrays. If you need a list of tuples, use list(zip(*findIdx(mydf, r"hello")))

[(0, 0), (0, 1)]

or np.transpose(findIdx(mydf, r"hello")).
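
If a list of plain Python tuples is the goal, np.argwhere is another option. The following is a minimal sketch, not part of the original answer; the function name find_index_pairs is hypothetical, and it returns an empty list rather than None when nothing matches.

```python
import numpy as np
import pandas as pd

def find_index_pairs(df, pattern):
    """Return [(row, col), ...] positions of cells whose start matches pattern."""
    mask = df.apply(lambda col: col.str.match(pattern)).to_numpy(dtype=bool)
    # argwhere yields an (n_matches, 2) array; tolist() converts to plain Python ints.
    return [tuple(pair) for pair in np.argwhere(mask).tolist()]

mydf = pd.DataFrame({'a': ['hello', 'world'], 'b': ['hello', 'folks']})
print(find_index_pairs(mydf, r"hello"))  # [(0, 0), (0, 1)]
print(find_index_pairs(mydf, r"folks"))  # [(1, 1)]
print(find_index_pairs(mydf, r"matt"))   # []
```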

To return None when nothing is found, one can try

def findIdx(df, pattern):
    ret = df.apply(lambda x: x.str.match(pattern)).values.nonzero()
    return None if len(ret[0]) == 0 else ret

Note: str.match uses re.match under the hood, so in this example function it only matches strings that begin with pattern. To find whether a string contains pattern anywhere as a substring, use str.contains rather than str.match.
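
The difference is easy to see on a small Series. This is an illustrative sketch; the Series s and its values are made up for the example.

```python
import pandas as pd

s = pd.Series(['br', 'a hello,', 'hello'])

# str.match anchors the pattern at the start of each string (like re.match).
print(s.str.match(r"hel").tolist())     # [False, False, True]

# str.contains searches anywhere in the string (like re.search).
print(s.str.contains(r"hel").tolist())  # [False, True, True]
```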

edited Feb 5, 2018 at 20:06

answered Feb 5, 2018 at 19:25

Tai


4 Comments

Dnaiel Over a year ago

Thanks, this is definitely in the right direction. I have a follow-up question, though: it doesn't seem to behave like a typical regex. If I do findIdx(pd.DataFrame({'a': ['br', 'a hello,'], 'b':['das','hello']}), r"hel"), it only matches the one at row 1, col 1 but not the one in col 0. Any suggestions on how to write the pattern part for more general cases?

Tai Over a year ago

@Dnaiel I think you could use str.contains if you want to find whether a string contains a substring.

leopardxpreload Over a year ago

.to_numpy().nonzero() will be required as of version 0.24.0

pauljohn32 Over a year ago

FWIW, In a "real" application of this, the thing doesn't work as expected unless NaN in dataframe are coded as false. So I change to return df.apply(lambda x: x.str.match(pattern, na=False)).to_numpy().nonzero().
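
Putting the suggestions from these comments together, a variant might look like the sketch below. It assumes substring matching is wanted (str.contains instead of the answer's str.match) and that missing values should count as non-matches; the name find_idx_na_safe and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd

def find_idx_na_safe(df, pattern):
    # na=False makes NaN cells count as "no match" instead of propagating NaN.
    mask = df.apply(lambda col: col.str.contains(pattern, na=False))
    return mask.to_numpy().nonzero()

df = pd.DataFrame({'a': ['br', 'a hello,'], 'b': [np.nan, 'hello']})
rows, cols = find_idx_na_safe(df, r"hel")
print(list(zip(rows.tolist(), cols.tolist())))  # [(1, 0), (1, 1)]
```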
