
I have this dataframe and a regex pattern.

import re
import pandas as pd

df = pd.DataFrame({'a':['base','rhino','gray','horn'],
                   'b':['rhino','elephant', 'gray','trunk'],
                   'c':['cheese','lion', 'beige','mane']})

       a         b       c
0   base     rhino  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane

I needed to find the row that matches the regex pattern. The code below worked, but then I realized that in production the previous row also contains one of those words ("rhino" in row 0), so it matches as well.

pattern = '.*(rhino|elephant|lion)'
row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).any(axis=1)].tolist()

So I need to update this and find the row that has all of those words in the pattern (row 1).

I came up with a solution but it's not the most elegant. Is there a shorter way to achieve this?

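def match_pat(df):

    def return_last(x):
        # Last matching row index, or None when nothing matched.
        return x[-1] if x else None

    # Row indices where any column starts with the given word.
    a = df.index[df.apply(lambda x: x.str.match('rhino', flags=re.IGNORECASE)).any(axis=1)].tolist()
    b = df.index[df.apply(lambda x: x.str.match('elephant', flags=re.IGNORECASE)).any(axis=1)].tolist()
    c = df.index[df.apply(lambda x: x.str.match('lion', flags=re.IGNORECASE)).any(axis=1)].tolist()

    a = return_last(a)
    b = return_last(b)
    c = return_last(c)

    # All three words must have their last match in the same row.
    if a is not None and a == b == c:
        return a
    else:
        return 0  # fallback; note that 0 is also a valid row index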


row_index = match_pat(df)
print(row_index)

The value of row_index is correctly 1. But is there a shorter way? Can you use an "AND" operator in the lambda expression to match each word individually?

  • Do you not just want all instead of any?
    row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).all(axis=1)].tolist()
    Or do you want to make sure that all three distinct values are present? Commented Aug 30, 2021 at 19:31
  • Ah yes, I was focused on the regex and didn't think about .all(axis=1). Thanks! Commented Aug 30, 2021 at 20:31
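
The difference between the two reductions suggested in the comment can be sketched as follows. This is a minimal illustration that rebuilds the question's dataframe; the names matches, any_rows and all_rows are introduced here only for the example.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})

pattern = '.*(rhino|elephant|lion)'

# Boolean frame: True where a cell matches the pattern.
matches = df.apply(lambda col: col.str.match(pattern, flags=re.IGNORECASE))

# Rows with at least one matching cell (the original behaviour).
any_rows = df.index[matches.any(axis=1)].tolist()

# Rows where every cell matches.
all_rows = df.index[matches.all(axis=1)].tolist()

print(any_rows)  # [0, 1]
print(all_rows)  # [1]
```

Note that .all(axis=1) only checks that every cell matches one of the alternatives. A row such as rhino, rhino, rhino would also pass, so it does not by itself guarantee that all three distinct words are present.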

2 Answers


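You can use Python sets for that. The comparison is exact and case-sensitive, so a row matches only when its values are these three words and nothing else: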

words = set(['rhino', 'lion', 'elephant'])
df.apply(lambda r: set(r)==words, axis=1)

Output:

0    False
1     True
2    False
3    False

And to subset:

words = set(['rhino', 'lion', 'elephant'])
df[df.apply(lambda r: set(r)==words, axis=1)]

Output:

       a         b     c
1  rhino  elephant  lion
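
If the frame may contain extra columns or mixed-case values, a subset test on lowercased cells is one possible variation. This is only a sketch under those assumptions; the helper has_all_words is a hypothetical name and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})

words = {'rhino', 'elephant', 'lion'}

def has_all_words(row):
    # Hypothetical helper: lowercase every cell, then check that
    # each target word appears somewhere in the row.
    cells = {str(value).lower() for value in row}
    return words <= cells

mask = df.apply(has_all_words, axis=1)
row_index = df.index[mask].tolist()
print(row_index)  # [1]
```

Unlike the equality test, the subset test still matches when the row holds additional values beyond the three words.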

Use str.extract:

import re

pat = re.compile(r'.*(rhino|elephant|lion)', re.I)

out = df[df.apply(lambda x: x.str.extract(pat).squeeze()).notna().all(axis=1)]
>>> out
       a         b     c
1  rhino  elephant  lion

>>> out.index.tolist()
[1]
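
The question also asks whether the words can be matched individually and combined with a logical AND. One way to sketch that, assuming it is acceptable to join each row into a single string first, is to build one boolean mask per word and reduce them with .all(axis=1). The names row_text and per_word are illustrative only.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})

words = ['rhino', 'elephant', 'lion']

# Join each row into one string so a word can be found in any column.
row_text = df.astype(str).apply(lambda row: ' '.join(row), axis=1)

# One boolean column per word: True where the row contains that word.
per_word = pd.concat(
    [row_text.str.contains(rf'\b{re.escape(w)}\b', flags=re.IGNORECASE)
     for w in words],
    axis=1,
)

# Logical AND across the per-word columns.
row_index = df.index[per_word.all(axis=1)].tolist()
print(row_index)  # [1]
```

This requires every word to appear somewhere in the row, regardless of column order, which is closer to "all three distinct values are present" than checking that each cell matches any one alternative.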
