Pandas dataframe lambda expression: can I use an AND operator in regex?

Question

I have this dataframe and a regex pattern.

import re
import pandas as pd

df = pd.DataFrame({'a':['base','rhino','gray','horn'],
                   'b':['rhino','elephant', 'gray','trunk'],
                   'c':['cheese','lion', 'beige','mane']})

       a         b       c
0   base     rhino  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane

I need to find the row that matches the regex pattern. The code below worked, but then I realized that in production the previous row also contains one of those words ("rhino" in row 0).

pattern = '.*(rhino|elephant|lion)'
row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).any(axis=1)].tolist()

So I need to update this and find the row that has all of those words in the pattern (row 1).

I came up with a solution but it's not the most elegant. Is there a shorter way to achieve this?

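def match_pat(df):
    # Return the row index shared by the last match of each word, else 0.

    def return_last(x):
        # Last element of a list (None is returned implicitly otherwise).
        if isinstance(x, list):
            return x[-1]

    # Row indices where any cell starts with the given word (case-insensitive).
    a = df.index[df.apply(lambda x: x.str.match('rhino', flags=re.IGNORECASE)).any(axis=1)].tolist()
    b = df.index[df.apply(lambda x: x.str.match('elephant', flags=re.IGNORECASE)).any(axis=1)].tolist()
    c = df.index[df.apply(lambda x: x.str.match('lion', flags=re.IGNORECASE)).any(axis=1)].tolist()

    a = return_last(a)
    b = return_last(b)
    c = return_last(c)

    # All three words must have their last match on the same row.
    if a == b == c:
        return a
    else:
        return 0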


row_index = match_pat(df)
print(row_index)

The value of row_index is correctly 1. But is there a shorter way? Can you use an "AND" operator in the lambda expression to match each word individually?

Do you not just want all instead of any?

row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).all(axis=1)].tolist()

Or do you want to make sure that all three distinct values are present?
– Henry Ecker ♦, Commented Aug 30, 2021 at 19:31

Ah yes, I was focused on the regex and didn't think about .all(axis=1). Thanks!
– Chuck, Commented Aug 30, 2021 at 20:31
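
The two readings raised in the comment above can give different results. The following is a minimal sketch, not part of the original thread: the frame `demo` and the names `cell_hits`, `every_cell`, `has_word` and `all_words` are made up for illustration. It compares "every cell matches some word" with "every word is matched by some cell".

```python
import re
import pandas as pd

# Hypothetical frame: row 1 repeats a single word, row 2 holds all three words.
demo = pd.DataFrame({'a': ['base', 'rhino', 'rhino'],
                     'b': ['rhino', 'Rhino', 'elephant'],
                     'c': ['cheese', 'rhino', 'lion']})

pattern = '.*(rhino|elephant|lion)'

# Boolean frame: True where a cell matches any of the alternatives.
cell_hits = demo.apply(lambda col: col.str.match(pattern, flags=re.IGNORECASE))

# Reading 1: every cell in the row matched some word.
every_cell = demo.index[cell_hits.all(axis=1)].tolist()

# Reading 2: each word is found in at least one cell of the row.
words = ['rhino', 'elephant', 'lion']
has_word = {w: demo.apply(lambda col: col.str.contains(w, case=False)).any(axis=1)
            for w in words}
all_words = demo.index[pd.concat(has_word, axis=1).all(axis=1)].tolist()

print(every_cell)  # rows where all cells match
print(all_words)   # rows where all three words occur
```

On this made-up data the first reading also accepts the row that only repeats "rhino", while the second reading keeps only the row containing all three words.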

mozway · Accepted Answer · 2021-08-30 20:09:09Z

You can use Python sets for that:

words = {'rhino', 'lion', 'elephant'}
df.apply(lambda r: set(r) == words, axis=1)

Output:

0    False
1     True
2    False
3    False
dtype: bool

And to subset:

words = {'rhino', 'lion', 'elephant'}
df[df.apply(lambda r: set(r) == words, axis=1)]

Output:

       a         b     c
1  rhino  elephant  lion
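
Note that the set comparison above is case-sensitive and requires the row to contain exactly those three values. If the case-insensitive behavior of the question's `re.IGNORECASE` flag is wanted, or the row may have extra columns, a subset test on lower-cased values is one option. This is a rough sketch under those assumptions: the helper name `row_has_all` and the mixed-case values in the frame are hypothetical, not from the original post.

```python
import pandas as pd

# Same shape as the question's frame, with mixed case added for illustration.
df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'Elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'LION', 'beige', 'mane']})

words = {'rhino', 'lion', 'elephant'}

def row_has_all(row, required=words):
    """Return True if every required word appears in the row, ignoring case."""
    return required.issubset({str(value).lower() for value in row})

# Boolean mask over rows, then the matching index labels.
mask = df.apply(row_has_all, axis=1)
row_index = df.index[mask].tolist()
print(row_index)
```

Because `issubset` is used instead of `==`, additional columns with unrelated values would not prevent a match.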

Corralien · Accepted Answer · 2021-08-30 20:08:57Z

Use str.extract and keep the rows in which every cell matches the pattern:

import re

pat = re.compile(r'.*(rhino|elephant|lion)', re.I)

out = df[df.apply(lambda x: x.str.extract(pat).squeeze()).notna().all(axis=1)]

>>> out
       a         b     c
1  rhino  elephant  lion

>>> out.index.tolist()
[1]
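
On the literal question of an "AND" inside the regex itself: Python's `re` syntax has no AND operator, but chained lookaheads give the same effect on a single string. The following is a possible sketch rather than the method of either answer. It assumes it is acceptable to join each row's cells into one string first, and the names `row_text` and `mask` are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})

words = ['rhino', 'elephant', 'lion']

# One lookahead per word; all of them must succeed from the start of the string,
# which acts like a logical AND across the words.
pattern = ''.join(rf'(?=.*\b{re.escape(w)}\b)' for w in words)

# Join each row's cells into a single space-separated string.
row_text = df.astype(str).apply(' '.join, axis=1)

# str.match anchors at the start of each string, where the lookaheads are checked.
mask = row_text.str.match(pattern, flags=re.IGNORECASE)
row_index = df.index[mask].tolist()
print(row_index)
```

The `\b` word boundaries keep "lion" from matching inside a longer word such as "lioness"; drop them if substring matches are intended.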
