I have this dataframe and a regex pattern.
import re
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})
       a         b       c
0   base     rhino  cheese
1  rhino  elephant    lion
2   gray      gray   beige
3   horn     trunk    mane
I needed to find a row that matched the regex pattern. The code below worked, but then I realized that in production the previous row also contains one of those words ("rhino" in row 0), so it matches too.
pattern = '.*(rhino|elephant|lion)'
row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).any(axis=1)].tolist()
So I need to update this and find the row that has all of those words in the pattern (row 1).
I came up with a solution but it's not the most elegant. Is there a shorter way to achieve this?
def match_pat(df):
    def return_last(x):
        if isinstance(x, list):
            return x[-1]
    a = df.index[df.apply(lambda x: x.str.match('rhino', flags=re.IGNORECASE)).any(axis=1)].tolist()
    b = df.index[df.apply(lambda x: x.str.match('elephant', flags=re.IGNORECASE)).any(axis=1)].tolist()
    c = df.index[df.apply(lambda x: x.str.match('lion', flags=re.IGNORECASE)).any(axis=1)].tolist()
    a = return_last(a)
    b = return_last(b)
    c = return_last(c)
    if a == b == c:
        return a
    else:
        return 0

row_index = match_pat(df)
print(row_index)
The value of row_index is correctly 1. But is there a shorter way? Can you use an "AND" operator in the lambda expression to match each word individually?
all instead of any?
row_index = df.index[df.apply(lambda x: x.str.match(pattern, flags=re.IGNORECASE)).all(axis=1)].tolist()
Or do you want to make sure that all three distinct values are present?
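If the requirement is that all three distinct words appear somewhere in the row, one possible approach is to build one boolean mask per word and AND them together. This is only a sketch, assuming the sample dataframe from the question. The helper name rows_with_all_words and the words list are made up for illustration and do not come from pandas.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['base', 'rhino', 'gray', 'horn'],
                   'b': ['rhino', 'elephant', 'gray', 'trunk'],
                   'c': ['cheese', 'lion', 'beige', 'mane']})

# Hypothetical helper (not a pandas API): index labels of rows in which
# every word matches at least one cell.
def rows_with_all_words(df, words):
    # One boolean Series per word: True where any cell in the row matches it.
    masks = [
        df.apply(lambda col: col.str.match(w, flags=re.IGNORECASE)).any(axis=1)
        for w in words
    ]
    # AND the per-word masks together: a row qualifies only if all are True.
    all_present = pd.concat(masks, axis=1).all(axis=1)
    return df.index[all_present].tolist()

words = ['rhino', 'elephant', 'lion']  # assumed word list, taken from the pattern
row_index = rows_with_all_words(df, words)
print(row_index)
```

Unlike calling .all(axis=1) on the combined pattern, this does not require every column to match. It only requires each word to show up at least once in the row, so a row like rhino / rhino / lion would be rejected because "elephant" is missing. It returns a list, so an empty list signals "no match" without being confused with row 0.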