I have a pandas DataFrame with 2 columns: one containing a regex pattern and the other the actual string. I want to keep only the rows where the string in the data column matches the pattern in the pattern column.
My data is in a CSV file and looks like this:
pattern,data
1234.*,abcd
567_.*,567_hello
I am expecting the output DataFrame to look like this:
pattern,data
567_.*,567_hello
I tried using a lambda function on each row of the DataFrame, but to no avail:
df[df.apply(lambda row: re.compile(row[0]).match(row[1]))]
df[df.apply(lambda row: re.compile(row[0].str).match(row[1].str))]
df[df.apply(lambda row: re.compile(row['pattern']).match(row['data']))]
I could achieve this by iterating over the rows, filtering, and building a new DataFrame from the matches. But iterating a DataFrame row by row is inefficient, so I am looking for a more Pythonic approach.
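For reference, the apply-based attempts above fail for two reasons: apply needs axis=1 to pass whole rows to the lambda, and the mask used for indexing must be boolean (re.match returns a Match object or None, not True/False). A minimal working sketch of that approach, using the sample data from the question:

```python
import re
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'pattern': ['1234.*', '567_.*'],
                   'data': ['abcd', '567_hello']})

# axis=1 makes apply pass each row as a Series; bool() converts the
# Match-object-or-None result of re.match into a usable boolean mask
mask = df.apply(lambda row: bool(re.match(row['pattern'], row['data'])), axis=1)
filtered = df[mask]
print(filtered)
```

Note that re.match only anchors at the start of the string; use re.fullmatch if the whole string must match the pattern.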
m = [bool(re.match(p, d)) for p, d in zip(df['pattern'], df['data'])], then do df[m] to get the matches?
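Put end to end, that list-comprehension suggestion looks like this on the question's sample data (a sketch, assuming the two-column frame above):

```python
import re
import pandas as pd

df = pd.DataFrame({'pattern': ['1234.*', '567_.*'],
                   'data': ['abcd', '567_hello']})

# Zip the two columns and build a boolean mask: re.match returns None
# when the pattern does not match, which bool() turns into False
m = [bool(re.match(p, d)) for p, d in zip(df['pattern'], df['data'])]
filtered = df[m]
print(filtered)
```

This avoids apply entirely; the list comprehension over zipped columns is typically faster than row-wise apply because it skips constructing a Series for every row.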