Remove everything that doesn't match regex patterns in Python

Question

I have a regex pattern that identifies dates in a whole column. Some of the values are plain dates by themselves, while others have the date embedded in a longer string. My regex pattern finds every date perfectly, but now I want to be able to say "remove everything that doesn't fit the date pattern", which would get rid of the text that's either in front of or behind some dates.

Example of the stuff I want gone:

Mexico [12/20/1985]

If I could remove what doesn't match the pattern, then the brackets and "Mexico" would go away.

Say my regex pattern is the following (I have two more that match more specific date formats, but I'm not including them because that's beside the point):

pattern = (r"(19|20)\d\d")

I'm using has_date = data.str.contains(pattern) and it works perfectly to find what I'm looking for. But, now that I've identified the observations that have the dates that I want, I need to strip/remove/replace with nothing everything that isn't that pattern.
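For illustration, the filtering step described above can be sketched in plain Python with the standard re module standing in for the pandas call. The sample strings are made up; only the pattern comes from the question.

```python
import re

# Pattern from the question; the sample strings below are hypothetical.
pattern = r"(19|20)\d\d"
data = ["Mexico [12/20/1985]", "12/20/1985", "no date here"]

# Rough equivalent of data.str.contains(pattern):
# True wherever the pattern occurs anywhere in the string.
has_date = [re.search(pattern, s) is not None for s in data]

# The flags select the entries that contain a year,
# but the surrounding text in those entries is still there.
with_dates = [s for s, flag in zip(data, has_date) if flag]
print(with_dates)
```

This shows the gap: the matching entries are identified, but "Mexico [" and "]" are still part of the first one.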

I made a file of what didn't match the regex patterns and what did, and checked to make sure my regex patterns got everything, so I'm good on that front.

Anyone have any suggestions on how to replace what isn't my pattern? Welcome any thoughts. Thanks

It sounds like you want to extract the text your pattern matches. Try df['Dates'] = df['Data'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).fillna('') (if Data is the column with the original text and Dates is the target column).
– Wiktor Stribiżew, Commented Mar 28, 2019 at 21:33

Wiktor Stribiżew · Accepted Answer · 2019-04-01 20:57:53Z

To address your exact problem, namely replacing everything not matching the pattern, you may use

df['Data'] = df['Data'].str.replace(r"(?s)((?:19|20)\d\d)?.", r"\1", regex=True)

See the regex demo.

Here, (?s) makes . match any character, including line breaks. ((?:19|20)\d\d)? is an optional capturing group #1 that matches either 19 or 20 followed by any two digits, one or zero times. The final . then matches any single character. If Group 1 matched, its text is put back into the result by the \1 backreference; otherwise the match is replaced with an empty string.
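To see what this substitution does to individual strings, here is a small sketch using the standard re module instead of pandas, on the assumption that the pandas regex replacement behaves like re.sub applied to each cell. The function names and sample strings are hypothetical. One thing the sketch brings out: because the final . must consume a character, a year at the very end of a string has nothing after it and is dropped. The second pattern, which uses alternation instead of an optional group, is not part of the answer; it is shown only as one possible way around that case.

```python
import re

# Pattern from the answer: optional year in group 1, then exactly one more character.
keep_year = re.compile(r"(?s)((?:19|20)\d\d)?.")

# Hypothetical variant (not from the answer): match either a year or a single character.
keep_year_alt = re.compile(r"(?s)((?:19|20)\d\d)|.")

def strip_non_years(text):
    # An unmatched group 1 is substituted as an empty string (Python 3.5+).
    return keep_year.sub(r"\1", text)

def strip_non_years_alt(text):
    return keep_year_alt.sub(r"\1", text)

print(strip_non_years("Mexico [12/20/1985] trip"))  # 1985
print(strip_non_years("12/20/1985"))                # empty: the year ends the string
print(strip_non_years_alt("12/20/1985"))            # 1985
```

Both versions concatenate every year they keep, so a string with several years collapses into one run of digits.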

However, it seems you want to just extract the year from the data, and in case there is none, just get an empty string, so use

df['Data'] = df['Data'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).fillna('')

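The pattern \b((?:19|20)\d{2})\b matches 19 or 20 followed by any two digits as a whole word (due to the \b word boundaries). str.extract returns the text captured by the first match, or NaN when nothing matches, which fillna('') then turns into an empty string.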
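The effect of the word boundaries is easiest to see on a few strings. The following is a plain-Python sketch with the re module, meant to mirror what the extract call does per cell (first match, empty string when there is none). The helper name and the sample strings are hypothetical.

```python
import re

# Pattern from the answer: a 19xx or 20xx year as a whole word.
year = re.compile(r"\b((?:19|20)\d{2})\b")

def extract_year(text):
    # Return the first year found, or an empty string when there is none,
    # mimicking str.extract(..., expand=False).fillna('').
    m = year.search(text)
    return m.group(1) if m else ""

print(extract_year("Mexico [12/20/1985]"))  # 1985
print(extract_year("12/20/1985"))           # 1985
print(extract_year("ID 120195"))            # empty: 2019 sits inside a longer number

# Without the boundaries, the digits inside the longer number would match.
print(re.search(r"(19|20)\d\d", "ID 120195").group(0))  # 2019
```

Unlike the replace-based approach, this keeps only the first year in each string and does not depend on what follows it.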

1 Comment

hapigolucki Over a year ago

Thank you!! Sorry for the late response, but I hadn't been able to get into my account. That solution was very helpful
