
I have a regex pattern that identifies dates in a column. Some of the values are plain dates by themselves, while others are dates embedded in a longer string. My regex pattern finds every date perfectly, but now I want to be able to say "remove everything that doesn't fit the date pattern", which would get rid of the text that's in front of or behind some dates.

Example of the stuff I want gone:

Mexico [12/20/1985]

If I could remove what doesn't match the pattern, then the brackets and "Mexico" would go away.

Say my regex pattern is the following (I have two more that match more specific date formats, but I'm not including them because that's beside the point):

pattern = (r"(19|20)\d\d")

I'm using has_date = data.str.contains(pattern), and it works perfectly to find what I'm looking for. But now that I've identified the observations that contain the dates I want, I need to strip everything that isn't that pattern (that is, replace it with nothing).

I made a file of what didn't match the regex patterns and what did, and checked to make sure my regex patterns got everything, so I'm good on that front.

Anyone have any suggestions on how to replace what isn't my pattern? Welcome any thoughts. Thanks

  • This sounds like you want to extract the text your pattern matches. Try df['Dates'] = df['Data'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).fillna('') (where Data is the column with the original text and Dates is the target column). Commented Mar 28, 2019 at 21:33
  • Have you tried that yet? Commented Mar 29, 2019 at 22:18

1 Answer


To address your exact problem, namely replacing everything not matching the pattern, you may use

df['Data'] = df['Data'].str.replace(r"(?s)((?:19|20)\d\d)?.", r"\1", regex=True)

See the regex demo.

Here, (?s) makes . match any character, including line breaks. ((?:19|20)\d\d)? is an optional capturing group #1 that matches 19 or 20 followed by any two digits, one or zero times. The . then matches any single character. If Group 1 matched, the \1 backreference puts it back into the result; otherwise the matched character is simply dropped.
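One caveat is worth checking against your own data: because the trailing . must consume a character, a year that sits at the very end of a string has nothing after it, so the regex engine backtracks, skips the optional group, and the year is dropped. The sketch below uses plain re to compare the pattern above with an alternation variant. The alternation variant, the keep_years helper, and the sample strings are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: optional year group followed by any single character.
OPTIONAL_GROUP = re.compile(r"(?s)((?:19|20)\d\d)?.")

# Assumed variant: match EITHER a year (captured) OR any single character.
ALTERNATION = re.compile(r"(?s)((?:19|20)\d\d)|.")


def keep_years(text, pattern):
    """Replace each match with group 1, or with nothing if group 1 did not match."""
    return pattern.sub(lambda m: m.group(1) or "", text)


# Hypothetical samples: a year followed by more text, and a year at the end.
samples = ["Mexico [12/20/1985]", "12/20/1985"]

optional_results = [keep_years(s, OPTIONAL_GROUP) for s in samples]
alternation_results = [keep_years(s, ALTERNATION) for s in samples]

print(optional_results)     # ['1985', '']
print(alternation_results)  # ['1985', '1985']
```

If your column can contain values that end with the year, the alternation form is probably the safer of the two to pass to str.replace.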

However, it seems you just want to extract the year from the data and get an empty string when there is none, so use

df['Data'] = df['Data'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).fillna('')

The pattern \b((?:19|20)\d{2})\b matches 19 or 20 followed by any two digits as a whole word (due to the \b word boundaries). Note that str.extract returns only the first match in each string.
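As a quick check of the extract approach, here is a minimal sketch on a small made-up frame. The sample rows are hypothetical; the column names Data and Dates follow the comment under the question.

```python
import pandas as pd

# Hypothetical sample data: a date inside text, a bare date, and a row with no date.
df = pd.DataFrame({"Data": ["Mexico [12/20/1985]", "12/20/2003", "no date here"]})

year_pattern = r"\b((?:19|20)\d{2})\b"

# With one capture group and expand=False, str.extract returns a Series.
# Rows without a match come back as NaN, which fillna('') turns into ''.
df["Dates"] = df["Data"].str.extract(year_pattern, expand=False).fillna("")

years = df["Dates"].tolist()
print(years)  # ['1985', '2003', '']
```

Writing the result to a separate Dates column keeps the original text available for checking; assign back to Data instead if you want to overwrite it.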


1 Comment

Thank you!! Sorry for the late response, but I hadn't been able to get into my account. That solution was very helpful
