I have a regex pattern that identifies dates in a whole column of dates. Some of the dates are embedded in a longer string, while others are just plain dates by themselves. My regex pattern finds every date perfectly, but now I want to be able to say "remove everything that doesn't fit the date pattern", which would get rid of the text that's either in front of or behind some dates.
Example of the stuff I want gone:
Mexico [12/20/1985]
If I could remove what doesn't match the pattern, then the brackets and "Mexico" would go away.
Say my regex pattern is the following (I have two more that match more specific date formats, but I'm not including them because that's beside the point):
pattern = (r"(19|20)\d\d")
I'm using has_date = data.str.contains(pattern) and it works perfectly to find what I'm looking for. But, now that I've identified the observations that have the dates that I want, I need to strip/remove/replace with nothing everything that isn't that pattern.
I made a file of what didn't match the regex patterns and what did, and checked to make sure my regex patterns got everything, so I'm good on that front.
Anyone have any suggestions on how to replace what isn't my pattern? Welcome any thoughts. Thanks
df['Dates'] = df['Data'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).fillna('') (if Data is the column with the original text and Dates is the target column).
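A minimal sketch of that approach on made-up sample data. The idea is to extract the part that matches rather than delete the part that doesn't. The sample rows and the `full_date_pattern` for MM/DD/YYYY dates are assumptions for illustration, since the question doesn't show the asker's other two patterns.

```python
import pandas as pd

# Made-up sample rows; only the first one comes from the question.
df = pd.DataFrame(
    {"Data": ["Mexico [12/20/1985]", "1999", "no date here", "03/04/2001 Canada"]}
)

# Year-only pattern from the answer: one capture group, so with
# expand=False str.extract returns a Series of the captured text.
year_pattern = r"\b((?:19|20)\d{2})\b"
df["Dates"] = df["Data"].str.extract(year_pattern, expand=False).fillna("")

# Hypothetical pattern for a full MM/DD/YYYY date (not from the question).
full_date_pattern = r"\b(\d{1,2}/\d{1,2}/(?:19|20)\d{2})\b"
df["FullDates"] = df["Data"].str.extract(full_date_pattern, expand=False).fillna("")

print(df)
```

Rows with no match come back as NaN from str.extract, which is why the fillna('') is there. Note that str.extract only keeps the first match in each row; if a row can hold more than one date, str.findall or str.extractall would be the thing to look at instead.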