I am having a hard time trying to sort dates with different formats. I have a Series with inputs containing dates in many different formats and need to extract them and sort them chronologically. So far I have setup different regex for fully numerical dates (01/01/1989), dates with month (either Mar 12 1989 or March 1989 or 12 Mar 1989) and dates where only the year is given (see code below)
pat1=r'(\d{0,2}[/-]\d{0,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2=r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|June|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3=r'((?<!\d)(\d{4})(?!\d))'
finalpat=pat1 + "|"+ pat2 + "|" + pat3
df2=df1.str.extractall(finalpat).groupby(level=0).first()
I now got a dataframe with the different regex expressions above in different columns that I need to transform in usable times.
The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe. Without two formats ( Month dd YYYY and dd Month YYYY) I can easily do this :
df3=df2.copy()
dico={"Jan":'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
df3[1]=df3[1].str.replace("(?<=[A-Z]{1}[a-z]{2})\w*","") # we replace the month in the column by its number, and remove
for key,item in dico.items(): # the letters in month after the first 3.
df3[1]=df3[1].str.replace(key,item)
df3[1]=df3[1].str.replace("^(\d{1,2}/\d{4})",r'01/\g<1>')
df3[1]=pd.to_datetime(df3[1],format='%d/%m/%Y').dt.strftime('%Y%m%d') # add 01 if no day given
where df3[1] is the column of interest. I use a dictionary to change Month to their number and get my dates as I want them. The problem is that with two formats of dates ( Mar 12 1989 and 12 Mar 1989), one of the two format will be wrongly transformed.
Is there a way to discriminate between the date formats and apply different transformations accordingly ?
Thanks a lot