I am having a hard time sorting dates that come in different formats. I have a Series whose entries contain dates in many different formats, and I need to extract them and sort them chronologically. So far I have set up separate regexes for fully numerical dates (01/01/1989), dates with a month name (Mar 12 1989, March 1989 or 12 Mar 1989) and dates where only the year is given (see code below):

pat1=r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})' # matches mm/dd/yy and mm/dd/yyyy
pat2=r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})' # matches dates with a month name
pat3=r'((?<!\d)(\d{4})(?!\d))' # matches a standalone 4-digit year
finalpat=pat1 + "|" + pat2 + "|" + pat3
# df1 is the Series of raw strings; keep the first match found in each entry
df2=df1.str.extractall(finalpat).groupby(level=0).first()
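To make the shape of that result concrete, here is a minimal sketch on a made-up Series. The sample strings and the names `s`, `out` and `month_col` are assumptions, not the original data, and the month alternation uses `Jun` so that the abbreviation also matches. Each alternative of the combined pattern fills its own column, and the optional `\W?` at the start of the second pattern can capture a leading space, hence the `strip()`.

```python
import pandas as pd

pat1 = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'
pat2 = r'((\d{1,2})?\W?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\W+(\d{1,2})?\W?\d{4})'
pat3 = r'((?<!\d)(\d{4})(?!\d))'
finalpat = pat1 + "|" + pat2 + "|" + pat3

# Hypothetical inputs, one per kind of date.
s = pd.Series(["seen on 01/01/1989", "since Mar 12 1989", "back in 1975"])

# One row per input, one column per capture group (7 groups in total).
out = s.str.extractall(finalpat).groupby(level=0).first()

# Column 0: numeric dates, column 1: month-name dates, column 5: bare years.
month_col = out[1].str.strip()  # drop the space captured by the leading \W?
print(out)
```

Only column 1 mixes several layouts (month first, day first, no day), which is where the difficulty described next comes from.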

I now have a dataframe with the matches of the different regexes above in different columns, and I need to transform them into usable dates.

The problem is that I have dates like Mar 12 1989, 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe. If the column did not mix the two formats (Month dd YYYY and dd Month YYYY), I could easily do this:

df3=df2.copy()

dico={"Jan":'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06','Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}

# keep only the first 3 letters of each month name
df3[1]=df3[1].str.replace(r"(?<=[A-Z]{1}[a-z]{2})\w*","",regex=True)
# replace each month abbreviation by its number
for key,item in dico.items():
    df3[1]=df3[1].str.replace(key,item)
# turn the separators into "/" so the values match the format used below
df3[1]=df3[1].str.strip().str.replace(r"\W+","/",regex=True)
# add 01 if no day is given
df3[1]=df3[1].str.replace(r"^(\d{1,2}/\d{4})",r'01/\g<1>',regex=True)

df3[1]=pd.to_datetime(df3[1],format='%d/%m/%Y').dt.strftime('%Y%m%d')

where df3[1] is the column of interest. I use a dictionary to change each month name to its number and get my dates in the form I want. The problem is that with two date formats in the column (Mar 12 1989 and 12 Mar 1989), one of the two will be transformed wrongly.

Is there a way to discriminate between the date formats and apply different transformations accordingly?
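One way to read "discriminate between formats" is to test each value against one regex per layout and parse every group with its own format string. The following is only a sketch under assumptions: `col` is a made-up stand-in for the month-name column, the `layouts` list is illustrative, and `%b` relies on English month names in the active locale.

```python
import pandas as pd

# Hypothetical sample standing in for the month-name column (df3[1] in the question).
col = pd.Series(["Mar 12 1989", "12 March 1989", "Mar 1989", "June 1990"])

# Trim month names to three letters and collapse separators to single spaces.
short = col.str.strip().str.replace(r"([A-Z][a-z]{2})[a-z]*", r"\1", regex=True)
short = short.str.replace(r"\W+", " ", regex=True)

# One (regex, strptime format) pair per layout.
layouts = [
    (r"^[A-Za-z]{3} \d{1,2} \d{4}$", "%b %d %Y"),  # Mar 12 1989
    (r"^\d{1,2} [A-Za-z]{3} \d{4}$", "%d %b %Y"),  # 12 Mar 1989
    (r"^[A-Za-z]{3} \d{4}$", "%b %Y"),             # Mar 1989, day defaults to 1
]

# Parse each group with its own format, then restore the original row order.
parts = [pd.to_datetime(short[short.str.match(p)], format=f) for p, f in layouts]
parsed = pd.concat(parts).reindex(col.index)

result = parsed.dt.strftime("%Y%m%d")
print(result.tolist())
```

Values that match none of the layouts come back as missing after the `reindex`, so they are easy to spot and handle separately.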

Thanks a lot

2 Comments
  • What about normalizing your data first and then processing the normalized date format? Commented Jun 21, 2022 at 8:55
  • @MarcinOrlowski is right. If you normalise to YYYYMMDD you can sort lexicographically. Commented Jun 21, 2022 at 9:01
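The point about lexicographic sorting can be checked with a small standard-library sketch. The sample dates and variable names are made up, and `%b` assumes English month abbreviations.

```python
from datetime import datetime

# Hypothetical dates, all in the same "dd Mon YYYY" layout.
raw = ["12 Mar 1989", "01 Jan 1990", "25 Dec 1988"]

# Normalise to YYYYMMDD: year first, zero-padded, fixed width.
keys = [datetime.strptime(d, "%d %b %Y").strftime("%Y%m%d") for d in raw]

sorted_raw = sorted(raw)    # string order of the raw text is not chronological
sorted_keys = sorted(keys)  # string order of YYYYMMDD is chronological
print(sorted_raw)
print(sorted_keys)
```

Because the most significant field comes first and every field has a fixed width, comparing the strings character by character gives the same order as comparing the dates.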

1 Answer

"The problem I have is that I got dates like Mar 12 1989 and 12 Mar 1989 and Mar 1989 (no day) in the same column of my dataframe."

pandas.to_datetime can cope with that. Consider the following example:

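import pandas as pd
df = pd.DataFrame({'d_str':["Mar 12 1989", "12 Mar 1989", "Mar 1989"]})
# pandas < 2.0 parses each element on its own; on pandas >= 2.0 pass format="mixed" to get the same behaviour
df['d_dt'] = pd.to_datetime(df.d_str)
print(df)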

output

         d_str       d_dt
0  Mar 12 1989 1989-03-12
1  12 Mar 1989 1989-03-12
2     Mar 1989 1989-03-01

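You can now sort on d_dt, which has dtype datetime64[ns]. Keep in mind that a missing day is treated as the 1st day of the given month. Be warned, though, that this can go wrong if your data also contain all-numeric dates in middle-endian format (mm/dd/yy): a string such as 01/02/89 is ambiguous, and pandas reads it month-first unless you pass dayfirst=True.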
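As a hedged sketch of the full parse-then-sort step: pandas 2.0 and later infer a single format from the first element and raise on values that do not fit it, unless `format="mixed"` is passed, while older versions parse element by element by default. The version check, the `kwargs` helper and the sample strings below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample with three different layouts, deliberately out of order.
df = pd.DataFrame({"d_str": ["Mar 1989", "12 Mar 1989", "Jan 5 1988"]})

# pandas >= 2.0 needs format="mixed" to infer the format per element;
# older versions do not know that value and already parse per element.
major = int(pd.__version__.split(".")[0])
kwargs = {"format": "mixed"} if major >= 2 else {}

df["d_dt"] = pd.to_datetime(df["d_str"], **kwargs)

# Sort chronologically on the parsed column.
df_sorted = df.sort_values("d_dt").reset_index(drop=True)
print(df_sorted)
```

Sorting on the datetime column rather than on the strings keeps "Mar 1989" (read as 1 March) ahead of "12 Mar 1989".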

1 Comment

FML... and also thanks a lot! I was trying to use pd.to_datetime with a specific format (%m/%d%Y), which is why my program was failing on half the values.