How can I parse multiple (unknown) date formats in python?

Question

I have a bunch of excel documents I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:

The good thing is I know it is always Month/Day

I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?

Are dates always above 2000, and if not, where should the split between the 1900s and 2000s be?
– Tim Pietzcker, Commented Aug 13, 2011 at 5:47

Derek 朕會功夫 · Accepted Answer · 2017-07-21 00:06:34Z

28

The third-party module dateutil has a function parse that operates similarly to PHP's strtotime: you don't need to specify a particular date format, it just tries a bunch of its own.

>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0)  # defaults to the American (month-first) format

It also allows you to specify different assumptions:

dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).

yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
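
As a rough illustration of these two flags, the sketch below parses the ambiguous string from the descriptions above in three ways. The helper name interpretations is made up for this example; parse, dayfirst and yearfirst are the dateutil names quoted above. The import guard is only there so the snippet still loads where dateutil is not installed.

```python
try:
    from dateutil.parser import parse  # third-party: pip install python-dateutil
except ImportError:
    parse = None  # dateutil missing: the demo below is skipped


def interpretations(text):
    """Return the default, day-first and year-first readings of `text`.

    Hypothetical helper for illustration; it only forwards to dateutil's parse().
    """
    return {
        "default": parse(text),
        "dayfirst": parse(text, dayfirst=True),
        "yearfirst": parse(text, yearfirst=True),
    }


if parse is not None:
    # Print each reading in the MM/DD/YYYY format the question asks for.
    for name, value in interpretations("01/05/09").items():
        print(name, value.strftime("%m/%d/%Y"))
```

Since the question states that the data is always Month/Day, the default settings should already match it; the flags matter mostly when the same code has to handle other sources.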

edited Jul 21, 2017 at 0:06

Derek 朕會功夫


answered Aug 13, 2011 at 6:00

John Flatness


2 Comments

eyquem Over a year ago

Does the function parse of dateutil detect dates in a string , or does it need to receive a date as argument ?

CS QGB Over a year ago

parse('2023年12月4日10点', fuzzy=True) returned datetime.datetime(2010, 12, 4, 0, 0), which is the wrong result.

eyquem · Accepted Answer · 2011-08-13 08:26:53Z

import re

ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''


regx = re.compile('[-/]')
for xd in ss.splitlines():
    m,d,y = regx.split(xd)
    # pad month and day to 2 digits; prefix 2-digit years with '20'
    print(xd, '   ', '/'.join((m.zfill(2), d.zfill(2), '20'+y.zfill(2) if len(y)==2 else y)))

result

10/02/09     10/02/2009
07/22/09     07/22/2009
09-08-2008     09/08/2008
9/9/2008     09/09/2008
11/4/2010     11/04/2010
03-07-2009     03/07/2009
09/01/2010     09/01/2010

Edit 1 and Edit 2: taking into account the information on '{0:0>2}'.format(day) from JBernardo, I added a fourth solution, which appears to be the fastest.

import re
from datetime import datetime
from time import perf_counter as clock  # time.clock was removed in Python 3.8

iterat = 100

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
r"""\s*  # optional whitespace
(\d+)    # Month
[-/]     # separator
(\d+)    # Day
[-/]     # separator
(?:20)?  # century (optional)
(\d+)    # years (YY)
\s*      # optional whitespace""",
re.VERBOSE)

# 1) Tim's method: regex substitution, then a strptime/strftime round trip
te = clock()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", clock()-te, 'seconds')


regx = re.compile('[-/]')

# 2) mixing solution: Tim's regex for matching, zfill() for padding
te = clock()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", clock()-te, 'seconds')

# 3) eyquem's method: split on the separator, then pad each part
te = clock()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2),d.zfill(2),('20'+y.zfill(2) if len(y)==2 else y)))
              for m,d,y in ndates]
print("eyquem's method", clock()-te, 'seconds')

# 4) Tim + format: Tim's regex for matching, str.format() for padding
te = clock()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", clock()-te, 'seconds')

# all four methods must produce the same list
print(fdates1 == fdates2 == fdates3 == fdates4)

result

number of iteration's turns : 100
Tim's method    0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format    0.0153756971906 seconds 
True

The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.

That is even more true for the solution combining Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own method because regx.split(date.strip()) produces a year with 2 OR 4 digits.
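
To illustrate what detecting dates inside a longer string could look like when the year has either 2 or 4 digits, here is a minimal sketch. The pattern, the find_dates name and the sample sentence are assumptions made for this example, not code from the benchmark above, and it assumes every 2-digit year belongs to the 2000s.

```python
import re

# Month and day: 1-2 digits; year: 4 digits or 2 digits; separator: '-' or '/'.
DATE_RE = re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{4}|\d{2})\b")


def find_dates(text):
    """Return every date found in `text`, formatted as MM/DD/YYYY."""
    found = []
    for month, day, year in DATE_RE.findall(text):
        if len(year) == 2:
            year = "20" + year  # assumption: 2-digit years are all 20xx
        found.append("{:0>2}/{:0>2}/{}".format(month, day, year))
    return found


print(find_dates("Invoiced 9/9/2008, paid 03-07-09."))
```

Capturing the full year and expanding it afterwards sidesteps the 2-or-4-digit problem, at the cost of one extra length check per match.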

I had already used my upvote on your first answer, but I would +1 it again for the performance improvements and testing.

Tim Pietzcker · Accepted Answer · 2011-08-13 06:01:38Z

10

If you don't want to install a third-party module like dateutil:

import re
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""", 
    re.VERBOSE)
ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
fdates = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
          for date in ndates]

Result:

['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']
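
One side effect of going through strptime is that impossible dates are rejected. A minimal sketch of wrapping the same idea in a function follows; the name normalize and the choice to return None on failure are assumptions for illustration, and like the code above it assumes all years are in the 2000s.

```python
import re
from datetime import datetime

# Same pattern as above, written on one line.
DATE_RE = re.compile(r"\s*(\d+)[-/](\d+)[-/](?:20)?(\d+)\s*")


def normalize(text):
    """Return `text` as MM/DD/YYYY, or None if it is not a valid date."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        # strptime raises ValueError for things like month 13 or Feb 30.
        parsed = datetime.strptime("{}/{}/20{}".format(month, day, year), "%m/%d/%Y")
    except ValueError:
        return None
    return parsed.strftime("%m/%d/%Y")


for sample in [" 03-07-2009", "9/9/2008", "02/30/09", "not a date"]:
    print(repr(sample), "->", normalize(sample))
```

Whether to return None or let the ValueError propagate depends on how the bad rows in the spreadsheets should be handled before they reach the database.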

answered Aug 13, 2011 at 6:01

Tim Pietzcker


2 Comments

eyquem Over a year ago

Hello @Tim Pietzcker. strptime is a very slow function; see the edit to my answer. Using date as the name of something other than the class datetime.date isn't a good idea because it can shadow datetime.date. That isn't the case in your code, but it is risky for any code your snippet gets included in. It would also be better to make ndates a generator.

eyquem Over a year ago

@Tim Pietzcker The "Tim + format" solution is shorter, clearer and faster than your original solution (see the edit in my answer), so your solution, though heavily upvoted, isn't the best, sorry.

JBernardo · Accepted Answer · 2011-08-13 06:05:06Z

4

You can use a regex like r'(\d+)\D(\d+)\D(\d+)' to get the month, day and year in a tuple with the re.findall function.

Then just prepend 20 or 19 to the 2-digit years and use the separator you want to join them back:

'/'.join(the_list)

As pointed out by Tim:

To normalize days, just do '{0:0>2}'.format(day) and the same to months.
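
Put together, these steps might look like the sketch below. The pivot of 50 for choosing between 19 and 20 is purely an assumption (the question does not say where the split should be), and expand_year and normalize_all are hypothetical names.

```python
import re

PARTS_RE = re.compile(r"(\d+)\D(\d+)\D(\d+)")


def expand_year(year, pivot=50):
    """Expand a 2-digit year; values below `pivot` become 20xx, others 19xx."""
    if len(year) != 2:
        return year  # already a full year, leave it alone
    century = "20" if int(year) < pivot else "19"
    return century + year


def normalize_all(text, sep="/"):
    """Find every month/day/year triple in `text` and return padded dates."""
    results = []
    for month, day, year in PARTS_RE.findall(text):
        parts = ["{0:0>2}".format(month), "{0:0>2}".format(day), expand_year(year)]
        results.append(sep.join(parts))
    return results


print(normalize_all("10/02/09\n9/9/2008\n3-7-98"))
```

The pivot should be chosen from what the data actually contains; if every date is known to be after 2000, always prepending 20 is simpler.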

edited Aug 13, 2011 at 6:05

answered Aug 13, 2011 at 5:59

JBernardo
