18

I have a bunch of Excel documents that I am extracting dates from, and I am trying to convert those dates to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back?

The good thing is that I know the month always comes before the day. Here is a small sample of my data:

10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
 03-07-2009
09/01/2010

I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?

2
  • Are dates always above 2000, and if not, where should the split between the 1900s and 2000s be? Commented Aug 13, 2011 at 5:47
  • Yes, dates are always post 2000 Commented Aug 13, 2011 at 5:53

4 Answers

28

The third-party module dateutil has a function, parse, that works much like PHP's strtotime: you don't need to specify a particular date format, because it tries a number of formats of its own.

>>> from dateutil.parser import parse
>>> parse("10/02/09", fuzzy=True)
datetime.datetime(2009, 10, 2, 0, 0)  # defaults to American (month-first) order

It also allows you to specify different assumptions:

  • dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
  • yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).

2 Comments

Does dateutil's parse function detect dates inside a larger string, or does it need to receive only the date as its argument?
parse('2023年12月4日10点', fuzzy=True) returned datetime.datetime(2010, 12, 4, 0, 0), which is the wrong result.
16
import re

ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''


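regx = re.compile('[-/]')
for xd in ss.splitlines():
    # split on either separator into month, day, year
    m, d, y = regx.split(xd)
    # zero-pad month and day; prefix 2-digit years with '20'
    print(xd, '   ', '/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2)) if len(y) == 2 else y)))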

result

10/02/09     10/02/2009
07/22/09     07/22/2009
09-08-2008     09/08/2008
9/9/2008     09/09/2008
11/4/2010     11/04/2010
03-07-2009     03/07/2009
09/01/2010     09/01/2010

Edits 1 and 2: following JBernardo's tip about '{0:0>2}'.format(day), I added a fourth solution, which appears to be the fastest.

import re
from datetime import datetime
from time import perf_counter  # time.clock was removed in Python 3.8

iterat = 100

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
r"""\s*  # optional whitespace
(\d+)    # Month
[-/]     # separator
(\d+)    # Day
[-/]     # separator
(?:20)?  # century (optional)
(\d+)    # years (YY)
\s*      # optional whitespace""",
re.VERBOSE)

regx = re.compile('[-/]')


# 1. Tim's method: regex substitution, then a strptime/strftime round trip
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date, "%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", perf_counter() - te, 'seconds')


# 2. Mixing solution: Tim's regex for matching, zfill for padding
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", perf_counter() - te, 'seconds')


# 3. eyquem's method: split on the separator, pad, and fix 2-digit years
te = perf_counter()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2) if len(y) == 2 else y)))
               for m, d, y in ndates]
print("eyquem's method", perf_counter() - te, 'seconds')


# 4. Tim's regex combined with str.format padding
te = perf_counter()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", perf_counter() - te, 'seconds')


# all four methods must produce the same list
print(fdates1 == fdates2 == fdates3 == fdates4)

result

number of iterations: 100
Tim's method    0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format    0.0153756971906 seconds 
True

The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.

That is even more true of the solution that combines Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own method because regx.split(date.strip()) produces a year with either 2 or 4 digits.
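One possible way around the 2-or-4-digit year, sketched here for Python 3, is to keep only the last two digits of the year before formatting. This is only a sketch: it relies on the assumption, taken from the question's comments, that every year is 2000 or later, and the name normalize_split is hypothetical. It was not part of the benchmark above, so no timing is claimed for it.

```python
import re

SEPARATOR = re.compile(r'[-/]')


def normalize_split(text):
    """Return MM/DD/YYYY for a month/day/year string (hypothetical helper).

    Assumes every year falls in the 2000s, as stated in the question's comments.
    """
    month, day, year = SEPARATOR.split(text.strip())
    # year[-2:] gives the same two digits whether the input year
    # was written with 2 digits ('09') or 4 digits ('2009')
    return '{:0>2}/{:0>2}/20{:0>2}'.format(month, day, year[-2:])


dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']
print([normalize_split(d) for d in dates])
```

Unlike the regex-based variants, this raises a ValueError on any string that does not split into exactly three parts, which may or may not be what you want for dirty input.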

1 Comment

I had already used my upvote on your first answer, but I would +1 it again for the performance improvements and testing.
10

If you don't want to install a third-party module like dateutil:

import re
from datetime import datetime
dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010', ' 03-07-2009', '09/01/2010']
reobj = re.compile(
    r"""\s*  # optional whitespace
    (\d+)    # Month
    [-/]     # separator
    (\d+)    # Day
    [-/]     # separator
    (?:20)?  # century (optional)
    (\d+)    # years (YY)
    \s*      # optional whitespace""", 
    re.VERBOSE)
ndates = [reobj.sub(r"\1/\2/20\3", date) for date in dates]
fdates = [datetime.strftime(datetime.strptime(date,"%m/%d/%Y"), "%m/%d/%Y")
          for date in ndates]

Result:

['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008', '11/04/2010', '03/07/2009', '09/01/2010']
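The strptime round trip above does more than reformat: strptime raises ValueError for strings that are not real calendar dates, so it doubles as a validity check. A minimal sketch of that behavior follows; the helper name is_real_date is hypothetical.

```python
from datetime import datetime


def is_real_date(text):
    """Return True if text is a valid MM/DD/YYYY calendar date (hypothetical helper)."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # raised for impossible dates such as February 30 or month 13
        return False
    return True


print(is_real_date("02/28/2009"))  # True
print(is_real_date("02/30/2009"))  # False
```

A purely string-based normalizer would pass "02/30/2009" through unchanged, so whether the extra cost of strptime is worth paying depends on how much you trust the input.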

2 Comments

Hello @Tim Pietzcker, strptime is a very slow function; see the edit to my answer. Also, using the name date for anything other than the datetime.date class isn't a good idea, because it shadows datetime.date. That isn't a problem in your code as written, but it is risky for any code your snippet gets pasted into. It would also be better to make ndates a generator.
@Tim Pietzcker The "Tim + format" solution is shorter, clearer, and faster than your pure solution (see the edit in my answer), so your solution, though over-upvoted, isn't the best, sorry.
4

You can use a regex like r'(\d+)\D(\d+)\D(\d+)' with the re.findall function to get the month, day, and year in a tuple.

Then just prefix the 2-digit years with 20 or 19 and use the separator you want to join them back:

'/'.join(the_list)

As pointed out by Tim:

To normalize days, just do '{0:0>2}'.format(day), and do the same for months.
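Put together, the steps above might look like the following Python 3 sketch. The function name normalize_findall and its century parameter are hypothetical; the default of '20' follows the asker's statement that all dates are after 2000.

```python
import re

DATE_PARTS = re.compile(r'(\d+)\D(\d+)\D(\d+)')


def normalize_findall(text, century='20'):
    """Return every date found in text as MM/DD/YYYY (hypothetical helper)."""
    results = []
    # findall yields one (month, day, year) tuple per match; note that \D
    # accepts any non-digit, so '1 2 3' would also be treated as a date
    for month, day, year in DATE_PARTS.findall(text):
        if len(year) == 2:
            year = century + year  # expand 2-digit years
        results.append('/'.join(('{0:0>2}'.format(month),
                                 '{0:0>2}'.format(day),
                                 year)))
    return results


sample = '10/02/09 07/22/09 09-08-2008 9/9/2008'
print(normalize_findall(sample))
# ['10/02/2009', '07/22/2009', '09/08/2008', '09/09/2008']
```

Because findall scans the whole string, this version also picks dates out of surrounding text, at the cost of the loose \D separator noted in the comment.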
