2

I have some really messed-up dates that I'm trying to get into a consistent format, %Y-%m-%d, where that applies. Some of the dates lack the day, and some are in the future or just plain impossible; those I'll simply flag as incorrect. How might I tackle such inconsistencies with Python?

sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00
3
  • There are some good date parsers that are good at figuring out possible formats, but I think the main problem you're going to run into is that the day and month are often ambiguous. If you have some dates where day comes before month and some where the reverse is true, I'm not sure if there's anything you can do. Commented Jul 1, 2015 at 4:48
  • Right that is the main problem :) Commented Jul 1, 2015 at 4:51
  • When you say "formatting", you actually mean the opposite: "parsing" (or "recognizing") inconsistent date formats. Commented Jul 1, 2015 at 15:25

3 Answers

3

Some of the values are ambiguous, so you can get different results depending on your priorities. For example, if you want all dates to be treated consistently, you could specify a list of formats to try, in priority order:

#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line)) # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else: # no break
        sys.stderr.write("failed to parse " + line)

Example:

$ python . <input.txt 
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01

You could use other criteria, e.g., you could instead maximize the number of dates that are parsed successfully, even if some dates are then treated inconsistently (the dateutil and pandas solutions might fall into this category).
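Since the question also asks to flag dates that lack a day or lie in the future, here is a minimal sketch of how the format-list approach could report that. The function name `parse_and_flag`, the `FORMATS` table and the flag strings are made up for illustration, and English month abbreviations (the default C locale) are assumed:

```python
import re
from datetime import date, datetime

# Each entry: (strptime format, which fields the input actually supplies).
# Same priority order as the format list above.
FORMATS = [
    ("%d %b %y", "dmy"),
    ("%m %d %y", "dmy"),
    ("%b %y", "my"),
    ("%d %b", "dm"),
    ("%b %Y", "my"),
]

def parse_and_flag(text, today=None):
    """Return (date or None, list of flags) for one raw date string."""
    today = today or date.today()
    normalized = " ".join(re.findall(r"\w+", text))  # normalize delimiters
    for fmt, fields in FORMATS:
        try:
            parsed = datetime.strptime(normalized, fmt).date()
        except ValueError:
            continue  # this format does not match, try the next one
        flags = []
        if "d" not in fields:
            flags.append("missing day")    # strptime filled in day 1
        if "y" not in fields:
            flags.append("missing year")   # strptime filled in 1900
        elif parsed > today:
            flags.append("in the future")
        return parsed, flags
    return None, ["unparseable"]

if __name__ == "__main__":
    for raw in ["4-Jul-97", "Jun-13", "7-Dec", "Sep, 2020", "90/1/90"]:
        print(raw, parse_and_flag(raw))
```

With a reference date of 2015-07-01, "Sep, 2020" comes back with both the "missing day" and "in the future" flags, while "90/1/90" is reported as unparseable. Whether a flagged date should be kept or discarded is a policy decision left to the caller.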


1 Comment

Thank you, this is actually really smart. I didn't think about normalizing all the delimiters.
2

You can use the dateutil parser if you want:

from dateutil.parser import parse

bad_dates = [...]  # the sample date strings from the question
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:  # e.g. unknown format or out-of-range day
        print("couldn't parse", d, err)

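Output (dateutil fills any missing fields from the current date, hence the 2015 years and the day-30 values):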

1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month

If you would like to flag any date that isn't an easy parse, you can check whether the string splits into three parts on one of the delimiters. If it does, try to parse it; otherwise flag it, like so:

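flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        # only attempt strings that split into exactly three parts
        if not any(len(d.split(s)) == 3 for s in splitters):
            raise ValueError("not a three-part date")
        good.append(parse(d))
    except Exception:  # not three parts, not a string, or unparseable
        flagged.append(d)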
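Another criterion you might flag on, raised in the comments on the question, is day/month ambiguity in the all-numeric dates. Below is a rough standard-library sketch (it does not use dateutil): it tries both a month-first and a day-first reading and reports a date as ambiguous when the two readings disagree. The helper names `day_month_readings` and `is_ambiguous` are made up for this example, and two-digit years are assumed.

```python
import re
from datetime import datetime

def day_month_readings(text):
    """Return the distinct dates given by month-first and day-first readings."""
    normalized = " ".join(re.findall(r"\w+", text))  # normalize delimiters
    readings = set()
    for fmt in ("%m %d %y", "%d %m %y"):
        try:
            readings.add(datetime.strptime(normalized, fmt).date())
        except ValueError:
            pass  # this reading is impossible for the given numbers
    return sorted(readings)

def is_ambiguous(text):
    """True when both readings are valid and give different dates."""
    return len(day_month_readings(text)) > 1

if __name__ == "__main__":
    for raw in ["5/12/92", "8/31/02", "10/10/03", "4-Jul-97"]:
        print(raw, day_month_readings(raw), is_ambiguous(raw))
```

Here "5/12/92" is ambiguous (May 12 or December 5), "8/31/02" has only one valid reading, and "10/10/03" is the same date either way. Strings with month names such as "4-Jul-97" produce no readings at all and need one of the other approaches.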

4 Comments

Same thing here: is the 5th date really June 13th, 2015, or is it just June 2013? That one would need to be flagged somehow.
@moku I could easily flag it, but it would depend. Is it always month name - year?
Well, for the most part yes, but I see that my example has a 7-Dec. I don't know what that is, haha, but it should be flagged.
@moku I added a way to flag any dates that could potentially be messed up. Hope it's a helpful start :)
1

pd.datetools.to_datetime (plain pd.to_datetime in later pandas versions, where pd.datetools no longer exists) will have a go at guessing for you. It seems to do OK with most of your dates, although you might want to add some rules of your own:

df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]: 
0     1997-07-04 00:00:00
1     2002-08-31 00:00:00
2     1995-05-20 00:00:00
3     1992-05-12 00:00:00
4     2015-06-13 00:00:00
5     1998-08-04 00:00:00
6                 90/1/90
7     1977-03-10 00:00:00
8     2015-12-07 00:00:00
9                     NaN
10    1998-04-03 00:00:00
11    1976-08-01 00:00:00
12    1990-03-01 00:00:00
13    2015-09-01 00:00:00
14    1974-04-01 00:00:00
15    2003-10-10 00:00:00
16                 Dec-00
Name: sample, dtype: object

1 Comment

OK, but is the 5th date really June 13th, 2015, or is it just June 2013? That one would need to be flagged somehow.
