Python regex to handle different types of dates

Question

I am trying to write a regex to identify some dates.

The string I am working on is:
'these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000\
 these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012\
 these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave.'

The regex looks like this:

re.findall('(\
[\b, ]\
([1-9]|0[1-9]|[12][0-9]|3[01])\
[-/.\s+]\
(1[1-2]|0[1-9]|[1-9]|Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sept|September|Oct|October|Nov|November|Dec|December)\
(?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?\
[^\da-zA-Z])',String)

The output I get is:

[(' 11-2-', '11', '2', ''),
 (' 24-3-1695-', '24', '3', '1695'),
 (' 4-02-2011,', '4', '02', '2011'),
 (' 12/12/1990,', '12', '12', '1990'),
 (' 31-11-1690,', '31', '11', '1690'),
 (' 11 July 1990,', '11', 'July', '1990'),
 (' 7 Oct 2012 ', '7', 'Oct', '2012'),
 (' 12 December ', '12', 'December', ''),
 (' 5 July 2001,', '5', 'July', '2001')]

Problems:

The first two outputs are wrong. They appear because of the optional expression (?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))? that I added to handle cases like "12 December". How do I get rid of them?
There is a case, "June 2000", that is not handled by the expression.
Can I add something to the expression that would handle this case without affecting the others?

This is a little bit different. I could have commented on the previous post with the additional problems, but I thought it would be good to have a new question.
– Sam, Commented Oct 15, 2015 at 10:00

Martin Evans · Accepted Answer · 2018-09-12 08:53:27Z

I would avoid trying to get a regular expression alone to parse your dates. As you have found, it starts out OK, but edge cases soon become hard to catch, for example invalid dates such as 31/09/2018.

A safer approach is to let Python's datetime decide if a date is valid or not. You can then easily specify valid date ranges and allowed date formats.
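As a minimal sketch of that idea: datetime.strptime raises ValueError for an impossible date, so validity and range checks need no extra regex logic. The helper name is_valid_date and the single '%d/%m/%Y' format are assumptions made for illustration only.

```python
from datetime import datetime

def is_valid_date(candidate, fmt='%d/%m/%Y',
                  valid_from=datetime(1920, 1, 1),
                  valid_to=datetime(2030, 1, 1)):
    """Return True if candidate parses with fmt and lies in the allowed range.

    Hypothetical helper: the name and the default format are illustrative.
    """
    try:
        dt = datetime.strptime(candidate, fmt)
    except ValueError:
        # strptime rejects impossible dates such as 31 September
        return False
    return valid_from <= dt <= valid_to

print(is_valid_date('31/09/2018'))  # False: September has 30 days
print(is_valid_date('12/12/1990'))  # True
print(is_valid_date('11/02/2222'))  # False: outside the allowed range
```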

The script below works by using a regular expression to extract all words and number groups. It then takes three parts at a time and tries each of the allowed date formats against them. If datetime succeeds in parsing a given format, the result is tested to ensure it falls within your allowed date range. If it is valid, the matching parts are skipped over to avoid a second match on a partial date.
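The "three parts at a time" step can be pictured as a sliding window over the extracted words. This is only a sketch of that one step; the function name sliding_triples is an assumption for illustration.

```python
from itertools import tee
import re

def sliding_triples(words):
    """Return an iterator of overlapping (w0, w1, w2) windows over words."""
    t1, t2, t3 = tee(words, 3)
    next(t2, None)   # advance the second copy by one item
    next(t3, None)   # advance the third copy by two items
    next(t3, None)
    return zip(t1, t2, t3)

words = re.findall(r'\b\w+\b', 'by 5 July 2001, he')
for triple in sliding_triples(words):
    print(triple)
# ('by', '5', 'July')
# ('5', 'July', '2001')
# ('July', '2001', 'he')
```

Because every window needs three words, a two-part date made of the final two words of the text never starts a window and so would not be seen by this approach.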

If the date found does not contain a year, a default_year is assumed:

from itertools import tee
from datetime import datetime
import re


valid_from = datetime(1920, 1, 1)
valid_to = datetime(2030, 1, 1)
default_year = 2018

dt_formats = [
    ['%d', '%m', '%Y'], 
    ['%d', '%b', '%Y'],
    ['%d', '%B', '%Y'],
    ['%d', '%b'],
    ['%d', '%B'],
    ['%b', '%d'],
    ['%B', '%d'],
    ['%b', '%Y'],
    ['%B', '%Y'],
]

text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

t1, t2, t3 = tee(re.findall(r'\b\w+\b', text), 3)
next(t2, None)
next(t3, None)
next(t3, None)
triples = zip(t1, t2, t3)

for triple in triples:
    for dt_format in dt_formats:
        try:
            dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)

            if valid_from <= dt <= valid_to:
                print(dt.strftime('%d-%m-%Y'))

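                # skip the triples that overlap the date just matched
                for _ in range(1, len(dt_format)):
                    next(triples, None)  # the default avoids StopIteration at the end of the text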
            break

        except ValueError:
            pass

For the text you have given, this would display:

04-02-2011
12-12-1990
11-07-1990
07-10-2012
12-12-2018
01-06-2000
05-07-2001

This is a great answer. I have an edited version in a new answer that returns the original string and the index of each match: stackoverflow.com/a/71321576/5125264

Matt · Accepted Answer · 2022-03-02 11:03:59Z

@Martin Evans's answer was great, but I also wanted to return the location of each match within the string:

>>> text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

>>> find_dates(text)

[('2011-02-04', 90, 99, '4-02-2011'),
 ('1990-12-12', 101, 111, '12/12/1990'),
 ('1990-07-11', 126, 138, '11 July 1990'),
 ('2012-10-07', 140, 150, '7 Oct 2012'),
 ('2022-12-12', 177, 192, '12 December six'),
 ('2000-06-01', 212, 224, 'June 2000 he'),
 ('2001-07-05', 234, 245, '5 July 2001')]

I have wrapped it up in a function that uses finditer instead of findall:

from itertools import tee
from datetime import datetime
import re

def find_dates(
    text,
    valid_from = datetime(1920, 1, 1),
    valid_to = datetime(2030, 1, 1),
    default_year = datetime.now().year,
    dt_formats = [
        ['%d', '%m', '%Y'], 
        ['%d', '%b', '%Y'],
        ['%d', '%B', '%Y'],
        ['%d', '%b'],
        ['%d', '%B'],
        ['%b', '%d'],
        ['%B', '%d'],
        ['%b', '%Y'],
        ['%B', '%Y'],
    ],
    ):
    # store your matches here
    dates = []
        
    t1, t2, t3 = tee(list(re.finditer(r'\b\w+\b', text)), 3)
    next(t2, None)
    next(t3, None)
    next(t3, None)
    triples = zip(t1, t2, t3)

    for triple in triples:
        # get start and end index of each triple
        start = triple[0].start()
        end = triple[-1].end()

        # convert matches to a list of three strings
        triple = [text[t.start():t.end()] for t in triple]

        for dt_format in dt_formats:
            try:
                dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

                if '%Y' not in dt_format:
                    dt = dt.replace(year=default_year)

                if valid_from <= dt <= valid_to:
                    dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))

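                    # skip the triples that overlap the date just matched
                    for _ in range(1, len(dt_format)):
                        next(triples, None)  # the default avoids StopIteration at the end of the text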
                break

            except ValueError:
                pass
            
    return dates

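There is still a bug, though, as you can see in ('2000-06-01', 212, 224, 'June 2000 he'): the end index is always taken from the third word of the triple, even when the matched format uses only two words. A better approach may be to do something with dateutil.parser.parse, as in https://stackoverflow.com/a/33051237/5125264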
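One possible way to trim the span is to take the end index from the last word that the matching format actually consumed. The sketch below is a variant, not the function above: the names find_dates_trimmed and DT_FORMATS are hypothetical, it walks the word list by index instead of using tee and zip, and it resolves default_year at call time. Walking by index also lets a two-word date at the very end of the text be tried.

```python
from datetime import datetime
import re

# Same format list as above, kept at module level (assumed name).
DT_FORMATS = [
    ['%d', '%m', '%Y'], ['%d', '%b', '%Y'], ['%d', '%B', '%Y'],
    ['%d', '%b'], ['%d', '%B'],
    ['%b', '%d'], ['%B', '%d'],
    ['%b', '%Y'], ['%B', '%Y'],
]

def find_dates_trimmed(text, valid_from=datetime(1920, 1, 1),
                       valid_to=datetime(2030, 1, 1), default_year=None):
    """Like find_dates, but each span covers only the words that were parsed."""
    if default_year is None:
        default_year = datetime.now().year  # evaluated per call
    words = list(re.finditer(r'\b\w+\b', text))
    dates = []
    i = 0
    while i < len(words):
        step = 1
        for dt_format in DT_FORMATS:
            parts = words[i:i + len(dt_format)]
            if len(parts) < len(dt_format):
                continue  # not enough words left for this format
            try:
                dt = datetime.strptime(' '.join(m.group() for m in parts),
                                       ' '.join(dt_format))
            except ValueError:
                continue  # this format does not fit, try the next one
            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)
            if valid_from <= dt <= valid_to:
                # end comes from the last word this format consumed
                start, end = parts[0].start(), parts[-1].end()
                dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                step = len(dt_format)  # jump past the words just used
            break  # stop at the first format that parses, as in the original
        i += step
    return dates

sample = 'by 12 December six people died and in June 2000 he told, by 5 July 2001'
print([found[3] for found in find_dates_trimmed(sample, default_year=2022)])
# ['12 December', 'June 2000', '5 July 2001']
```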

Rakshith N · Accepted Answer · 2022-01-27 07:58:59Z

Use this pattern: r'\d{,2}-[A-Za-z]{,9}-\d{,4}'

import re
re.match(r'\d{,2}-[A-Za-z]{,9}-\d{,4}', 'Your Date')

This matches dates in formats such as '14-Jun-2021' and '4-september-20'. Note that re.match only matches at the start of the string.


But it also matches 69-PaNcAkEs-321
– Jack Deeth, Commented over a year ago
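One way to reduce that kind of over-matching is to tighten the quantifiers and then let datetime.strptime reject anything that is not a real date. This is a sketch under assumptions: the names CANDIDATE and find_hyphenated_dates are hypothetical, and the pattern requires a four-digit year, so it does not cover '4-september-20'.

```python
from datetime import datetime
import re

# Assumed shape: 1-2 digit day, 3-9 letter month name, 4-digit year.
CANDIDATE = re.compile(r'\b\d{1,2}-[A-Za-z]{3,9}-\d{4}\b')

def find_hyphenated_dates(text):
    """Return datetimes for dd-Mon-yyyy style candidates that strptime accepts."""
    found = []
    for match in CANDIDATE.finditer(text):
        # try the abbreviated month name first, then the full name
        for fmt in ('%d-%b-%Y', '%d-%B-%Y'):
            try:
                found.append(datetime.strptime(match.group(), fmt))
                break
            except ValueError:
                continue  # wrong month name or impossible day for this format
    return found

text = 'due 14-Jun-2021, not 69-PaNcAkEs-321 or 31-Pancakes-2021, then 4-september-2020'
for dt in find_hyphenated_dates(text):
    print(dt.date())
# 2021-06-14
# 2020-09-04
```

Here the regex only proposes candidates; whether a candidate is a date is decided by strptime, which is why '31-Pancakes-2021' is dropped even though it fits the pattern.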
