
I am trying to write a regex to identify some dates.

The string I am working on is:
'these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000\
 these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012\
 these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave.'

The regex looks like this:

re.findall('(\
[\b, ]\
([1-9]|0[1-9]|[12][0-9]|3[01])\
[-/.\s+]\
(1[1-2]|0[1-9]|[1-9]|Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sept|September|Oct|October|Nov|November|Dec|December)\
(?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?\
[^\da-zA-Z])',String)

The output I get is:

[(' 11-2-', '11', '2', ''),
 (' 24-3-1695-', '24', '3', '1695'),
 (' 4-02-2011,', '4', '02', '2011'),
 (' 12/12/1990,', '12', '12', '1990'),
 (' 31-11-1690,', '31', '11', '1690'),
 (' 11 July 1990,', '11', 'July', '1990'),
 (' 7 Oct 2012 ', '7', 'Oct', '2012'),
 (' 12 December ', '12', 'December', ''),
 (' 5 July 2001,', '5', 'July', '2001')]

Problems:

  1. The first two results are wrong. They appear because of the optional group ((?:[-/.\s+](1[0-9]\d\d|20[0-2][0-5]))?) that I added to handle cases like "12 December". How do I get rid of them?

  2. The case "June 2000" is not handled by the expression.
    Can I add something to the expression that handles this case without affecting the others?

  • stackoverflow.com/questions/33143433/… Isn't this the same? Commented Oct 15, 2015 at 9:57
  • A little bit different. I could have commented on the previous post with the additional problems, but I thought it would be good to have a new question. Commented Oct 15, 2015 at 10:00

3 Answers


I would avoid trying to get a regular expression to parse your dates. As you have found, it starts out OK, but edge cases soon become hard to catch, for example invalid dates such as 31/09/2018.

A safer approach is to let Python's datetime decide if a date is valid or not. You can then easily specify valid date ranges and allowed date formats.
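As a minimal sketch of that idea, the snippet below lets datetime.strptime decide whether a string is a real calendar date. The helper name is_valid_date and the single '%d/%m/%Y' format are assumptions made for illustration; they are not part of the script further down.

```python
from datetime import datetime


def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for impossible dates as well as for text that does not fit fmt.
        return False


print(is_valid_date("12/12/1990"))  # a real date
print(is_valid_date("31/09/2018"))  # September has only 30 days
print(is_valid_date("32/11/2000"))  # no month has 32 days
```

Only the first call should report a valid date. A regular expression that accepts days 1 to 31 for every month cannot make that distinction on its own.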

This script uses the regular expression only to extract all words and number groups. It then takes three parts at a time and tries each of the allowed date formats on them. If datetime succeeds in parsing with a given format, the result is tested to ensure it falls within your allowed date range. If it does, the matching parts are skipped over to avoid a second match on a partial date.

If the date found does not contain a year, a default_year is assumed:

from itertools import tee
from datetime import datetime
import re


valid_from = datetime(1920, 1, 1)
valid_to = datetime(2030, 1, 1)
default_year = 2018

dt_formats = [
    ['%d', '%m', '%Y'], 
    ['%d', '%b', '%Y'],
    ['%d', '%B', '%Y'],
    ['%d', '%b'],
    ['%d', '%B'],
    ['%b', '%d'],
    ['%B', '%d'],
    ['%b', '%Y'],
    ['%B', '%Y'],
]

text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

t1, t2, t3 = tee(re.findall(r'\b\w+\b', text), 3)
next(t2, None)
next(t3, None)
next(t3, None)
triples = zip(t1, t2, t3)

for triple in triples:
    for dt_format in dt_formats:
        try:
            dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)

            if valid_from <= dt <= valid_to:
                print(dt.strftime('%d-%m-%Y'))

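                # Skip the windows that start inside the date just matched;
                # the None default avoids StopIteration at the end of the text.
                for _ in range(1, len(dt_format)):
                    next(triples, None)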
            break

        except ValueError:
            pass

For the text you have given, this would display:

04-02-2011
12-12-1990
11-07-1990
07-10-2012
12-12-2018
01-06-2000
05-07-2001
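
The three-at-a-time windowing used in the script can be looked at in isolation. This is only an illustrative sketch: the helper name sliding_triples and the short token list are assumptions, not part of the script above.

```python
from itertools import tee


def sliding_triples(tokens):
    """Return an iterator of consecutive, overlapping triples of tokens."""
    t1, t2, t3 = tee(tokens, 3)
    next(t2, None)  # t2 starts one token ahead of t1
    next(t3, None)  # t3 starts two tokens ahead of t1
    next(t3, None)
    # zip stops as soon as the shortest iterator (t3) is exhausted.
    return zip(t1, t2, t3)


tokens = ["by", "5", "July", "2001", "he"]
for triple in sliding_triples(tokens):
    print(triple)
```

Each window starts one token later than the previous one, which is why the script skips ahead after a successful match. Because zip stops when the third iterator runs out, no window starts at the last two tokens, so a two-part date at the very end of the text (for example a text ending in "June 2000") would not be tried.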

This is a great answer. I have an edited version in a new answer that returns the original string and index of each match: stackoverflow.com/a/71321576/5125264

@Martin Evans's answer was great, but I also wanted to return the location of each match within the string:

>>> text = """these are just rubbish 11-2-2222, 24-3-1695-194475 12-13-1111, 32/11/2000
these are dates 4-02-2011, 12/12/1990, 31-11-1690,  11 July 1990, 7 Oct 2012
these are actual deal- by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."""

>>> find_dates(text)

[('2011-02-04', 90, 99, '4-02-2011'),
 ('1990-12-12', 101, 111, '12/12/1990'),
 ('1990-07-11', 126, 138, '11 July 1990'),
 ('2012-10-07', 140, 150, '7 Oct 2012'),
 ('2022-12-12', 177, 192, '12 December six'),
 ('2000-06-01', 212, 224, 'June 2000 he'),
 ('2001-07-05', 234, 245, '5 July 2001')]

I have wrapped it up in a function that uses finditer instead of findall:

from itertools import tee
from datetime import datetime
import re

def find_dates(
    text,
    valid_from = datetime(1920, 1, 1),
    valid_to = datetime(2030, 1, 1),
    default_year = datetime.now().year,
    dt_formats = [
        ['%d', '%m', '%Y'], 
        ['%d', '%b', '%Y'],
        ['%d', '%B', '%Y'],
        ['%d', '%b'],
        ['%d', '%B'],
        ['%b', '%d'],
        ['%B', '%d'],
        ['%b', '%Y'],
        ['%B', '%Y'],
    ],
    ):
    # store your matches here
    dates = []
        
    t1, t2, t3 = tee(list(re.finditer(r'\b\w+\b', text)), 3)
    next(t2, None)
    next(t3, None)
    next(t3, None)
    triples = zip(t1, t2, t3)

    for triple in triples:
        # get start and end index of each triple
        start = triple[0].start()
        end = triple[-1].end()

        # convert matches to a list of three strings
        triple = [text[t.start():t.end()] for t in triple]

        for dt_format in dt_formats:
            try:
                dt = datetime.strptime(' '.join(triple[:len(dt_format)]), ' '.join(dt_format))

                if '%Y' not in dt_format:
                    dt = dt.replace(year=default_year)

                if valid_from <= dt <= valid_to:
                    dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))

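                    # Skip the windows that start inside the date just matched;
                    # the None default avoids StopIteration at the end of the text.
                    for _ in range(1, len(dt_format)):
                        next(triples, None)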
                break

            except ValueError:
                pass
            
    return dates

There is a bug, though, as you can see in ('2000-06-01', 212, 224, 'June 2000 he'): the reported span always covers all three tokens, even when the matching format uses only two. A better approach may be to do something with dateutil.parser.parse, as in https://stackoverflow.com/a/33051237/5125264
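
One possible way to deal with the over-long spans seen in 'June 2000 he' and '12 December six' is to end each span at the last token the matching format actually consumed. The sketch below is an assumption-laden variant, not the code above: the name find_dates_trimmed is hypothetical, it walks the token list by index instead of using tee, and it resolves the default year at call time. It also assumes an English locale for month names.

```python
import re
from datetime import datetime

# Same format list as in the function above.
DT_FORMATS = [
    ['%d', '%m', '%Y'], ['%d', '%b', '%Y'], ['%d', '%B', '%Y'],
    ['%d', '%b'], ['%d', '%B'], ['%b', '%d'], ['%B', '%d'],
    ['%b', '%Y'], ['%B', '%Y'],
]


def find_dates_trimmed(text, valid_from=datetime(1920, 1, 1),
                       valid_to=datetime(2030, 1, 1), default_year=None):
    """Hypothetical variant: each span covers only the tokens the format used."""
    if default_year is None:
        default_year = datetime.now().year  # evaluated on each call
    tokens = list(re.finditer(r'\b\w+\b', text))
    dates = []
    i = 0
    while i < len(tokens):
        step = 1
        for dt_format in DT_FORMATS:
            used = tokens[i:i + len(dt_format)]
            if len(used) < len(dt_format):
                continue  # not enough tokens left for this format
            candidate = ' '.join(m.group() for m in used)
            try:
                dt = datetime.strptime(candidate, ' '.join(dt_format))
            except ValueError:
                continue  # this format does not fit, try the next one
            if '%Y' not in dt_format:
                dt = dt.replace(year=default_year)
            if valid_from <= dt <= valid_to:
                # End the span at the last token this format consumed.
                start, end = used[0].start(), used[-1].end()
                dates.append((dt.strftime('%Y-%m-%d'), start, end, text[start:end]))
                step = len(dt_format)  # jump past the matched tokens
            break  # first format that parses wins, as in the original
        i += step
    return dates


sample = "by 12 December six people died and in June 2000 he told, by 5 July 2001, he will leave."
for found in find_dates_trimmed(sample, default_year=2022):
    print(found)
```

With this change the matched substrings for the sample should be '12 December', 'June 2000' and '5 July 2001', without the trailing word. Walking the list by index also lets a two-token date at the very end of the text be considered, which the zip-based windows do not.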


Use this: r'\d{,2}-[A-Za-z]{,9}-\d{,4}'

import re
re.match(r'\d{,2}\-[A-Za-z]{,9}\-\d{,4}','Your Date')

This can match dates in formats such as '14-Jun-2021' and '4-september-20'.


But it also matches 69-PaNcAkEs-321
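
The point raised in this comment can be checked directly, and one hedged way around it is to use the pattern only as a coarse filter and let datetime.strptime make the final decision. The names LOOSE and parse_day_month_year, and the choice of formats, are assumptions for illustration. Month names are assumed to be English.

```python
import re
from datetime import datetime

# The pattern from the answer; {,2} also allows zero characters.
LOOSE = re.compile(r'\d{,2}-[A-Za-z]{,9}-\d{,4}')


def parse_day_month_year(candidate):
    """Return a datetime for a real day-month-year string, else None."""
    if not LOOSE.fullmatch(candidate):
        return None
    # Four-digit years are tried before two-digit years.
    for fmt in ('%d-%b-%Y', '%d-%B-%Y', '%d-%b-%y', '%d-%B-%y'):
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            pass  # this format does not fit, try the next one
    return None


for s in ['14-Jun-2021', '4-september-20', '69-PaNcAkEs-321', '--']:
    print(repr(s), bool(LOOSE.fullmatch(s)), parse_day_month_year(s))
```

All four strings satisfy the pattern, including the bare '--', but only the first two should come back from strptime as dates.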
