5

I have the following string:

 dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"

I want to extract all the dates it contains using a regex. As a first attempt I wrote the following:

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'

re.findall(regEx, dateEntries)

I expected this to work, but it only returns a subset of the dates:

A = ['Mar 20, 2009',
 'March 20, 2009',
 'Mar. 20, 2009',
 'Mar 20 2009',
 '20 Mar 2009',
 '20 March 2009',
 '2 Mar. 2009',
 '20 March, 2009',
 'Mar 20th, 2009',
 'Mar 21st, 2009',
 'Mar 22nd, 2009',
 'Feb 2009',
 'Sep 2009',
 'Oct 2010']

I don't understand why it is not returning these dates:

B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']

I created regEx by extending r'(?:\d{1,2}[-\s\/])?(?:\d{1,2}[-\/\s])?(?:\d{2,4})', which works well for set B, but regEx does not produce A + B.

Can anyone help me write a regex that extracts all the dates in dateEntries?

NOTE: I want to solve this using regex only.

6
  • Why do you want to use a regex? For your example you could just use dateEntries.split(";"). Commented Jul 1, 2018 at 10:49
  • Because my real data is a text file that can contain dates in the set A formats, along with other data besides dates. Commented Jul 1, 2018 at 10:52
  • FYI [] matches single characters and character ranges, not strings like th or st. You should replace with () Commented Jul 1, 2018 at 11:01
  • Your second non-capturing group should probably be optional Commented Jul 1, 2018 at 11:05
  • Try (?:[\s]?\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,./]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4}) here Commented Jul 1, 2018 at 11:10
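The point made in the comments about [] can be checked directly. The following is a minimal sketch, not taken from any of the answers; the variable names char_class and group are illustrative. Inside a character class the | is a literal character, so [th|st|nd|rd] matches exactly one character from the set t, h, |, s, n, d, r, whereas a group matches one of the two-letter suffixes.

```python
import re

# Character class: matches ONE character out of t, h, |, s, n, d, r.
char_class = re.compile(r'\d{1,2}[th|st|nd|rd]')
# Non-capturing group: matches one of the suffixes th, st, nd, rd.
group = re.compile(r'\d{1,2}(?:th|st|nd|rd)')

class_matches_20th = bool(char_class.fullmatch('20th'))  # False: only one character is consumed
class_matches_pipe = bool(char_class.fullmatch('20|'))   # True: '|' is a literal member of the set
group_matches_20th = bool(group.fullmatch('20th'))       # True
group_matches_pipe = bool(group.fullmatch('20|'))        # False

print(class_matches_20th, class_matches_pipe, group_matches_20th, group_matches_pipe)
```

The patterns in the question only appear to handle "20th" because the class is followed by *, which lets it consume "t" and "h" one character at a time.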

4 Answers 4

6

You are just missing a single ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to mark it as optional. I also added a + after the last two groups to make sure the regex doesn't split dates like "20 March 2009" into two separate dates.
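To see the effect of the ? in isolation, here is a small sketch with a deliberately simplified pattern (the sample text and the names required and optional are made up for illustration, not part of the original regex):

```python
import re

text = "Mar 2009; 2010"

# Month group is mandatory: a bare year cannot match.
required = re.findall(r'(?:Mar )\d{4}', text)
# Month group is optional: the bare year matches as well.
optional = re.findall(r'(?:Mar )?\d{4}', text)

print(required)  # ['Mar 2009']
print(optional)  # ['Mar 2009', '2010']
```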

The full code:

import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)

If a date is preceded by whitespace, the matched string will also start with that whitespace. If you go on to use the date strings, you can remove it, for example with the .strip() method.
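A minimal sketch of that clean-up step, using a hypothetical pattern that is allowed to start with whitespace (it is not the regex above, just a short stand-in):

```python
import re

# Hypothetical stand-in pattern: optional leading whitespace, then a 4-digit year.
pattern = r'\s*\d{4}'
text = "2009; 2010"

raw = re.findall(pattern, text)
cleaned = [match.strip() for match in raw]  # drop leading/trailing whitespace

print(raw)      # ['2009', ' 2010']
print(cleaned)  # ['2009', '2010']
```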


3 Comments

Thanks @Nils. I'm also able to extract using this regex r'(?:\d{1,2}[-/th|st|nd|rd\s.])?(?:(?:Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|August|Sep|September|Oct|October|Nov|November|Dec|December)[\s,.]*)?(?:(?:\d{1,2})[-/th|st|nd|rd\s,.]*)?(?:\d{2,4})'
How can I capture these formats? 2020 January 1; 2020 January 01; 2020 Jan. 1; 2020 Jan. 01; 2020 JAN. 1; 2020 JAN. 01
[-/th|st|nd|rd)\s] doesn't do what you think it does. I don't see how this can possibly work.
3

Your regex pattern is very hard to read. Build it from simple, named building blocks instead; that makes the code much more readable:

import re
import calendar

full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)

sep = r'[.,]?\s+'               # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'

r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
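The pattern above only finds the textual dates (set A). One possible way to cover the numeric formats (set B) with the same building-block approach is sketched below. The names MONTH, DAY, YEAR4, FORMATS and DATE_RE are hypothetical and not part of the answer, and the sketch assumes years in textual dates have four digits and that dates are delimited by word boundaries.

```python
import re

# Hypothetical building blocks (illustrative names, not from the answer above).
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'
DAY = r'\d{1,2}(?:st|nd|rd|th)?'
YEAR4 = r'\d{4}'

# Alternatives are tried in order, so the more specific formats come first.
FORMATS = [
    rf'{DAY}\s+{MONTH},?\s+{YEAR4}',    # 20 March, 2009
    rf'{MONTH}\s+{DAY},?\s+{YEAR4}',    # Mar 20th, 2009
    rf'{MONTH}\s+{YEAR4}',              # Feb 2009
    r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}',   # 04-20-2009, 4/3/09
    rf'\d{{1,2}}/{YEAR4}',              # 6/2008
    YEAR4,                              # 2009
]
DATE_RE = re.compile(r'\b(?:' + '|'.join(FORMATS) + r')\b')

date_entries = ("04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; "
                "March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; "
                "20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; "
                "Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; "
                "6/2008; 12/2009; 2009; 2010")

found = DATE_RE.findall(date_entries)
print(len(found))  # 22
```

Because the year-only alternative matches any standalone four-digit number, this sketch will also pick up non-date numbers in free text; tightening YEAR4 (for example to a plausible year range) is left as a design choice.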


0

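Try this regex. Note that it is anchored with ^ and $, so it matches one date per string (or one per line with re.MULTILINE) rather than finding dates inside a longer text: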

^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$


-1

You can try the following regex

(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+
