(?:\d{1,2}[\-\/])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.)\s,]*)+(?:\d{2,4})(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.),]*)

I am trying to extract dates from text. The dates can appear in any of the following formats:

  • January 1, 2020
  • January 01, 2020
  • JANUARY 1, 2020
  • JANUARY 01, 2020
  • Jan. 1, 2020
  • Jan. 01, 2020
  • JAN. 1, 2020
  • JAN. 01, 2020
  • 2020 January 1
  • 2020 January 01
  • 2020 Jan. 1
  • 2020 Jan. 01
  • 2020 JAN. 1
  • 2020 JAN. 01
  • 01/01/2020
  • 2020/01/01
  • 01.01.2020
  • 2020.01.01
  • 01-01-2020
  • 2020-01-01

Here's a sample. The problem appears when the pattern tries to extract dates in these formats: 2020 JAN. 1, 2020 JAN. 01, 2020 Jan. 01, and 2020-01-01.

  • I wouldn't do this with one regex, but with one regex per sample format. Commented Oct 5, 2020 at 11:36
  • The text comes from documents and is extracted using Tesseract. The dates can be in any of the formats listed above. How would I do it your way? Thanks. Commented Oct 5, 2020 at 11:44
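One possible reading of the "one regex per format" suggestion is sketched below. It is not the commenter's code: the names `PATTERNS` and `find_dates`, and the split into four layouts, are assumptions made for illustration. Each layout gets its own small pattern, so a failure on one format can be fixed without touching the others.

```python
import re

# Building blocks shared by the per-format patterns (hypothetical names).
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAY = r"(?:0?[1-9]|[12]\d|3[01])"
MONTH_NUM = r"(?:0?[1-9]|1[0-2])"
YEAR = r"(?:19|20)\d\d"

# One compiled pattern per layout; (?!\d) stops a match from ending mid-number.
PATTERNS = {
    # January 1, 2020 / JAN. 01, 2020
    "month_day_year": re.compile(rf"\b{MONTH}\.?\s*{DAY}(?!\d),?\s*{YEAR}(?!\d)", re.I),
    # 2020 January 1 / 2020 JAN. 01
    "year_month_day": re.compile(rf"(?<!\d){YEAR}\s*{MONTH}\.?\s*{DAY}(?!\d)", re.I),
    # 01/01/2020, 01.01.2020, 01-01-2020 (\1 forces the same separator twice)
    "numeric_mdy": re.compile(rf"(?<!\d){MONTH_NUM}([-/.]){DAY}\1{YEAR}(?!\d)"),
    # 2020/01/01, 2020.01.01, 2020-01-01
    "numeric_ymd": re.compile(rf"(?<!\d){YEAR}([-/.]){MONTH_NUM}\1{DAY}(?!\d)"),
}

def find_dates(text):
    """Return (format_name, matched_text) pairs in order of appearance."""
    hits = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), name, m.group(0)))
    hits.sort()
    return [(name, matched) for _, name, matched in hits]
```

Calling `find_dates` on OCR output returns each date together with the name of the layout that matched it, which also tells you which format string to use if the value is parsed later.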

1 Answer

You can use

pattern = r"""(?ix)
  \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
  (?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) (?!\d) # 2000 Jan 01; the lookahead keeps two-digit days such as 15 whole
|
  (?<!\d)
    (?:
      (?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
      |
      (?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
    )
  (?!\d)"""

See the regex demo

The i flag enables case-insensitive matching, and x enables verbose mode (re.VERBOSE), in which unescaped whitespace and # comments inside the pattern are ignored.
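To make the effect of the two flags concrete, here is a small sketch. It uses a deliberately shortened pattern (two month names only), and `BODY`, `inline`, and `explicit` are illustrative names, not part of the answer. It shows that the inline `(?ix)` prefix behaves like passing `re.IGNORECASE | re.VERBOSE`, and that a literal space in a verbose pattern is dropped unless written as `\s` or placed inside a character class.

```python
import re

# Shortened, illustrative pattern body: only two month names.
BODY = r"""
    \b(?:Jan(?:uary)?|Feb(?:ruary)?)   # month name
    [\s.]*                             # optional dots and whitespace
    (?:0?[1-9]|[12][0-9]|3[01])        # day of month
    (?!\d)                             # do not stop in the middle of a number
"""

# Inline flags at the very start of the pattern ...
inline = re.compile("(?ix)" + BODY)
# ... are equivalent to passing the flags to re.compile.
explicit = re.compile(BODY, re.IGNORECASE | re.VERBOSE)

sample = "JAN. 01 and february 15"
inline_hits = inline.findall(sample)
explicit_hits = explicit.findall(sample)

# In verbose mode the bare space in "Jan 1" is ignored, so the pattern
# means "Jan1" and does not match the text "Jan 1".
space_ignored = re.search(r"(?x)Jan 1", "Jan 1") is None
# Writing the space as \s restores the intended meaning.
space_matched = re.search(r"(?x)Jan\s1", "Jan 1") is not None
```

This is why the answer's pattern spells every separator as `[\s.]*` or `[\s,.]*`: the spaces that lay out the pattern are formatting only.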


2 Comments

It matches FEB-200 and 02 February 2
@MattMateo Better now? Please adjust the pattern as you see fit, we do not see all your data.
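One way to guard against false positives like the ones reported in the comment is to treat regex matches as candidates only and accept a candidate only if it parses as a real calendar date. The sketch below is an assumption-laden illustration, not part of the answer: `FORMATS` and `parse_candidate` are hypothetical names, numeric dates are assumed to be month-first as in the question, and `%b`/`%B` are assumed to run under an English (or C) locale.

```python
import re
from datetime import date, datetime

# Layouts tried in order after separators are normalized to single spaces.
# Assumption: numeric dates are month-first (MM dd yyyy) or year-first.
FORMATS = (
    "%B %d %Y",  # January 1 2020
    "%b %d %Y",  # Jan 1 2020
    "%Y %B %d",  # 2020 January 1
    "%Y %b %d",  # 2020 Jan 1
    "%m %d %Y",  # 01 01 2020
    "%Y %m %d",  # 2020 01 01
)

def parse_candidate(candidate):
    """Return a datetime.date for a regex match, or None if it is not a date."""
    # Turn '-', '/', '.' and ',' into spaces, then collapse repeated whitespace.
    cleaned = " ".join(re.sub(r"[-/.,]", " ", candidate).split())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # wrong layout or impossible date; try the next format
    return None
```

With this filter, strings such as FEB-200 or 02 February 2 are rejected even if a permissive pattern matches them, and so are impossible dates such as 02/30/2020.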
