4

I would like to extract dates that is only in the specific format "Month day, year".If it is in any other format, I will skip it. I used the below regex function but only the month is being displayed not the day and year. Can some one point out what is wrong

>>> date_pattern="(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May?|June?
|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?\
s+\d{2},\s+\d{4})"

s = "the date is November 15, 2009"
print(re.findall(date_pattern,s))

Output expected : November 15, 2009

Output of the above code : "November"

2
  • 1
    Does Python offer any API which can parse date strings? That might be an easier option than using regex. Commented Dec 26, 2018 at 7:23
  • The last closing parenthesis is misplaced, move it from the end after Dec(?:ember)? --> Dec(?:ember)?) Commented Dec 26, 2018 at 10:34

3 Answers 3

2

Or use re.search with group(0):

>>> date_pattern='(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'
>>> s = "the date is November 15, 2009"
>>> re.search(date_pattern,s).group(0)
'November 15, 2009'
>>> 

Visit the regex101 i created for it.

Sign up to request clarification or add additional context in comments.

1 Comment

Why are you using capture groups? It is very less efficient than non capture groups
1

You can change the regex into:

(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May?|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{2},\s+\d{4})

Explanations:

Your current regex accepts the pattern detailed here:

Demo: https://regex101.com/r/0teiAB/3

If you do not add the parenthesis the regex will accept either one of the months defined or Dec(?:ember)?)\s+\d{2},\s+\d{4}) - Dec/December followed by the day and year

Demo: https://regex101.com/r/0teiAB/1

Additional notes:

  • for the days, \d{2} will also accept 33,99,00 that are not proper calendar days!!! -> You can replace this part by (?:0?[1-9]|[1-2][0-9]|30|31) to limit the range as shown in:

Demo: https://regex101.com/r/NTIyf7/1

  • This is not enough if you want to limit the maximum day per month (as there is no 31 February for example), if you want to go to that level of precision you will need to change the regex and use a similar expression as what I have introduced hereover to limit each month.

  • Last but not least, if you go even further and want to defined leap year with February 29. Regex might not be the proper tool for this and you will have to use a Date/Calendar to verify if your date is valid or not.

Comments

1

You missed out the closing parenthesis in your regex pattern. It should come after December to complete the non-capturing group.

(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June|July|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{2},\s+\d{4}

Edit: Actually, it's the positioning of your parenthesis that is incorrect. Instead of being at the end of the pattern, it should come after the December alternative because that is your non-capturing group for month names.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.