8

I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.

Using Python 3.2.2, I need to match "Month, Day, Year", where the month is a name and the day is two digits that never exceed the length of that month: 30 or 31 as appropriate, 28 for February, and 29 for February in a leap year. (Basically a real, valid date.)

This is what I have so far:

pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)

I'm still not too familiar with regex syntax so I may have characters in there that are unnecessary (the [,][ ] for the comma and spaces feels like the wrong way to go about it), but when I try to match "January, 26, 1991" in my sample text file, the printing out of the items in "matches" is ('January', '26', '1991', '19').

Why does the extra '19' appear at the end?

Also, what things could I add to or change in my regex that would allow me to validate dates properly? My plan right now is to accept nearly all dates and weed them out later using high level constructs by comparing the day grouping with the month and year grouping to see if the day should be <31,30,29,28

Any help would be much appreciated including constructive criticism on how I am going about designing my regex.

3
  • 5
    Why do you need to use a regular expression? (Now you have two problems...) Commented Apr 25, 2012 at 3:33
  • I believe the quote @Wooble is referring to is 'Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.' and I'm inclined to agree. I recommend extracting a string and 2 numbers (perhaps with a very simple regex, but more likely just by splitting the string on commas) and then using datetime to test whether the date is valid. Commented Apr 25, 2012 at 3:36
  • Thanks for the advice, but this is a homework assignment where I'm required to make an expression to match dates. Commented Apr 25, 2012 at 3:46

6 Answers

6

Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):

import re

years = r'((?:19|20)\d\d)'
# %%s survives this first substitution as %s, leaving slots for month and day
pattern = r'(%%s) +(%%s), *%s' % years

thirties = pattern % (
     "September|April|June|November",
     r'0?[1-9]|[12]\d|30')

thirtyones = pattern % (
     "January|March|May|July|August|October|December",
     r'0?[1-9]|[12]\d|3[01]')

# two-digit multiples of four: 04, 08, ..., 96
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))

feb = r'(February) +(?:%s|%s)' % (
     r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year
     r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)  # 29 leap years only

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)

Then we have:

>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False

And what is this glorious regexp, you may ask?

>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:(February) +(?:(?:(0?[1-9]|1\d|2[0-8])), *((?:19|20)\d\d)|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))

(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
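The leap-year part is the piece most worth double-checking. Here is a minimal sketch that compares just that fragment against the standard library's calendar.isleap for 1900 through 2099. The names leap_year and mismatches are mine, not part of the code above, and the \Z anchor is added only so that the fragment can be tested on its own.

```python
import calendar
import re

# Two-digit multiples of four, "04" through "96", built the same way as `fours` above.
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))

# Leap years in 1900-2099: 19xx or 20xx where xx is a nonzero multiple of four, plus 2000.
# \Z anchors the end so that a longer string such as "20041" is rejected.
leap_year = re.compile(r'(?:(?:19|20)%s|2000)\Z' % fours)

# Collect every year where the regex and calendar.isleap disagree.
mismatches = [
    year for year in range(1900, 2100)
    if bool(leap_year.match(str(year))) != calendar.isleap(year)
]
print(mismatches)  # an empty list means the two agree on every year
```

The year 1900 is the interesting case: it is divisible by four but is not a leap year, and the pattern gets it right because "00" is not in the list of multiples and only 2000 is added back explicitly.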


6 Comments

Thanks for that! I'm still going through the regexp you gave me slowly to dissect and understand the individual components, but I see that the best approach would have been to group together the months that differ only by name, separate February out from the rest, and match it in another part of the expression.
I said there was "no easy way" to make a regular expression check the month against the date. So you showed how to do it... the hard way... you, sir, are insane, but it's the good kind of insanity. +1! P.S. I especially like the leap year checker.
pattern = r'(%s) +(%s), *%s' % years showing error for me .. pattern = '(%s) +(%s), *%s' % years TypeError: not enough arguments for format string
@monkey Yeah, not sure how that would have ever worked...editing to fix what I think it was intended to be.
@Dougal there is still unbalanced parenthesis in this expression .Please update.. feb = r'(February) +(?:%s|%s)' % ( r'(?:(0?[1-9]|1\d|2[0-8]), *%s' % years, # 1-28 any year r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours) # 29 leap years only
2

Here are some quick thoughts:

Everyone who is suggesting you use something other than regular expression is giving you very good advice. On the other hand, it's always a good time to learn more about regular expression syntax...

An expression in square brackets -- [...] -- matches any single character inside those brackets. So writing [,], which only contains a single character, is exactly identical to writing a simple unadorned comma: ,.

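The .findall method returns a list of matches; when the pattern contains groups, each match is reported as a tuple of those groups. A group is identified by parentheses -- (...) -- and groups are numbered from left to right by their opening parenthesis, outermost first. The final part of your expression looks like this: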

((19|20)[0-9][0-9])

The outermost parentheses match the entire year, and the inside parentheses match the first two digits. Hence, for a date like "1989", the final two match groups are going to be 1989 and 19.
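To see the numbering in action, here is a small sketch using only the year fragment. The sample string and the variable names year and m are made up for illustration.

```python
import re

year = re.compile(r'((19|20)[0-9][0-9])')

m = year.search('born in 1989')
print(m.group(0))   # the whole match
print(m.group(1))   # outer parentheses: the full year
print(m.group(2))   # inner parentheses: the century digits
print(year.findall('born in 1989'))  # one tuple per match, one item per group
```

Group 0 is always the whole match. findall leaves it out whenever the pattern contains groups of its own, which is why only the numbered groups show up in the tuple.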

2 Comments

Tell him how to fix it, about non-capturing groups. (?:19|20)
Nah, I'll let you do it. I'm not really sure it needs "fixing", because there's nothing "broken". I just wanted to explain the behavior.
2

A group is identified by parentheses (...) and they count from left to right, outermost first. Your final expression looks like this:

((19|20)[0-9][0-9])

The outermost parentheses match the entire year, and the inside parentheses match the first two digits. Hence, for a date like "1989", the two match groups are going to be 1989 and 19. Since you don't want the inner group (first two digits), you should use a non-capturing group instead. Non-capturing groups start with ?:, used like this: (?:a|b|c)
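Applied to the pattern from the question, the change might look like the sketch below. The sample sentence is invented, and I also replaced [,][ ] with a plain comma and space, which matches the same text.

```python
import re

pattern = re.compile(
    r'(January|February|March|April|May|June|July|August|'
    r'September|October|November|December)'
    r', (0[1-9]|[12][0-9]|3[01])'
    r', ((?:19|20)[0-9][0-9])'   # (?:...) groups the alternation without capturing it
)

sample_text = 'Born on January, 26, 1991 in a small town.'
matches = pattern.findall(sample_text)
print(matches)  # each tuple now holds month, day and year only
```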

By the way, there is some good documentation on how to use regular expressions here.

Comments

1

Python has a date parser as part of the time module:

import time
time.strptime("December 31, 2012", "%B %d, %Y")

The above is all you need if the date format is always the same.

So, in real production code, I would write a regular expression that parses the date, and then use the results from the regular expression to build a date string that is always the same format.
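As a rough sketch of that two-step idea: a loose regex finds candidates in a larger text, and strptime decides whether each one is a real date. The names candidate and find_valid_dates are hypothetical. I use datetime.strptime here, which takes the same format codes as time.strptime, and I am assuming an English locale so that %B recognises the month names.

```python
import re
from datetime import datetime

# Loose pattern: a capitalised word, a 1-2 digit day, a 4-digit year.
# It deliberately accepts impossible dates; strptime rejects them below.
candidate = re.compile(r'\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b')

def find_valid_dates(text):
    """Return a datetime.date for every real date found in text."""
    found = []
    for m in candidate.finditer(text):
        try:
            parsed = datetime.strptime(m.group(0), '%B %d, %Y')
        except ValueError:
            continue  # not a month name, or a day that month does not have
        found.append(parsed.date())
    return found

text = 'Due February 29, 2012, not February 30, 2012 or April 31, 1908.'
print(find_valid_dates(text))
```

Only February 29, 2012 survives: 2012 is a leap year, while February 30 and April 31 make strptime raise ValueError.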

Now that you said, in the comments, that this is homework, I'll post another answer with tips on regular expressions.

2 Comments

I'm required to use regular expressions as this is a homework assignment I'm struggling with
This creates a date object if you have a string that is just the date, but it doesn't work like a regex to match dates in a string or larger text.
1

You have this regular expression:

pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"

One feature of regular expressions is a "character class". Characters in square brackets make a character class. Thus [,] is a character class matching a single character, , (a comma). You might as well just put the comma.

Perhaps you wanted to make the comma optional? You can do that by putting a question mark after it: ,?

Anything you put into parentheses makes a "match group". I think the mysterious extra "19" came from a match group you didn't mean to have. You can make a non-capturing group, which groups without producing a match group, using this syntax: (?:...)

So, for example:

r'(?:red|blue) socks'

This would match "red socks" or "blue socks" but does not make a match group. If you then put that inside plain parentheses:

r'((?:red|blue) socks)'

That would make a match group, whose value would be "red socks" or "blue socks"
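Here is a quick sketch of how those variations behave with findall; the sample text is made up.

```python
import re

text = 'red socks, blue socks, green socks'

# No capturing group: findall returns each whole match.
print(re.findall(r'(?:red|blue) socks', text))

# One capturing group: findall returns only what that group matched.
print(re.findall(r'(red|blue) socks', text))

# Capturing group around a non-capturing one: only the outer group is reported.
print(re.findall(r'((?:red|blue) socks)', text))
```

The first and third calls return 'red socks' and 'blue socks', while the second returns just 'red' and 'blue'.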

I think if you apply these comments to your regular expression, it will work. It is mostly correct now.

As for validating the date against the month, that is way beyond the scope of a regular expression. Your pattern will match "February 31" and there is no easy way to fix that.

Comments

0

First of all, as others have said, I don't think regular expressions are the best choice for this problem, but to answer your question: by using parentheses you are dissecting the string into several subgroups, and when you call findall, you get a list in which each match is a tuple containing every group you created.

((19|20)[0-9][0-9])

Here is your problem: the regex captures both the entire year and its first two digits, 19 or 20, depending on which the year starts with.

1 Comment

Your parentheses are unbalanced.
