While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a bit more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?
5 Answers
The re module doesn't support repeated captures (the third-party regex module does):
>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to simple and readable code; see, e.g., the code in @Li-aung Yip's answer.
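If installing regex is not an option, the repeated captures can be approximated with the standard re module by capturing the whole repeated run once and re-scanning it. This is only a sketch: repeated_parts and ADDRESS_RE are hypothetical names, and the pattern is a simplified variant of the one above.

```python
import re

# Hypothetical helper (not part of the re module): approximates what
# regex's m.captures() returns for the repeated (\.\w+) group.
ADDRESS_RE = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def repeated_parts(address):
    m = ADDRESS_RE.match(address)
    if m is None:
        return None
    # Group 3 holds the whole repeated run, e.g. '.something.edu.tr';
    # findall then splits it back into the individual repetitions.
    return re.findall(r'\.\w+', m.group(3))

print(repeated_parts('yasar@webmail.something.edu.tr'))
# ['.something', '.edu', '.tr']
```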
5 Comments
Do \1, \2, \3 etc. change depending on how many times you matched (\.\w+)?
\1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as a replacement pattern and call m.captures() in it.
The meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr.
m.groups() above explicitly shows what \4 is.
\4 is m.group(4), whatever it is.

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
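A small side-by-side sketch of the two patterns, using the address from the question (the variable names are only for illustration):

```python
import re

address = 'yasar@webmail.something.edu.tr'

# A repeated capturing group keeps only its last iteration.
last_only = re.search(r'@\w+(\.\w+)+', address)
print(last_only.group(1))   # .tr

# A non-capturing repeat wrapped in one capturing group keeps the whole run.
whole_run = re.search(r'@\w+((?:\.\w+)+)', address)
print(whole_run.group(1))   # .something.edu.tr
```

Note that the second form returns the run as a single string, so the individual parts still have to be split out afterwards.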
4 Comments
re.sub(ur'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)
(?: ...) makes a non-capturing group (docs.python.org/2/library/re.html#regular-expression-syntax). Adding that fixes that problem.

This will work:
>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "[email protected]"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
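The match-then-split approach can also be wrapped in a small function that copes with input that does not match. This is a sketch only: domain_labels is a hypothetical name, and the local part "user" in the sample address is made up for illustration.

```python
import re

def domain_labels(address):
    """Return the dot-separated labels after '@', or None if nothing matches."""
    m = re.match(r"[\w.]+@(.+)", address)
    if m is None:
        return None
    return m.group(1).split('.')

# 'user' is a made-up local part used only for illustration.
print(domain_labels("user@galactica.caprica.fleet.mil"))
# ['galactica', 'caprica', 'fleet', 'mil']
print(domain_labels("no-at-sign"))
# None
```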
Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this will break on. See this question for a detailed treatment of email address regexes.
A couple of years late, but I'll write it here in case someone needs it.
You can wrap the repeated group in an outer group like so (simplified):
import re
domains_re = re.compile(r"\w+\@[\w\d]+(?P<domains>(?:\.[\w\d]+)+)")
print(domains_re.match("yasar@webmail.something.edu.tr").group("domains"))
# .something.edu.tr
(?:\.[\w\d]+)+ is the repeated group that matches the dot-separated domain parts (subdomains and the top-level domain). The ?: makes it non-capturing, so we wrap it in an outer group named "domains", (?P<domains>...), which captures the whole repeated run.
((?: ...) are non-capturing parentheses.)