
While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only includes .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

  • Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is, however, supported by the third-party regex module. Commented Oct 15, 2018 at 0:27
  • @ToddOwen But isn't this now possible in 2.7? I don't know when it became possible, but the answer from stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module. Commented Nov 25, 2018 at 0:22
  • @MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and re will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses.) Commented Nov 28, 2018 at 21:36
  • @ToddOwen Got it, thank you, that is a helpful clarification! Commented Nov 29, 2018 at 1:03

5 Answers

47

The standard-library re module doesn't support repeated captures; the third-party regex module does:

>>> import regex  # third-party module, not part of the standard library
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I'd go with splitting the repeated subpatterns later. It leads to simple and readable code; see, for example, the code in @Li-aung Yip's answer.
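
As a rough illustration of the "match first, split later" approach using only the standard re module, the sketch below captures the whole dotted tail in one group and then re-scans it. The helper name split_email and the exact pattern are assumptions made for this example, not code from any of the answers.

```python
import re

# Hypothetical pattern for illustration: local part, first domain label,
# then the whole run of '.label' pieces captured as a single group.
EMAIL_RE = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def split_email(address):
    """Return (local part, first label, list of dotted suffixes), or None."""
    m = EMAIL_RE.fullmatch(address)
    if m is None:
        return None
    local, host, tail = m.groups()
    # tail looks like '.something.edu.tr'; scan it again for each '.label'.
    suffixes = re.findall(r'\.\w+', tail)
    return local, host, suffixes

result = split_email('yasar@webmail.something.edu.tr')
print(result)  # ('yasar', 'webmail', ['.something', '.edu', '.tr'])
```

The second pass is cheap, and it keeps the main pattern free of a fixed number of optional groups.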


5 Comments

Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of \1, \2, \3 etc. change depending on how many times you matched (\.\w+)?
@Li-aung Yip: \1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as a replacement pattern and call m.captures() in it.
In your example, the meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr.
@Li-aung Yip: m.groups() above explicitly shows what \4 is.
The meaning hasn't changed: \4 is m.group(4) whatever it is.
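
To make the discussion about backreferences concrete, here is a small sketch using the standard re module rather than regex (so it cannot call m.captures()). It assumes the same "last repetition wins" behaviour for a numbered group, and shows a replacement function that recovers every repetition by splitting an outer group. The names PATTERN and reverse_labels are made up for this example.

```python
import re

text = 'webmail.something.edu.tr'

# With the capturing group inside the repeat, \2 is only the last repetition.
last_only = re.sub(r'(\w+)(\.\w+)+', r'\1\2', text)
print(last_only)  # webmail.tr

# Outer group 2 captures the whole run; the repeated inner group is non-capturing.
PATTERN = re.compile(r'(\w+)((?:\.\w+)+)')

def reverse_labels(m):
    # m.group(2) is e.g. '.something.edu.tr'; split it into its labels.
    labels = m.group(2).lstrip('.').split('.')
    return m.group(1) + '.' + '.'.join(reversed(labels))

rewritten = PATTERN.sub(reverse_labels, text)
print(rewritten)  # webmail.tr.edu.something
```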
14

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
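
A short comparison may help show what the extra parentheses change. This is only a sketch with the standard re module; the sample address is the one from the question, and the variable names are illustrative.

```python
import re

address = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: only the last repetition is kept.
inner = re.search(r'@\w+(\.\w+)+', address)

# Capturing group around a non-capturing repeat: the whole run is kept.
outer = re.search(r'@\w+((?:\.\w+)+)', address)

print(inner.group(1))  # .tr
print(outer.group(1))  # .something.edu.tr

# The run is still one string, so split it to get the individual labels.
parts = outer.group(1).split('.')[1:]
print(parts)  # ['something', 'edu', 'tr']
```

Note that the outer group returns the run as a single string; getting separate pieces still needs the extra split step.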

4 Comments

For abbreviations (if you've lower-cased): re.sub(ur'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)
Thanks. I found that adding parentheses allowed me to match a repeated subpattern, but then there was an extra group in the match holding only the last repetition of the pattern. I hadn't seen that (?: ...) makes a non-capturing group. docs.python.org/2/library/re.html#regular-expression-syntax Adding that fixes the problem.
this doesn't split the groups
This doesn't even capture the pattern. It's a non-capturing group by definition!
12

This will work:

>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "user@galactica.caprica.fleet.mil"  # local part "user" is a placeholder
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But it's limited to a maximum of six subgroups. A better way to do this would be:

>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this pattern will break on. See this question for a detailed treatment of email address regexes.


1

A couple of years late, but I'll write it here in case someone needs it.

You can wrap the repeated group in another group, like so (simplified):

import re

domains_re = re.compile(r"\w+\@[\w\d]+(?P<domains>(?:\.[\w\d]+)+)")
print(domains_re.match("yasar@webmail.something.edu.tr").group("domains"))
# .something.edu.tr

(?:\.[\w\d]+)+ is the repeated group that matches the dotted domain parts (subdomains and the top-level domain). The ?: keeps that inner group from capturing, so we wrap it in another group and name it "domains": (?P<domains>...)


0

You can also try using findall.

re.findall(r'(\.\w+)', email_string)
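
One caveat worth sketching: run over the whole address, findall also picks up dots in the local part. The example below assumes a made-up address (first.last@...) to show this, and restricts the scan to the text after the last '@' as one possible workaround.

```python
import re

# Hypothetical address with a dot in the local part.
email_string = 'first.last@webmail.something.edu.tr'

# Scanning the whole address also returns '.last' from the local part.
everything = re.findall(r'(\.\w+)', email_string)
print(everything)  # ['.last', '.something', '.edu', '.tr']

# Restricting the scan to the text after the last '@' avoids that.
domain = email_string.rpartition('@')[2]
domain_only = re.findall(r'(\.\w+)', domain)
print(domain_only)  # ['.something', '.edu', '.tr']
```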

