Capturing repeating subpatterns in Python regex

Question

While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am actually doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is, however, supported by the third-party regex module.
– Todd Owen, Commented Oct 15, 2018 at 0:27
@ToddOwen But isn't this now possible in 2.7? I don't know when it became possible, but the answer from stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module.
– Michael Ohlrogge, Commented Nov 25, 2018 at 0:22
@MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and re will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses.)
– Todd Owen, Commented Nov 28, 2018 at 21:36
@ToddOwen Got it, thank you, that is a helpful clarification!
– Michael Ohlrogge, Commented Nov 29, 2018 at 1:03

Community · Accepted Answer · 2017-05-23 12:09:44Z

47

The re module doesn't support repeated captures (the third-party regex module does):

>>> import regex  # third-party package: pip install regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I'd go with splitting the repeated subpatterns later. It leads to simple, readable code; see, for example, the code in @Li-aung Yip's answer.
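
As a rough sketch of the "split later" approach using only the standard re module: the repeated run is captured as one group, and a second pass breaks it into its repetitions, which reproduces the list that m.captures(4) returns above. The names EMAIL_RE and domain_parts are made up for this illustration, and the pattern is the simplified one from the question, not a real email validator.

```python
import re

# Hypothetical names; the pattern is the simplified example from the question.
EMAIL_RE = re.compile(r'([.\w]+)@(\w+)((?:\.\w+)+)')

def domain_parts(address):
    """Return (user, host, [suffixes]), or None if the address does not match."""
    m = EMAIL_RE.fullmatch(address)
    if m is None:
        return None
    user, host, tail = m.groups()
    # Second pass: split the captured run into its individual repetitions.
    return user, host, re.findall(r'\.\w+', tail)

print(domain_parts('yasar@webmail.something.edu.tr'))
# ('yasar', 'webmail', ['.something', '.edu', '.tr'])
```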

edited May 23, 2017 at 12:09

answered Mar 19, 2012 at 5:22

jfs

5 Comments

Li-aung Yip Over a year ago

Out of curiosity, how do you write a replacement pattern when you match repeated captures? Does the meaning of \1, \2, \3 etc. change depending on how many times you matched (\.\w+)?

jfs Over a year ago

@Li-aung Yip: \1 corresponds to m.group(1); the meaning hasn't changed. You could use a function as a replacement pattern and call m.captures() in it.

Li-aung Yip Over a year ago

In your example, the meaning of \1, \2, and \3 is obvious because they only capture once. But what is the meaning of \4, corresponding to (\.\w+)+? \4 appears to be "the last substring matched by the 4th capture group", in this case .tr.

jfs Over a year ago

@Li-aung Yip: m.groups() above explicitly shows what \4 is.

jfs Over a year ago

The meaning hasn't changed: \4 is m.group(4) whatever it is.
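
To illustrate the point about backreferences, here is a small sketch with the standard re module, where a group number in a replacement template likewise refers to the last repetition. Since re has no captures(), the replacement function below (spell_out, a name invented for this example) re-splits the matched text instead; with the regex module one could call m.captures() there, as suggested above.

```python
import re

address = 'yasar@webmail.something.edu.tr'
pattern = re.compile(r'([.\w]+)@(\w+)(\.\w+)+')

# \3 in a replacement template is m.group(3): the last repetition only.
template_result = pattern.sub(r'\1 at \2 ending in \3', address)
print(template_result)  # yasar at webmail ending in .tr

# A replacement function can inspect the match and rebuild every repetition.
def spell_out(m):
    tail = m.group(0).split('@', 1)[1]  # everything after the @
    return m.group(1) + ' at ' + ' dot '.join(tail.split('.'))

function_result = pattern.sub(spell_out, address)
print(function_result)  # yasar at webmail dot something dot edu dot tr
```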

Taymon · Accepted Answer · 2012-03-19 04:28:11Z

14

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
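
A minimal sketch of the difference, using the address from the question (the variable names are only for illustration). Note that the outer group holds the whole run as a single string, so it still has to be split if the individual pieces are needed.

```python
import re

address = 'yasar@webmail.something.edu.tr'

# Capturing group inside the repeat: only the last repetition is kept.
last_only = re.search(r'@\w+(\.\w+)+', address).group(1)

# Non-capturing group repeated inside a capturing group: the whole run is kept.
whole_run = re.search(r'@\w+((?:\.\w+)+)', address).group(1)

print(last_only)                         # .tr
print(whole_run)                         # .something.edu.tr
print(whole_run.lstrip('.').split('.'))  # ['something', 'edu', 'tr']
```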

answered Mar 19, 2012 at 4:28

Taymon

4 Comments

scharfmn Over a year ago

For abbreviations (if you've lower-cased): re.sub(r'((?:[a-z]\.){2,})', lambda m: m.group(1).replace('.', ''), text)

Tim Swena Over a year ago

Thanks. Adding parentheses allowed me to match a repeated subpattern, but then there was a group in the match containing only the last repetition of the pattern. I hadn't seen that (?: ...) makes a non-capturing group: docs.python.org/2/library/re.html#regular-expression-syntax. Using that fixes the problem.

Jules Gagnon-Marchand Over a year ago

this doesn't split the groups

Tushar Vazirani Over a year ago

This doesn't even capture the pattern. It's a non-capturing group by definition!

Community · Accepted Answer · 2017-05-23 11:46:24Z

12

This will work:

>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "user@galactica.caprica.fleet.mil"  # "user" is a placeholder local part
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But it's limited to a maximum of six subgroups. A better way to do this would be:

>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that regexps are fine as long as the email addresses are simple, but there are all kinds of valid addresses that these patterns will break on. See this question for a detailed treatment of email address regexes.

edited May 23, 2017 at 11:46

answered Mar 19, 2012 at 4:50

Li-aung Yip

dsal3389 · Accepted Answer · 2025-02-03 11:01:48Z

1

A couple of years late, but I'll write it here in case someone needs it.

You can wrap the repeated group in another group, like so (simplified):

import re

domains_re = re.compile(r"\w+\@[\w\d]+(?P<domains>(?:\.[\w\d]+)+)")
print(domains_re.match("yasar@webmail.something.edu.tr").group("domains"))
# .something.edu.tr

(?:\.[\w\d]+)+ is the repeated group that matches the dot-separated domain parts. The ?: makes it non-capturing, so we wrap it in an outer group named "domains", (?P<domains>...), which captures the whole repeated run as one string.

answered Feb 3 at 11:01

dsal3389

Dhruv · Accepted Answer · 2025-08-28 04:57:43Z

0

You can also try using findall.

re.findall(r'(\.\w+)', email_string)
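
One caveat, sketched below under the assumption that the address may contain a dot before the @ (the address here is hypothetical): findall scans the whole string, so it also returns pieces of the local part. Searching only the text after the @ is one way around that.

```python
import re

email_string = 'first.last@webmail.something.edu.tr'  # hypothetical address

# findall scans the whole string, so a dot in the local part is picked up too.
all_hits = re.findall(r'\.\w+', email_string)
print(all_hits)     # ['.last', '.something', '.edu', '.tr']

# Restricting the search to the text after the @ avoids that.
domain = email_string.partition('@')[2]
domain_hits = re.findall(r'\.\w+', domain)
print(domain_hits)  # ['.something', '.edu', '.tr']
```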

answered Aug 28 at 4:57

Dhruv
