Capture repeated groups in python regex

Question

I have a mail log file, which is like this:

Aug 15 00:01:06 **** sm-mta*** to=<[email protected]>,<[email protected]>,[email protected], some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<[email protected]>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<[email protected]>,<[email protected]>, some_more_stuff

What I want is a list of all mail hosts in lines that contain "sm-mta". In this case that would be: ['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']

re.findall with the pattern r'sm-mta.*to=.+?@(.*?)[>, ]' returns only the first host of each matching line (['gmail.com', 'gmail.com']).

re.findall with the pattern r'.+?@(.*?)[>, ]' returns every host on every line, but I need the "sm-mta" filtering too. Is there a workaround for this?
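
To make the two behaviours concrete, here is a minimal sketch that runs both patterns with the standard re module. The addresses in the log are invented placeholders (the real ones are not shown above), and the variable names are only for illustration.

```python
import re

# Hypothetical stand-in for the mail log; the addresses are made up.
log = """\
Aug 15 00:01:06 **** sm-mta*** to=<alice@gmail.com>,<bob@yahoo.com>,carol@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<dave@gmail.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<erin@gmail.com>,<frank@gmail.com>, some_more_stuff"""

# Every match has to start at "sm-mta", and the single capturing group
# is filled once per match, so each sm-mta line yields one host here.
filtered = re.findall(r'sm-mta.*to=.+?@(.*?)[>, ]', log)

# Without the "sm-mta" prefix every address matches, including the
# one on the sendmail line.
unfiltered = re.findall(r'.+?@(.*?)[>, ]', log)

print(filtered)
print(unfiltered)
```

The first list has one entry per sm-mta line, and the second has one entry per address regardless of the line it came from.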

You can try this one eval.in/875159

– Sahil Gulati, Commented Oct 6, 2017 at 10:53

Wiktor Stribiżew · Accepted Answer · 2017-10-06 11:19:45Z

If you cannot use the PyPI regex library, you will have to do it in two steps: 1) grab the lines that contain sm-mta and 2) extract the values you need, with something like

import re

txt="""Aug 15 00:01:06 **** sm-mta*** to=<[email protected]>,<[email protected]>,[email protected], some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<[email protected]>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<[email protected]>,<[email protected]>, some_more_stuff"""
rx = r'@([^\s>,]+)'
filtered_lines = [x for x in txt.split('\n') if 'sm-mta' in x]
print(re.findall(rx, " ".join(filtered_lines)))

See the Python demo online. The @([^\s>,]+) pattern matches @, then captures and returns one or more characters other than whitespace, > and ,.
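
If the log is large, the same two steps can be applied line by line rather than joining the filtered lines into one string. The following is only a sketch of that idea: hosts_for, SAMPLE_LOG, and the addresses in it are hypothetical names and data, not part of the original post.

```python
import re

# Hypothetical sample log; the addresses are invented placeholders.
SAMPLE_LOG = """\
Aug 15 00:01:06 **** sm-mta*** to=<alice@gmail.com>,<bob@yahoo.com>,carol@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<dave@gmail.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<erin@gmail.com>,<frank@gmail.com>, some_more_stuff"""

# Same host pattern as above, compiled once for reuse.
HOST_RX = re.compile(r'@([^\s>,]+)')

def hosts_for(lines, marker='sm-mta'):
    """Collect mail hosts from the lines that contain `marker`."""
    hosts = []
    for line in lines:
        if marker in line:                       # step 1: keep only matching lines
            hosts.extend(HOST_RX.findall(line))  # step 2: extract every host
    return hosts

hosts = hosts_for(SAMPLE_LOG.splitlines())
print(hosts)
```

Because hosts_for accepts any iterable of lines, an open file object could be passed in place of SAMPLE_LOG.splitlines().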

If you can use the PyPI regex library, you can get the list of strings you need with

>>> import regex
>>> x="""Aug 15 00:01:06 **** sm-mta*** to=<[email protected]>,<[email protected]>,[email protected], some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<[email protected]>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<[email protected]>,<[email protected]>, some_more_stuff"""
>>> rx = r'(?:^(?=.*sm-mta)|\G(?!^)).*?@\K[^\s>,]+'
>>> print(regex.findall(rx, x, regex.M))
['gmail.com', 'yahoo.com', 'aol.com', 'gmail.com', 'gmail.com']

See the Python online demo and a regex demo.

Pattern details

(?:^(?=.*sm-mta)|\G(?!^)) - either the start of a line that has the sm-mta substring after any 0+ chars other than line break chars, or the place where the previous match ended (the (?!^) lookahead keeps \G from matching at the start of a line)
.*?@ - any 0+ chars other than line break chars, as few as possible, up to the @ and a @ itself
\K - a match reset operator that discards all the text matched so far in the current iteration
[^\s>,]+ - 1 or more chars other than whitespace, , and >
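
When it is not known in advance whether the regex package is installed, one possible arrangement is to try the single-pattern approach and fall back to two passes with the standard re module. This is a sketch under that assumption: find_hosts and the sample addresses are hypothetical, and the fallback is a variation on the two-step idea, not the exact code shown earlier.

```python
import re

try:
    import regex  # third-party package from PyPI; optional here
except ImportError:
    regex = None

def find_hosts(text):
    """Return the mail hosts found on lines that contain 'sm-mta'."""
    if regex is not None:
        # Single pattern using \G and \K, as described above.
        rx = r'(?:^(?=.*sm-mta)|\G(?!^)).*?@\K[^\s>,]+'
        return regex.findall(rx, text, regex.M)
    # Fallback: select the sm-mta lines first, then extract the hosts.
    lines = re.findall(r'^.*sm-mta.*$', text, re.M)
    return [host for line in lines
            for host in re.findall(r'@([^\s>,]+)', line)]

# Hypothetical sample; the addresses are invented placeholders.
sample = """\
Aug 15 00:01:06 **** sm-mta*** to=<alice@gmail.com>,<bob@yahoo.com>,carol@aol.com, some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<dave@gmail.com>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<erin@gmail.com>,<frank@gmail.com>, some_more_stuff"""

print(find_hosts(sample))
```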

vks · Answer · 2017-10-06 10:49:57Z

Try the regex module.

x="""Aug 15 00:01:06 **** sm-mta*** to=<[email protected]>,<[email protected]>,[email protected], some_more_stuff
Aug 16 13:16:09 **** sendmail*** to=<[email protected]>, some_more_stuff
Aug 17 11:14:48 **** sm-mta*** to=<[email protected]>,<[email protected]>, some_more_stuff"""
import regex
print(regex.findall(r"sm-mta.*to=\K|\G(?!^).+?@(.*?)[>, ]", x, flags=regex.V1))

Output: ['', 'gmail.com', 'yahoo.com', 'aol.com', '', 'gmail.com', 'gmail.com']

Just ignore the empty match that starts each sm-mta line.
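
One simple way to drop those empty entries is a list comprehension. The sketch below starts from the output list shown above, so it does not need the regex module; the variable names are only illustrative.

```python
# Result list copied from the output shown above.
raw = ['', 'gmail.com', 'yahoo.com', 'aol.com', '', 'gmail.com', 'gmail.com']

# Empty strings are falsy, so this keeps only the real host names.
hosts = [host for host in raw if host]
print(hosts)
```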

https://regex101.com/r/7zPc6j/1
