RegEx - Parsing name and last name from a string

Question

I'm trying to parse every first name and last name pair from a string in Outlook's "To" format, and save each one in a Python list. I'm using Python 3.6.4.
For example, I would like the following string:

"To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;"

to be parsed into:

['John Lennon','Paul McCartney']

I used the question "Replace all words from word list with another string in python" as a reference and came up with this code:

import re
prohibitedWords = [r'to:',r'To:','\b002',"\<(.*?)\>"]
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
big_regex = re.compile('|'.join(prohibitedWords))
the_message = big_regex.sub("", str(mystring)).strip()
print(the_message)

However, I'm getting the following results:

John Lennon  ; Paul McCartney  ;

This is not optimal as I'm getting lots of spaces which I cannot parse. In addition, I have a feeling this is not the optimal approach for this. Appreciate any advice.
Thanks

Maybe use re.findall('(?:\\bto:|\b002;)\s*(.*?)\s*<', mystring, re.I)?
– Wiktor Stribiżew, Commented Nov 16, 2021 at 16:13

The fourth bird · Accepted Answer · 2021-11-16 19:45:24Z

1

Using re.sub and creating an alternation with these parts [r'to:',r'To:','\b002',"\<(.*?)\>"] you will replace the matches with an empty string.

Once every part that you want to remove is gone, you are left with the string John Lennon Paul McCartney, as in this Python example, and you can no longer tell which part belongs to which name if you then want to split it.

Stripping the surrounding whitespace chars as well can leave unexpected gaps, or concatenate parts that should stay separate.
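A minimal sketch of that ambiguity. The addresses at example.com are placeholders assumed here, because the real ones are redacted in the question, and the alternation is a slightly extended version of the question's list that also removes the semicolons.

```python
import re

# Placeholder addresses (assumption): the originals are redacted in the question.
mystring = 'To: John Lennon <john@example.com> \b002; Paul McCartney <paul@example.com> \b002;'

# Non-raw pattern on purpose: \b here is the backspace char, as in the question.
removed = re.sub('to:|To:|\b002;?|<.*?>', '', mystring)
print(repr(removed))

# Collapsing the leftover whitespace leaves no separator between the two people.
collapsed = ' '.join(removed.split())
print(collapsed)          # John Lennon Paul McCartney
print(collapsed.split())  # four tokens, with nothing marking where one name ends
```

With two-word names the result could still be regrouped by hand, but a single-word or three-word name would make that guess wrong.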

You could make the match more specific by matching the possible leading parts, and capture the part that you want instead of replacing.

(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>

(?:\\b[Tt]o:|\b002;) Match either To: or to: at a word boundary, or a backspace char followed by 002;
\s* Match optional whitespace chars
(.+?) Capture 1 or more chars in group 1, as few as possible
\s* Match optional whitespace chars
<[^<>@]+@[^<>@]+> Match an address with a single @ between angle brackets
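The breakdown above can also be kept next to the pattern itself with re.VERBOSE. This is only a sketch: it uses a raw string, so the backspace is written as \x08 rather than a bare \b, and the example.com addresses are placeholders assumed here because the originals are redacted.

```python
import re

pattern = re.compile(
    r"""
    (?:\b[Tt]o:|\x08002;)   # To: or to: at a word boundary, or backspace + 002;
    \s*                     # optional whitespace
    (.+?)                   # group 1: the display name, as short as possible
    \s*                     # optional whitespace
    <[^<>@]+@[^<>@]+>       # an address with a single @ between angle brackets
    """,
    re.VERBOSE,
)

# Placeholder addresses (assumption); \x08 is the backspace char that \b produces
# in the question's non-raw string literal.
sample = 'To: John Lennon <john@example.com> \x08002; Paul McCartney <paul@example.com> \x08002;'

names = pattern.findall(sample)
print(names)  # ['John Lennon', 'Paul McCartney']
```

In a raw string \b always means a word boundary to the regex engine, which is why the backspace has to be spelled \x08 there.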

See a regex demo and a Python demo.

For example

import re

pattern = "(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>"
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
print(re.findall(pattern, mystring))

Output

['John Lennon', 'Paul McCartney']

edited Nov 16, 2021 at 19:45

answered Nov 16, 2021 at 16:43


2 Comments

Wiktor Stribiżew Over a year ago

\b in \b002; is a backspace char. OP has got a simple string literal, not a raw one.

The fourth bird Over a year ago

@WiktorStribiżew So \b, as written here without the r prefix, becomes \x08 in both the regex and mystring, as in "(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>", right?
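The point of this exchange can be checked with a short sketch. The variable names plain and raw are made up for the illustration.

```python
import re

plain = '\b002;'   # non-raw literal: \b is the single backspace char, \x08
raw = r'\b002;'    # raw literal: a backslash and the letter b, two chars

print(len(plain), len(raw))   # 5 6
print(plain == '\x08002;')    # True

# Used as a pattern, the raw form means "word boundary, then 002;",
# so it matches without consuming the backspace.
print(repr(re.search(raw, plain).group()))    # '002;'

# The non-raw form contains a literal backspace and consumes it.
print(repr(re.search(plain, plain).group()))  # '\x08002;'
```

Both forms find a match in this input, so the difference is easy to miss: it only shows in whether the backspace char is part of the matched text.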
