I'm trying to parse every first and last name out of a string in Outlook's "To" format and save each one in a Python list. I'm using Python 3.6.4.
For example, I would like the following string:

"To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;"

to be parsed into:

['John Lennon','Paul McCartney']

I used the question "Replace all words from word list with another string in python" as a reference and came up with this code:

import re
prohibitedWords = [r'to:',r'To:','\b002',"\<(.*?)\>"]
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
big_regex = re.compile('|'.join(prohibitedWords))
the_message = big_regex.sub("", str(mystring)).strip()
print(the_message)

However, I'm getting the following results:

John Lennon  ; Paul McCartney  ;

This is not ideal, as I'm left with extra spaces that I cannot parse. I also have a feeling that this is not the best approach. I'd appreciate any advice.
Thanks

    Maybe use re.findall('(?:\\bto:|\b002;)\s*(.*?)\s*<', mystring, re.I)? Commented Nov 16, 2021 at 16:13

1 Answer

Using re.sub with an alternation built from these parts [r'to:',r'To:','\b002',"\<(.*?)\>"] replaces every match with an empty string.

Once everything you want to remove is gone, you end up with a string like John Lennon Paul McCartney (as in this Python example), where you can no longer tell where one name ends and the next begins if you want to split it, for example.

Removing the surrounding whitespace chars as well might leave unexpected gaps or run neighbouring parts together.
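
To make this concrete, here is a minimal sketch of both outcomes. The addresses are hypothetical placeholders (the ones in the question are obscured), and \x08 spells out the backspace character that '\b' produces in a regular string literal. Removing the semicolons along with everything else merges the names, while keeping them leaves a separator to split on:

```python
import re

# Hypothetical addresses; "\x08" is the backspace that "\b" yields in a non-raw literal.
mystring = ("To: John Lennon <john@example.com> \x08002; "
            "Paul McCartney <paul@example.com> \x08002;")

# Remove everything, semicolons included: the boundary between names is lost.
merged = re.sub(r"[Tt]o:|\x08002;|<.*?>", "", mystring)
merged = " ".join(merged.split())

# Keep the semicolons, then split on them and trim each piece.
kept = re.sub(r"[Tt]o:|\x08002|<.*?>", "", mystring)
names = [part.strip() for part in kept.split(";") if part.strip()]

print(merged)
print(names)
```

The second variant works only because the semicolon survives as a delimiter; it would break if a display name itself contained one.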

You could make the match more specific by matching the possible leading parts and capturing the part that you want, instead of replacing the parts that you don't.

(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>
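  • (?:\\b[Tt]o:|\b002;) Match either a word boundary followed by To: or to:, or a backspace char followed by 002; (the pattern is a regular string literal, so \\b reaches the regex as the word boundary \b, while \b is the backspace \x08)
  • \s* Match optional whitespace chars
  • (.+?) Capture 1 or more chars, as few as possible, in group 1
  • \s* Match optional whitespace chars
  • <[^<>@]+@[^<>@]+> Match an address with a single @ between angle brackets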
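
The same breakdown can be written as a commented pattern. This is only a sketch: it assumes a raw string with re.VERBOSE, so the backspace has to be spelled \x08; the name NAME_PATTERN and the example.com addresses are hypothetical.

```python
import re

# Raw string, so the backspace is written as \x08 and \b stays a word boundary.
NAME_PATTERN = re.compile(
    r"""
    (?:\b[Tt]o:|\x08002;)   # leading marker: To: or to:, or backspace followed by 002;
    \s*                     # optional whitespace
    (.+?)                   # group 1: the display name, as short as possible
    \s*                     # optional whitespace
    <[^<>@]+@[^<>@]+>       # an address in angle brackets with a single @
    """,
    re.VERBOSE,
)

# Hypothetical sample in the same shape as the string from the question.
sample = ("To: John Lennon <john@example.com> \x08002; "
          "Paul McCartney <paul@example.com> \x08002;")

# With one capture group, findall returns the group 1 text of every match.
names = NAME_PATTERN.findall(sample)
print(names)  # ['John Lennon', 'Paul McCartney']
```

The trailing \x08002; produces no match because no name and address follow it.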

See a regex demo and a Python demo.

For example

import re

pattern = "(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>"
mystring = 'To: John Lennon <[email protected]> \b002; Paul McCartney <[email protected]> \b002;'
print(re.findall(pattern, mystring))

Output

['John Lennon', 'Paul McCartney']

2 Comments

\b in \b002; is a backspace char. OP has got a simple string literal, not a raw one.
@WiktorStribiżew So \b, as written here without the r prefix, would become \x08 in both the regex and mystring, as in "(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>", right?
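
A quick way to check that reading, offered as a sketch rather than a definitive answer: in a regular string literal \b is the single backspace character \x08, while \\b (or \b in a raw string) is a backslash followed by b, which the regex engine treats as a word boundary. The variable names and the sample text below are hypothetical.

```python
import re

# Regular literal: "\b" is one character, the backspace (code point 8).
assert "\b" == "\x08" and len("\b") == 1

# Raw literal: two characters, a backslash and the letter b.
assert r"\b" == "\\b" and len(r"\b") == 2

# The leading alternation from the answer, as a regular and as a raw literal.
plain = "(?:\\b[Tt]o:|\b002;)"   # contains a real backspace character
raw = r"(?:\b[Tt]o:|\x08002;)"   # contains the regex escape \x08 instead

sample = "To: a \b002; b"        # regular literal, so \b is a backspace here too
plain_matches = re.findall(plain, sample)
raw_matches = re.findall(raw, sample)
print(plain_matches == raw_matches)
```

Both spellings find the same two markers; the raw form just makes the backspace visible in the source.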
