Regex to capture different elements in unknown order of appearance in Python

Question

I am building a regex to extract the header values from a forwarded email in Python. I am only interested in the first appearance of these kinds of headers in an email and I only want to capture the text parts appearing after the colons.

From: ...  
Sent: ...   
To: ...   
Subject: ...

The following regex works fine using re.search for most variations of the above format:

(?:From\s*:\s*)(.*)(?:\n*)(?:Sent\s*:\s*)(.*)(?:\n*)(?:To\s*:\s*)(.*)(?:\n*)(?:Subject\s*:\s*)

but sometimes, the different header parts are ordered differently and have missing elements, such as below:

Sent: ...    
From: ...  
Subject: ...

I thought I could use a positive lookahead to match the header format in any order but I could not get this to work. Does anyone have any idea how this can be done efficiently? Any help is greatly appreciated.

I don't think I can use this library as my data set is multilingual.
– user10218300, Commented Aug 13, 2018 at 10:10

CertainPerformance · Accepted Answer · 2018-08-13 10:23:46Z

1

One possibility would be to never consume any characters, and use lookahead to capture everything you need in optional groups:

(?=(?:.*^From\s*:\s*)(.*?$)|)(?=(?:.*^Sent\s*:\s*)(.*?$)|)(?=(?:.*^To\s*:\s*)(.*?$)|)(?=(?:.*^Subject\s*:\s*)(.*?$)|)

https://regex101.com/r/pOThDP/2

Spaced out, that's just 4 repetitions of a similar pattern that looks like:

(?=(?:.*^From\s*:\s*)(.*?$)|)
(?=(?:.*^Sent\s*:\s*)(.*?$)|)
(?=(?:.*^To\s*:\s*)(.*?$)|)
(?=(?:.*^Subject\s*:\s*)(.*?$)|)

Also, you might consider named capture groups, for clarity:

(?=(?:.*^From\s*:\s*)(?P<From>.*?$)|)(?=(?:.*^Sent\s*:\s*)(?P<Sent>.*?$)|)(?=(?:.*^To\s*:\s*)(?P<To>.*?$)|)(?=(?:.*^Subject\s*:\s*)(?P<Subject>.*?$)|)

https://regex101.com/r/pOThDP/3
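With named groups, the result can be read as a dictionary through groupdict(). The following is only a sketch: the LABELS list, the sample text, and the idea of building the pattern in a loop are assumptions for illustration, not part of the original pattern.

```python
import re

# Hypothetical list of header labels; each one becomes a named group.
LABELS = ["From", "Sent", "To", "Subject"]

# Build the same lookahead pattern as above, one optional group per label.
pattern = re.compile(
    "".join(rf"(?=(?:.*^{re.escape(label)}\s*:\s*)(?P<{label}>.*?$)|)" for label in LABELS),
    flags=re.S | re.M,  # both flags are required for this pattern
)

# Made-up sample: headers in a different order, with "To" missing.
text = "Sent: Monday\nFrom: alice@example.com\nSubject: Hello\n\nBody text"

headers = pattern.search(text).groupdict()
print(headers)
# {'From': 'alice@example.com', 'Sent': 'Monday', 'To': None, 'Subject': 'Hello'}
```

Missing headers come back as None, so the caller can tell "absent" apart from "empty".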

Edit: Example in Python code:

import re

text = '''To: totext
Sent: sent text
this text has no "from" label
Subject: subject text'''
# re.S lets .* cross newlines; re.M makes ^ and $ match at every line boundary.
pattern = re.compile(r'(?=(?:.*^From\s*:\s*)(.*?$)|)(?=(?:.*^Sent\s*:\s*)(.*?$)|)(?=(?:.*^To\s*:\s*)(.*?$)|)(?=(?:.*^Subject\s*:\s*)(.*?$)|)', flags=re.S | re.M)
match = pattern.search(text)
print(match.groups())

Output is:

(None, 'sent text', 'totext', 'subject text')

edited Aug 13, 2018 at 10:23

answered Aug 13, 2018 at 9:19

CertainPerformance


5 Comments

user10218300 Over a year ago

This looks great! When I try it out in regex101 it works (only issue is that it doesn't just capture the first occurrence). For some reason however, my python code finds a match in each email with this regex that captures only None for each of the values. I'm not sure why though.

user10218300 Over a year ago

Update: I forgot to set the m and s flags. It works perfectly now! Do you have any idea how this can be adapted to only return the first match for each of the different parts?

CertainPerformance Over a year ago

Make all the quantifiers before the label text lazy instead of greedy, e.g. (?=(?:.*?^From as in regex101.com/r/pOThDP/4
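To illustrate the difference, here is a small sketch with made-up sample text containing two "From" lines. Only the "From" lookahead is shown; the other labels would behave the same way.

```python
import re

# Made-up sample with two header blocks.
text = """From: first@example.com
Subject: first subject

From: second@example.com
Subject: second subject"""

# Greedy .* runs to the end of the text and backtracks, so it finds the LAST label.
greedy = re.compile(r"(?=(?:.*^From\s*:\s*)(.*?$)|)", flags=re.S | re.M)
# Lazy .*? expands from the start, so it finds the FIRST label.
lazy = re.compile(r"(?=(?:.*?^From\s*:\s*)(.*?$)|)", flags=re.S | re.M)

print(greedy.search(text).group(1))  # second@example.com
print(lazy.search(text).group(1))    # first@example.com
```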

user10218300 Over a year ago

I'm sorry, I meant it differently.. Let's say the first header block contains only From, Sent, To, and the second block contains From, Sent, To, Subject, then I don't want the result to include the Subject line of the second forwarded block, but now that one is captured because the first block didn't contain that one.

CertainPerformance Over a year ago

Can you give a fuller example of the input and desired output? You might require there not be two line breaks in a row, maybe
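One way to act on the "no two line breaks in a row" idea is to cut out the first header block before applying the lookahead pattern. This is only a sketch under assumptions not confirmed in the thread: a header block starts at the first line beginning with one of the labels and ends at the first blank line ("\n\n") after that. The function name first_block_headers and the sample text are hypothetical.

```python
import re

LABELS = ["From", "Sent", "To", "Subject"]
label_alt = "|".join(LABELS)

# Start of the first header line anywhere in the text.
first_header = re.compile(rf"^(?:{label_alt})\s*:", flags=re.M)
# Lazy lookahead pattern with named groups, one optional group per label.
fields = re.compile(
    "".join(rf"(?=(?:.*?^{label}\s*:\s*)(?P<{label}>.*?$)|)" for label in LABELS),
    flags=re.S | re.M,
)

def first_block_headers(text):
    """Return the headers of the first header block only (hypothetical helper)."""
    start = first_header.search(text)
    if start is None:
        return {label: None for label in LABELS}
    # Assumption: the block ends at the first blank line after it starts.
    block = text[start.start():].split("\n\n", 1)[0]
    return fields.search(block).groupdict()

text = """Hi, see below.

From: a@example.com
Sent: Monday
To: b@example.com

Body of the first mail.

From: c@example.com
Sent: Sunday
To: a@example.com
Subject: Older mail
"""

result = first_block_headers(text)
print(result)
# {'From': 'a@example.com', 'Sent': 'Monday', 'To': 'b@example.com', 'Subject': None}
```

The Subject of the second block is not picked up, because the search is limited to the first block. Text with "\r\n" line endings would need to be normalized first, and a body line that happens to start with one of the labels would be mistaken for a header.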

Michał Turczyn · Accepted Answer · 2018-08-13 09:22:45Z

0

Try this pattern: \G(From:|Subject:|Sent:|To:)(.+)\n

The requirement that only the first block be captured is met by the \G anchor, which ensures that each match (Sent/To/From/Subject) starts right where the previous one ended. The headers of another mail are not matched, because they are separated from the first block by the content of an e-mail.

The alternation ensures that the headers are matched regardless of the order of Sent/To/From/Subject.
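Note that Python's built-in re module does not support the \G anchor (the third-party regex module does). A rough equivalent with re is to call Pattern.match(text, pos) in a loop, since match only succeeds exactly at pos. The sketch below assumes the header block starts at pos; the function name leading_headers and the sample text are made up for illustration, and the pattern is slightly relaxed to allow spaces around the colon.

```python
import re

# One header line: a known label, a colon, then the rest of the line.
# [ \t]* is used instead of \s* so an empty value does not swallow the newline.
header_line = re.compile(r"(From|Subject|Sent|To)[ \t]*:[ \t]*(.*)\n?")

def leading_headers(text, pos=0):
    """Collect consecutive header lines starting at pos (hypothetical helper)."""
    headers = {}
    while True:
        # match() anchors at pos, which plays the role of \G here.
        m = header_line.match(text, pos)
        if m is None:
            break  # first line that is not a header ends the block
        headers.setdefault(m.group(1), m.group(2))
        pos = m.end()
    return headers

text = "Sent: Monday\nFrom: a@example.com\nSubject: Hi\n\nBody\nFrom: other@example.com\n"
found = leading_headers(text)
print(found)
# {'Sent': 'Monday', 'From': 'a@example.com', 'Subject': 'Hi'}
```

The second "From" line is ignored because the blank line and the body break the chain of consecutive matches.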

Demo

answered Aug 13, 2018 at 9:22

Michał Turczyn

