
I am building a regex to extract the header values from a forwarded email in Python. I am only interested in the first appearance of these kinds of headers in an email and I only want to capture the text parts appearing after the colons.

From: ...  
Sent: ...   
To: ...   
Subject: ...  

The following regex works fine using re.search for most variations of the above format:

(?:From\s*:\s*)(.*)(?:\n*)(?:Sent\s*:\s*)(.*)(?:\n*)(?:To\s*:\s*)(.*)(?:\n*)(?:Subject\s*:\s*)

but sometimes, the different header parts are ordered differently and have missing elements, such as below:

Sent: ...    
From: ...  
Subject: ... 

I thought I could use a positive lookahead to match the header format in any order but I could not get this to work. Does anyone have any idea how this can be done efficiently? Any help is greatly appreciated.

  • docs.python.org/3/library/email.parser.html Commented Aug 13, 2018 at 9:30
  • I don't think I can use this library as my data set is multilingual. Commented Aug 13, 2018 at 10:10

2 Answers

1

One possibility would be to never consume any characters, and use lookahead to capture everything you need in optional groups:

(?=(?:.*^From\s*:\s*)(.*?$)|)(?=(?:.*^Sent\s*:\s*)(.*?$)|)(?=(?:.*^To\s*:\s*)(.*?$)|)(?=(?:.*^Subject\s*:\s*)(.*?$)|)

https://regex101.com/r/pOThDP/2

Spaced out, that's just 4 repetitions of a similar pattern that looks like:

(?=(?:.*^From\s*:\s*)(.*?$)|)
(?=(?:.*^Sent\s*:\s*)(.*?$)|)
(?=(?:.*^To\s*:\s*)(.*?$)|)
(?=(?:.*^Subject\s*:\s*)(.*?$)|)

Also, you might consider named capture groups, for clarity:

(?=(?:.*^From\s*:\s*)(?P<From>.*?$)|)(?=(?:.*^Sent\s*:\s*)(?P<Sent>.*?$)|)(?=(?:.*^To\s*:\s*)(?P<To>.*?$)|)(?=(?:.*^Subject\s*:\s*)(?P<Subject>.*?$)|)

https://regex101.com/r/pOThDP/3

Edit: Example in Python code:

import re

text = '''To: totext
Sent: sent text
this text has no "from" label
Subject: subject text'''
# re.S lets . cross line breaks; re.M makes ^ and $ match at every line
pattern = re.compile(r'(?=(?:.*^From\s*:\s*)(.*?$)|)(?=(?:.*^Sent\s*:\s*)(.*?$)|)(?=(?:.*^To\s*:\s*)(.*?$)|)(?=(?:.*^Subject\s*:\s*)(.*?$)|)', flags=re.S | re.M)
match = pattern.search(text)
print(match.groups())

Output is:

(None, 'sent text', 'totext', 'subject text')
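
The named-group variant can be run the same way. The following is only a sketch: the LABELS tuple and the headers dict are names made up for this illustration, and the pattern is assembled from the same per-label lookahead used above. With named groups, match.groupdict() returns the values keyed by label, so missing headers are easy to filter out.

```python
import re

# Hypothetical helper tuple: the labels we want to extract.
LABELS = ("From", "Sent", "To", "Subject")

# One optional lookahead per label, each with a named capture group.
pattern = re.compile(
    "".join(rf"(?=(?:.*^{label}\s*:\s*)(?P<{label}>.*?$)|)" for label in LABELS),
    flags=re.S | re.M,
)

text = '''To: totext
Sent: sent text
this text has no "from" label
Subject: subject text'''

match = pattern.search(text)
# Keep only the headers that were actually found (absent groups are None).
headers = {name: value for name, value in match.groupdict().items() if value is not None}
print(headers)
```

Here the "From" key is absent from headers because no line starts with that label.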

5 Comments

This looks great! When I try it out in regex101 it works (only issue is that it doesn't just capture the first occurrence). For some reason however, my python code finds a match in each email with this regex that captures only None for each of the values. I'm not sure why though.
Update: I forgot to set the m and s flags. It works perfectly now! Do you have any idea how this can be adapted to only return the first match for each of the different parts?
Make all the quantifiers before the label text lazy instead of greedy, e.g. (?=(?:.*?^From (regex101.com/r/pOThDP/4)
I'm sorry, I meant it differently. Let's say the first header block contains only From, Sent, To, and the second block contains From, Sent, To, Subject. Then I don't want the result to include the Subject line of the second forwarded block, but right now it is captured because the first block didn't contain a Subject line.
Can you give a fuller example of the input and desired output? You might require there not be two line breaks in a row, maybe
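
A small sketch of the difference discussed in these comments, under the assumption that the input contains two forwarded header blocks (the sample text and the build_pattern helper are invented for this illustration). The greedy form picks up the last occurrence of each label, the lazy form the first, and in both cases a label missing from the first block is still filled from a later one.

```python
import re

def build_pattern(prefix):
    """Hypothetical helper: prefix is the quantifier before each label, '.*' or '.*?'."""
    labels = ("From", "Sent", "To", "Subject")
    return re.compile(
        "".join(rf"(?=(?:{prefix}^{label}\s*:\s*)(?P<{label}>.*?$)|)" for label in labels),
        flags=re.S | re.M,
    )

# Made-up input: the first block has no Subject, the second one does.
text = (
    "From: alice\nSent: Monday\n\nsome body text\n\n"
    "From: bob\nSent: Sunday\nSubject: older\n"
)

greedy = build_pattern(".*").search(text).groupdict()
lazy = build_pattern(".*?").search(text).groupdict()

print(greedy["From"], lazy["From"])  # last occurrence vs. first occurrence
print(lazy["Subject"])               # still taken from the second block
```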
0

Try this pattern: \G(From:|Subject:|Sent:|To:)(.+)\n

The requirement that only the first block be captured is fulfilled by the \G anchor, which ensures that each match (Sent/To/From/Subject) starts right where the previous one ended. The header of another mail isn't matched, because it is separated from the first block by the content of an e-mail.

The alternation ensures that the headers are matched regardless of the order of Sent/To/From/Subject.
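
As far as I know, Python's built-in re module does not accept \G (the third-party regex module does). A rough equivalent with plain re is to call Pattern.match repeatedly, passing the end of the previous match as the start position. This is only a sketch under some assumptions: first_header_block, BLOCK_START and HEADER_LINE are names invented here, the block is assumed to begin at the first line that starts with one of the labels, and the first value seen for each label wins.

```python
import re

LABELS = "From|Sent|To|Subject"
# Finds the first line that starts with one of the labels.
BLOCK_START = re.compile(rf"^(?:{LABELS})[ \t]*:", flags=re.M)
# One header line: label, colon, value up to the end of the line.
HEADER_LINE = re.compile(rf"({LABELS})[ \t]*:[ \t]*(.*)(?:\n|$)")

def first_header_block(text):
    """Return the headers of the first contiguous header block as a dict."""
    start = BLOCK_START.search(text)
    if start is None:
        return {}
    headers = {}
    pos = start.start()
    while True:
        # match() only succeeds exactly at pos, which plays the role of \G.
        m = HEADER_LINE.match(text, pos)
        if m is None:
            break  # the next line is not a header, so the block has ended
        headers.setdefault(m.group(1), m.group(2).strip())
        pos = m.end()
    return headers

text = (
    "Hi,\nsee below\n\n"
    "Sent: Monday\nFrom: alice\nSubject: hello\n\n"
    "body\n\n"
    "From: bob\nSent: Sunday\nTo: carol\nSubject: older\n"
)
print(first_header_block(text))
```

Because the loop stops at the first line that is not a header, the To line of the second block is not picked up.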

