Python Regex - Extract text between (multiple) expressions in a textfile

Question

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.

I want to extract all text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to run this on a number of files; below is an example of what such a text file looks like. Here I want to extract all text from "Dear" to "Douglas". If no "letter_end" expression is found, the output should start at the "letter_begin" match and run to the very end of the text file.

Edit: the recorded text has to end after the "letter_end" match and before the first line with 20 characters or more (as is the case for "Random text here as well", where len = 24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

This is my code so far, but it cannot flexibly capture the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end"):

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    # record all text between the beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
    print(output)

I am very thankful for every help!

You say you want to extract all text starting from "Dear" till "Douglas", but your regex has no Douglas. The ,\n\S+ would prevent the regex from matching even if you add it to the letter_end. Maybe all you want is regex = r"(?:" + openings + r").*?" + r"(?:" + closings + r")"?
– Wiktor Stribiżew, Commented Nov 6, 2018 at 10:05
@WiktorStribiżew: Many thanks for your help - this already looks quite good! Do you have any idea how to also get the next 5 words after the defined "letter_end"? (So I can get whatever name is after the closing expression?)
– Dominik Scheld, Commented Nov 6, 2018 at 10:18
How do you define "word"? What chars can there be between them? Look here, if you match 5 words, you might get more than just Douglas.
– Wiktor Stribiżew, Commented Nov 6, 2018 at 10:21
Okay, I see the problem. Is there a way to tell the regex to get the "next 2 lines" after "letter_end", as the "other random text" will only start at least 3 lines from the letter_end? -> r"(?:" + openings + r").*?" + r"(?:" + closings + [\Line+\Line+){0,2} r")" ?
– Dominik Scheld, Commented Nov 6, 2018 at 10:25
Remove re.DOTALL and use regex101.com/r/PmU3Ti/2, i.e. regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}". You do not need re.MULTILINE either, BTW.
– Wiktor Stribiżew, Commented Nov 6, 2018 at 10:26

Wiktor Stribiżew · Accepted Answer · 2018-11-06 15:08:30Z


You may use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern will result in a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

See the regex demo. Note you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant.

Details

(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ chars, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ chars other than newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ chars other than newline.
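
To illustrate the note about re.DOTALL, here is a small sketch that runs the same pattern with and without the flag. The shortened sample text is made up for illustration and only mimics the shape of the letter in the question.

```python
import re

# Shortened, hypothetical sample in the same shape as the letter in the question.
text = (
    "Intro line\n"
    "Dear Shareholders\n"
    "We value your trust.\n"
    "Best regards \n"
    "Douglas\n"
    "\n"
    "Random text here as well"
)

pattern = r"(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}"

# Without re.DOTALL, ".*" stops at the end of the closing line,
# so at most two further lines are added by the last group.
without_dotall = re.findall(pattern, text, re.IGNORECASE)

# With re.DOTALL, ".*" also matches newlines and greedily runs
# to the very end of the text, trailing random text included.
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall)
```

With the flag set, the match should swallow the trailing "Random text here as well" line, which is why the flag has to go.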

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
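
The question also asks that the match run to the end of the file when no closing expression is found; the pattern above returns nothing in that case. One possible way to cover it, sketched here as an assumption and not as part of the original answer, is to offer \Z (end of the string) as an alternative to the closing group. The helper name extract_letters and both sample strings are hypothetical; re.escape is added only so that an expression containing a regex metacharacter stays literal.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

# re.escape keeps each expression literal in case it contains a metacharacter.
openings = "|".join(map(re.escape, letter_begin))
closings = "|".join(map(re.escape, letter_end))

# Either stop at a closing expression (plus up to two following lines),
# or, if none is found, let \Z match at the very end of the text.
regex = r"(?:{})[\s\S]*?(?:(?:{}).*(?:\n.*){{0,2}}|\Z)".format(openings, closings)


def extract_letters(text):
    """Hypothetical helper: return every letter-like span found in text."""
    return re.findall(regex, text, re.IGNORECASE)


# Made-up samples: one with a closing expression, one without.
with_closing = (
    "Intro\nDear Shareholders\nThank you.\nBest regards \nDouglas\n\n"
    "Random text here as well"
)
without_closing = "Intro\nDear Shareholders\nThank you.\nNo closing line here"

print(extract_letters(with_closing))
print(extract_letters(without_closing))
```

Because [\s\S]*? is lazy, the closing alternative is tried at every position before the match grows, so \Z only takes effect when no closing expression occurs after the opening.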

answered Nov 6, 2018 at 15:08

Wiktor Stribiżew


2 Comments

Dominik Scheld Over a year ago

Many thanks Wiktor! I need one last change to the regex: the output text should stop before the first line after the "letter_end" match that has more than 20 chars. In the above example the output would stay the same, since len("Random text here as well") = 24. So the condition at the end of the regex is: stop at the first line after the "letter_end" match that contains > 20 chars.

Wiktor Stribiżew Over a year ago

@DominikScheld r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*" but you need to use the re.M flag with it. Here is a demo.
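
As an illustration of the pattern in that comment (a sketch added here, not the linked demo), the snippet below applies it with re.M to a shortened, made-up version of the sample text. Note that .{0,19} only accepts lines of up to 19 characters, so the match stops before the first line of 20 or more characters.

```python
import re

# Shortened, hypothetical version of the sample text from the question.
text = (
    "Some random text here\n"
    "\n"
    "Dear Shareholders We\n"
    "value the trust that you place in us.\n"
    "Best regards \n"
    "Douglas\n"
    "\n"
    "Random text here as well"
)

openings = "dear|to our|estimated"
closings = "sincerely|yours|best regards"

# After the closing line, keep consuming whole lines of at most 19 characters.
# re.M lets "$" match at the end of every line, so a longer line
# (here the 24-character last line) ends the match.
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*".format(openings, closings)

result = re.findall(regex, text, re.IGNORECASE | re.M)
print(result)
```

Unlike the {0,2} version, this one does not cap the number of trailing lines; it keeps going as long as each line is short enough.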
