I am a Python beginner and would be very thankful if you could help me with my text extraction problem.
I want to extract all text, which lies between two expressions in a textfile (the beginning and end of a letter). For both, the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a bunch of files, find below an example of how such a textfile looks like -> I want to extract all text starting from "Dear" till "Douglas". In cases where the "letter_end" has no match, i.e. no letter_end expression is found, the output should start from the letter_beginning and end at the very end of the text file to be analyzed.
Edit: the end of "the recorded text" has to be after the match of "letter_end" and before the first line with 20 characters or more (as is the case for "Random text here as well" -> len=24.
"""Some random text here
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
This is my code so far - but it is not able to flexible catch the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end")
import re
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"
with open(filename, 'r', encoding="utf-8") as infile:
text = infile.read()
text = str(text)
output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
print (output)
I am very thankful for every help!
Douglas. The,\n\S+would prevent the regex from matching even if you add it to theletter_end. Maybe all you want isregex = r"(?:" + openings + r").*?" + r"(?:" + closings + r")"?Douglas.re.DOTALLand use regex101.com/r/PmU3Ti/2, i.e.regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}". You do not needre.MULTILINEeither, BTW.