I am a Python beginner and would be very thankful if you could help me with my text extraction problem.

I want to extract all text that lies between two expressions in a text file (the beginning and end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to do this for a bunch of files; below is an example of what such a text file looks like. Here I want to extract all text starting from "Dear" up to "Douglas". If no "letter_end" expression is found, the output should start at the "letter_begin" match and run to the very end of the text file.

Edit: the end of the recorded text has to come after the "letter_end" match and before the first line with 20 characters or more (as is the case for "Random text here as well", where len=24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

This is my code so far, but it cannot flexibly capture the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end"):

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()  # read() already returns a str
    # record all text between the beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
    print(output)

I am very thankful for any help!

  • You say I want to extract all text starting from "Dear" till "Douglas", but your regex has no Douglas. The ,\n\S+ would prevent the regex from matching even if you add it to the letter_end. Maybe all you want is regex = r"(?:" + openings + r").*?" + r"(?:" + closings + r")"? Commented Nov 6, 2018 at 10:05
  • @WiktorStribiżew: Many thanks for your help - this already looks quite good! Do you have any idea how to also get the next 5 words after the defined "letter_end"? (So I can get whatever name is after the closing expression?) Commented Nov 6, 2018 at 10:18
  • How do you define "word"? What chars can there be between them? Look here, if you match 5 words, you might get more than just Douglas. Commented Nov 6, 2018 at 10:21
  • Okay, I see the problem. Is there a way to tell the regex to get the "next 2 lines" after "letter_end", as the "other random text" will only start at least 3 lines from the letter_end? -> r"(?:" + openings + r").*?" + r"(?:" + closings + [\Line+\Line+){0,2} r")" ? Commented Nov 6, 2018 at 10:25
  • Remove re.DOTALL and use regex101.com/r/PmU3Ti/2, i.e. regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}". You do not need re.MULTILINE either, BTW. Commented Nov 6, 2018 at 10:26

1 Answer

You may use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern will result in a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

See the regex demo. Note you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant.

Details

  • (?:dear|to our|estimated) - any of the three values
  • [\s\S]*? - any 0+ chars, as few as possible
  • (?:sincerely|yours|best regards) - any of the three values
  • .* - any 0+ chars other than newline
  • (?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ chars other than newline.
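As a rough illustration of why re.DOTALL should be left off (the sample text and the shortened one-entry pattern below are made up for this sketch, not taken from the question): without the flag, .* stops at the end of the line, so the {0,2} limit on trailing lines holds; with the flag, .* runs on to the end of the text.

```python
import re

# Made-up sample text, only for this illustration
text = "intro\nDear all\nbody line\nBest regards\nDouglas\n\nfooter one\nfooter two"

# Same shape as the pattern above, reduced to one opening and one closing
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without re.DOTALL, "." does not cross newlines: at most two extra lines are kept
without_dotall = re.search(pattern, text, re.IGNORECASE).group()

# With re.DOTALL, the greedy ".*" consumes everything up to the end of the text
with_dotall = re.search(pattern, text, re.IGNORECASE | re.DOTALL).group()

print(repr(without_dotall))
print(repr(with_dotall))
```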

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
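The question also asks that, when no "letter_end" expression is found, the output should run from the "letter_begin" match to the end of the file. The pattern above returns nothing in that case. One possible way to cover it is sketched below; the helper name extract_letter is hypothetical, and the use of re.escape is an added precaution rather than part of the pattern above.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]


def extract_letter(text):
    """Hypothetical helper: return the letter, or None if no opening is found."""
    # re.escape is an assumption: it protects phrases containing regex metacharacters
    openings = "|".join(map(re.escape, letter_begin))
    closings = "|".join(map(re.escape, letter_end))

    start = re.search(openings, text, re.IGNORECASE)
    if start is None:
        return None  # no letter at all

    tail = text[start.start():]
    regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
    full = re.match(regex, tail, re.IGNORECASE)

    # Fall back to everything from the opening to the end of the text
    return full.group() if full else tail


with_closing = extract_letter("x\nDear A\nhello\nSincerely\nBob\n\nlong trailing text")
no_closing = extract_letter("junk\nDear all\nno closing here\nlast line")
no_opening = extract_letter("nothing relevant in this text")
print(repr(with_closing), repr(no_closing), no_opening)
```

Returning None when there is no opening at all is a design choice of this sketch; the question does not say what should happen in that case.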

2 Comments

Many thanks Wiktor! I need one last edit to the regex: the output text should stop before the first line after the "letter_end" match that has more than 20 chars. In the above example it would generate the same output, as len("Random text here as well") = 24. The condition to meet at the end of the regex: stop at the line after the "letter_end" match where the line contains > 20 chars.
@DominikScheld Use r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*", but you need to use the re.M flag with it.
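A minimal sketch of how the pattern from this last comment behaves, using a made-up sample text (not the one from the question). With re.M, $ matches at the end of every line, so the trailing group keeps consuming lines of at most 19 characters and stops at the first line with 20 or more.

```python
import re

# Made-up sample: three short lines and a blank line follow the closing phrase
text = (
    "header\n"
    "Dear all\n"
    "body\n"
    "Best regards\n"
    "Douglas\n"
    "CEO\n"
    "Fund Inc.\n"
    "\n"
    "This trailing line has well over twenty characters"
)

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# (?:\n.{0,19}$)* keeps taking whole lines that are 0 to 19 characters long
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*".format(openings, closings)

result = re.search(regex, text, re.IGNORECASE | re.M).group()
print(repr(result))
```

Unlike the {0,2} version, the number of short lines kept after the closing phrase is not limited here; only the line length decides where the match ends.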
