
I have the following problem. I want to extract specific strings from multiple text files, and the text files share a certain pattern. For example:

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"

Each file is very different, but in every file I want the text between the words 'Pear' and 'Apple'. I have solved this with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)

which returns:

['this should be included1 ', 'this should be included2 ']

What I cannot figure out is how to also get the string at the end, the 'this should be included3' part. So I was wondering if there is a way to specify with regex something like

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)

So how can I match something between the word 'Pear' and EOF (end of file)? Note that these are all text files (so not specifically one sentence).

  • You probably want to match Pear\s+ rather than Pear+\s. That way you match 1 or more whitespace characters as opposed to 1 or more 'r' characters ;-) Commented Jan 20, 2017 at 13:39
  • If you have a very large input, use r'Pear\s+([^A]*(?:A(?!pple)[^A]*)*)' Commented Jan 20, 2017 at 13:40
  • When given Pear and Pear and one Apple what should be returned? Commented Jan 20, 2017 at 13:46
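
The unrolled pattern from the second comment can be checked against the question's sample. This is only a rough sketch: the variable names are illustrative, and the last call shows what this particular pattern returns for the "two Pears, one Apple" case raised in the third comment, which may or may not be what the asker wants.

```python
import re

example_file = ("this is a test Pear this should be included1 Apple "
                "this should not be included Pear this should be included2 "
                "Apple again this should not be included "
                "Pear this should be included3")

# Unrolled alternative to a lazy .*?: consume characters other than "A",
# plus any "A" that does not start the word "Apple".
unrolled = re.compile(r'Pear\s+([^A]*(?:A(?!pple)[^A]*)*)')

matches = unrolled.findall(example_file)
print(matches)

# Two "Pear"s before a single "Apple": the match starts at the first "Pear".
nested = unrolled.findall("Pear a Pear b Apple c")
print(nested)
```

Because [^A] also matches newlines, this variant does not need re.DOTALL, and it reaches the end of the input on its own when no further Apple follows.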

1 Answer


Select either Apple or $ (an anchor matching the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)

| separates two alternatives, and (?:...) is a non-capturing group that limits the scope of the alternation, so that the regex engine knows to pick either Apple or $ at that point in the match.

Note that I replaced Pear+\s with Pear\s+, as I suspect you want to match arbitrary whitespace, not an arbitrary number of r characters.
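
To make the difference concrete, here is a small sketch using made-up input strings (they are not from the question) that shows what each quantifier placement accepts:

```python
import re

# In Pear+\s the + applies to the "r": one or more "r", then exactly one whitespace character.
r_repeated = re.findall(r'Pear+\s', "Pearrr x")

# In Pear\s+ the + applies to the whitespace: "Pear", then one or more whitespace characters.
ws_repeated = re.findall(r'Pear\s+', "Pear \n\t x")

# With several whitespace characters after "Pear", Pear+\s consumes only the first one,
# so the rest ends up inside the captured group.
leading_ws_kept = re.findall(r'Pear+\s(.*?)Apple', "Pear   kept Apple")
leading_ws_dropped = re.findall(r'Pear\s+(.*?)Apple', "Pear   kept Apple")

print(r_repeated, ws_repeated, leading_ws_kept, leading_ws_dropped)
```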

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
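
Since the question is about whole text files rather than a single sentence, the same pattern could be applied to a file's contents roughly as follows. This is a minimal sketch: extract_sections and extract_from_file are hypothetical helper names, and it assumes each file is small enough to read into memory as one string.

```python
import re
from pathlib import Path

# Same pattern as in the answer; re.DOTALL lets . match newlines too.
PATTERN = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)

def extract_sections(text):
    """Return every stretch of text after 'Pear' up to the next 'Apple' or the end."""
    return PATTERN.findall(text)

def extract_from_file(path):
    # Hypothetical helper: read the whole file into one string, then apply the pattern.
    return extract_sections(Path(path).read_text())

# Multi-line input: the captured text can span line breaks.
multiline = "Pear first\nline Apple skipped\nPear last\nsection"
sections = extract_sections(multiline)
print(sections)

# Input ending in a newline, as most text files do.
trailing = extract_sections("Pear last\n")
print(trailing)
```

One detail to be aware of: in Python, $ also matches just before a newline at the very end of the string, so for a file that ends with a newline the final captured section stops before that newline.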

4 Comments

I think the Pear+\s should be written as Pear\s+
@WiktorStribiżew: probably; it'll work for the demo input, but they probably meant to match one or more spaces, not one or more r characters ;-)
You may have missed a colon (:) after the question mark in your non-capturing group explanation.
@Niitaku: ta, I did.
