Find string between two substrings AND between string and the end of file

Question

I have the following problem: I want to extract specific strings from multiple text files that all follow a certain pattern. For example:

example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"

Each file is very different, but in every file I want the text between the words 'Pear' and 'Apple'. I have solved this with the following code:

x = re.findall(r'Pear+\s(.*?)Apple', example_file ,re.DOTALL)

which returns:

['this should be included1 ', 'this should be included2 ']

What I cannot figure out is how to also get the string at the end, the 'this should be included3' part. So I was wondering whether there is a way to specify something like this with a regex:

 x = re.findall(r'Pear+\s(.*?)Apple OR EOF', example_file ,re.DOTALL)

So how can I match everything between the word 'Pear' and EOF (end of file)? Note that these are all text files (so not specifically one sentence).

You probably want to match Pear\s+ rather than Pear+\s. That way you match 1 or more whitespace characters as opposed to 1 or more 'r' characters ;-)
– Martijn Pieters, Commented Jan 20, 2017 at 13:39
If you have a very large input, use r'Pear\s+([^A]*(?:A(?!pple)[^A]*)*)'
– Wiktor Stribiżew, Commented Jan 20, 2017 at 13:40
When given Pear and Pear and one Apple what should be returned?
– georg, Commented Jan 20, 2017 at 13:46
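The pattern suggested in the comment above for very large input can be tried on the question's own sample. This is only a minimal sketch: the name UNROLLED is made up for illustration, and no timing claim is made here. The character class [^A] also matches newlines, so re.DOTALL is not needed.

```python
import re

example_file = (
    "this is a test Pear this should be included1 Apple this should not be "
    "included Pear this should be included2 Apple again this should not be "
    "included Pear this should be included3"
)

# Unrolled alternative to the lazy (.*?): consume non-"A" characters in bulk,
# and accept an "A" only when it is not the start of "Apple".
UNROLLED = re.compile(r'Pear\s+([^A]*(?:A(?!pple)[^A]*)*)')

unrolled_matches = UNROLLED.findall(example_file)
print(unrolled_matches)
# ['this should be included1 ', 'this should be included2 ', 'this should be included3']
```

Because the capture simply stops before the next Apple or runs out of input, this form picks up the final segment without an explicit end-of-string alternative.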

Martijn Pieters · Accepted Answer · 2017-01-20 13:43:12Z

4

Select either Apple or $ (an anchor matching the end of the string):

x = re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)

| separates two alternatives, and (?:...) is a non-capturing group that limits the alternation, so that the regex engine knows to pick either Apple or $ at that point in the match.
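To see why the group matters, here is a small sketch comparing three variants of the pattern. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical, shortened sample input.
text = "Pear one Apple skip Pear two Apple skip Pear three"

# Without a group, "|" splits the WHOLE pattern: either "Pear...Apple" or a bare "$".
ungrouped = re.findall(r'Pear\s+(.*?)Apple|$', text, re.DOTALL)

# With a capturing group, findall returns one tuple per match (one item per group).
capturing = re.findall(r'Pear\s+(.*?)(Apple|$)', text, re.DOTALL)

# With a non-capturing group, findall returns only the text after Pear.
non_capturing = re.findall(r'Pear\s+(.*?)(?:Apple|$)', text, re.DOTALL)

print(ungrouped)      # ['one ', 'two ', '']
print(capturing)      # [('one ', 'Apple'), ('two ', 'Apple'), ('three', '')]
print(non_capturing)  # ['one ', 'two ', 'three']
```

In the ungrouped variant the last item is an empty string: the bare $ matched at the end of the input, and the capture group did not take part in that match.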

Note that I replaced Pear+\s with Pear\s+, as I suspect you want to match arbitrary whitespace, not an arbitrary number of r characters.
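The difference between the two spellings only shows up on input that the demo string does not contain. A quick sketch, using two made-up sample strings and illustrative variable names:

```python
import re

# "+" binds to the character just before it.
r_repeated = re.compile(r'Pear+\s(.*?)Apple', re.DOTALL)      # one or more "r", then ONE whitespace
space_repeated = re.compile(r'Pear\s+(.*?)Apple', re.DOTALL)  # "Pear", then one or more whitespace

double_space = "Pear  two spaces Apple"  # hypothetical input with two spaces after Pear
extra_r = "Pearrr text Apple"            # hypothetical input with repeated "r"

r_on_double = r_repeated.findall(double_space)
space_on_double = space_repeated.findall(double_space)
r_on_extra = r_repeated.findall(extra_r)
space_on_extra = space_repeated.findall(extra_r)

print(r_on_double)      # [' two spaces ']  (the second space leaks into the capture)
print(space_on_double)  # ['two spaces ']
print(r_on_extra)       # ['text ']         ("Pearrr" is accepted as the marker)
print(space_on_extra)   # []                ("Pearrr" is not "Pear" followed by whitespace)
```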

Demo:

>>> import re
>>> example_file = "this is a test Pear this should be included1 Apple this should not be included Pear this should be included2 Apple again this should not be included Pear this should be included3"
>>> re.findall(r'Pear\s+(.*?)(?:Apple|$)', example_file, re.DOTALL)
['this should be included1 ', 'this should be included2 ', 'this should be included3']
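Since the question is about several text files, the same pattern could be wrapped in a small helper. This is a sketch under assumptions: the function names, the UTF-8 default, and the choice to strip surrounding whitespace are mine and not part of the answer above.

```python
import re
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse across files.
PEAR_SEGMENT = re.compile(r'Pear\s+(.*?)(?:Apple|$)', re.DOTALL)


def extract_segments(text):
    """Return the text after each 'Pear', up to the next 'Apple' or the end."""
    # strip() is an assumption: it drops the space before 'Apple' and any
    # leftover newline near the end of the file.
    return [segment.strip() for segment in PEAR_SEGMENT.findall(text)]


def extract_from_files(paths, encoding="utf-8"):
    """Map each path to the segments found in that file (hypothetical helper)."""
    return {
        str(path): extract_segments(Path(path).read_text(encoding=encoding))
        for path in paths
    }
```

Without the re.MULTILINE flag, $ matches at the very end of the string or just before a single trailing newline, so a file ending in one newline still yields its last segment. If the match must end exactly at the end of the input, \Z could be used instead of $.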

edited Jan 20, 2017 at 13:43

answered Jan 20, 2017 at 13:37

Martijn Pieters


4 Comments

Wiktor Stribiżew Over a year ago

I think the Pear+\s should be written as Pear\s+

Martijn Pieters Over a year ago

@WiktorStribiżew: probably; it'll work for the demo input, but they probably meant to match one or more spaces, not one or more r characters ;-)

Niitaku Over a year ago

You may have missed a colon (:) after the question mark in your non-capturing group explanation.

Martijn Pieters Over a year ago

@Niitaku: ta, I did.
