0

My question is how to extract a particular paragraph (usually one in the middle) from a file using a regular expression in Python.

An example file is as follows:

poem = """The time will come
when, with elation,
you will greet yourself arriving
at your own door, in your own mirror,
and each will smile at the other's welcome,
and say, sit here. Eat.
You will love again the stranger who was your self.
Give wine. Give bread. Give back your heart
to itself, to the stranger who has loved you

all your life, whom you ignored
for another, who knows you by heart.
Take down the love letters from the bookshelf,

the photographs, the desperate notes,
peel your own image from the mirror.
Sit. Feast on your life."""

How can I extract the second paragraph (that is, "all your life ... the bookshelf,") of this poem using a regex in Python?

3
  • Just capture anything that's between \n\n. Commented Oct 4, 2017 at 5:08
  • I am struggling with the pattern of the second paragraph right now. NEED HELP! Commented Oct 4, 2017 at 5:08
  • @BurhanKhalid Could you provide me with the specific code to capture anything that's between two \n\n? Thank you so much Commented Oct 4, 2017 at 5:09

3 Answers

1

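Use a capturing group and try this:

import re

# Anchored to the paragraph's first and last words, so it only fits this poem.
# re.MULTILINE lets ^ match at a line start; re.DOTALL lets . cross newlines.
pattern = r'^(all.*bookshelf[,\s])'

second = re.search(pattern, poem, re.MULTILINE | re.DOTALL)
print(second.group(1))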
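
The pattern above is tied to the words "all" and "bookshelf". A more general variant, offered only as a sketch, captures whatever lies between the first two blank lines. It assumes paragraphs are separated by exactly \n\n, and the shortened poem below is a stand-in for the full text in the question.

```python
import re

# Shortened stand-in for the poem in the question (assumption: paragraphs
# are separated by a single blank line, i.e. "\n\n").
poem = """You will love again the stranger who was your self.
to itself, to the stranger who has loved you

all your life, whom you ignored
for another, who knows you by heart.
Take down the love letters from the bookshelf,

the photographs, the desperate notes,
Sit. Feast on your life."""

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first
# blank line that follows, so group(1) holds the second paragraph.
match = re.search(r"\n\n(.*?)\n\n", poem, re.DOTALL)
second = match.group(1) if match else None
print(second)
```

Checking the match for None before calling group avoids an AttributeError when the text has fewer than three paragraphs.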

0

Use a positive look-ahead and look-behind:

(?<=\n\n).+(?=\n\n)

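The (?<=\n\n) at the start is a look-behind: the pattern matches only where \n\n comes immediately before it.

The (?=\n\n) at the end is a look-ahead: the match must be immediately followed by \n\n. Neither assertion consumes characters, so the blank lines are not part of the result. In Python, also pass re.DOTALL so that . can match newlines; without it, .+ cannot span the paragraph's three lines.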

Try it out: https://regex101.com/r/7XnDjS/1
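
A sketch of how this pattern might be called from Python follows. The sample text and variable names are illustrative, not from the answer, and \n line endings are assumed. The lazy .+? is a small change from the answer's .+ so that the match stops at the first following blank line when the text has more than three paragraphs.

```python
import re

# Hypothetical sample: four paragraphs separated by blank lines.
text = (
    "first line\nstill first\n\n"
    "second line\nstill second\n\n"
    "third line\nstill third\n\n"
    "fourth"
)

pattern = r"(?<=\n\n).+?(?=\n\n)"

# Without re.DOTALL, "." cannot cross a newline, so no multi-line
# paragraph can satisfy both assertions and the result is None.
without_flag = re.search(pattern, text)

# With re.DOTALL, the lazy match ends at the first blank line after it.
with_flag = re.search(pattern, text, re.DOTALL)

# The greedy form from the answer runs on to the last blank line instead.
greedy = re.search(r"(?<=\n\n).+(?=\n\n)", text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(0)))
print(repr(greedy.group(0)))
```

On the three-paragraph poem the greedy and lazy forms give the same result, because only one paragraph has a blank line on both sides.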

5 Comments

Thank you for your help. I added your code like this: paragraph = re.match(r'(?<=\n\n).+(?=\n\n)', poem); print(paragraph). However, the result in the shell is None.
@hoperose You have to use search instead of match, because match only tries to match at the start of the string. Also, call group(0) on the return value to get the matched string.
Like this? paragraph = re.search(r'(?<=\n\n).+(?=\n\n)', poem); print(paragraph.group(0))
result = paragraph.group(0) raises AttributeError: 'NoneType' object has no attribute 'group'
It does work: repl.it/MD7v/0. One reason it might fail is that you are on Windows, where new lines are represented by \r\n, but I don't have a Windows PC, so I'm not sure. Try replacing each \n\n with \r\n\r\n. @hoperose
0

Note that some Windows text files end each line with \r\n instead of just \n. Python has excellent documentation on regular expressions; just google "python regexp". You could even google "perl regexp", since Python's regex syntax is modeled on Perl's ;-)

One way to get just the second paragraph is to use () to capture the text between two runs of line endings, like this:

import re
myPattern = re.compile(r'[^\r\n]+\r?\n\r?\n+([^\r\n]+)\r?\n\r?\n.*')

and then use it like this:

secondPara = myPattern.sub("\\1", content)

Here's my script in action:

schumack@linux2 137> ./poem2.py
secondPara: all your life, whom you ignored for another, who knows you by heart. Take down the love letters from the bookshelf,
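
Note that [^\r\n]+ cannot span a line break, so the pattern above only matches when each paragraph sits on a single line; when nothing matches, sub returns its input unchanged. As an alternative sketch (not the method used in this answer), the text can be split on runs of line endings, which tolerates both \n and \r\n. The helper name nth_paragraph and the sample text are hypothetical.

```python
import re


def nth_paragraph(text, n):
    """Return the n-th paragraph (1-based), or None if there are fewer."""
    # A paragraph break is two or more consecutive line endings,
    # each of which may be "\n" or "\r\n".
    paragraphs = re.split(r"(?:\r?\n){2,}", text.strip())
    return paragraphs[n - 1] if 1 <= n <= len(paragraphs) else None


# Hypothetical sample text in Unix and Windows line-ending flavours.
unix_text = "one\nuno\n\ntwo\ndos\n\nthree\ntres"
windows_text = unix_text.replace("\n", "\r\n")

print(nth_paragraph(unix_text, 2))
print(nth_paragraph(windows_text, 2).splitlines())
```

Splitting keeps each paragraph's internal line breaks as they were, so a paragraph taken from a Windows file still contains \r\n between its lines.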

1 Comment

Thank you, @Ken Schumack. However, when I run it, I get the whole content back. I don't know why.
