Extract a string after a text with regex in Python

Question

I have a doc file that has the following structure:

This is a fairy tale written by

    John Doe and Mary Smith
    
    Auckland,somewhere
    
 This story is awesome

I would like to extract the two lines of text which are:

        John Doe and Mary Smith
        
        Auckland,somewhere

and append those values to a list using regex. The two lines that I want to extract are always between the lines This is a fairy tale written by and This story is awesome. How can I do that? I have tried some combinations with before_keyword, keyword, after_keyword = text.partition(regex), but with no luck.

@thisisjaymehta exactly, I want to extract those two lines that are between the other two strings
– Little, Commented Aug 21, 2020 at 4:10
No, I mean regardless of what is above and below the 2nd and 3rd lines. Just extract the 2nd and 3rd lines, without checking what is on the 1st and 4th lines. Will that work?
– thisisjaymehta, Commented Aug 21, 2020 at 4:11
You will want to have a look at the re (regex) standard library module, specifically re.search(): docs.python.org/3/library/re.html#re.search. Give that a read, and if you still have questions, please let us know.
– kerasbaz, Commented Aug 21, 2020 at 4:11
@thisisjaymehta not really, I want exactly the two lines between those strings
– Little, Commented Aug 21, 2020 at 4:17

alani · Accepted Answer · 2020-08-21 04:25:47Z

0

You can use a regex with re.DOTALL, which lets . match any character, including newlines. Once you have the text between the two delimiters, you can use a second regex without re.DOTALL to extract the lines that contain at least one non-whitespace character (\S).

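import re

lst = []

with open('input.txt') as f:
    text = f.read()

# re.DOTALL lets . span newlines; (.*?) lazily captures the text between the delimiters.
match = re.search(r'This is a fairy tale written by(.*?)This story is awesome',
                  text, re.DOTALL)

if match:
    # Without re.DOTALL, each match is a single line containing a non-whitespace character.
    lst.extend(re.findall(r'.*\S.*', match.group(1)))

print(lst)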

Gives:

['    John Doe and Mary Smith', '    Auckland,somewhere']
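If the leading indentation is not wanted, the same idea can be wrapped in a small helper that strips each line. This is only a minimal sketch: the function name extract_between and the inline sample text are assumptions made for illustration, not part of the answer above.

```python
import re

def extract_between(text, start, end):
    """Return the non-blank lines found between two marker strings, stripped."""
    # re.escape guards against regex metacharacters in the markers.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if not match:
        return []
    # Keep only lines with at least one non-whitespace character.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]

# Sample text assumed from the question.
sample = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

result = extract_between(sample,
                         'This is a fairy tale written by',
                         'This story is awesome')
print(result)
```

Returning an empty list when the markers are missing keeps the caller from having to check for None.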

edited Aug 21, 2020 at 4:25

answered Aug 21, 2020 at 4:20

alani


2 Comments

Little Over a year ago

thanks, but I don't know why I get the name and after that I get None

alani Over a year ago

What exactly do you get? re.findall should only return a list of strings, not None.

adrtam · Accepted Answer · 2020-08-21 04:18:47Z

0

You may start with this:

re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)', s, re.MULTILINE|re.DOTALL).group(0)

and fine-tune this regex. re.MULTILINE may be omitted, as the pattern contains no ^ or $ anyway, but re.DOTALL is required to let . match newlines as well. The regex above uses a lookbehind (?<=...) and a lookahead (?=...). If you do not like that, you can use parentheses to capture the text instead.
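The capture-group variant mentioned above might look like the following. It is a sketch under the assumption that s holds the sample text from the question; the variable names are illustrative.

```python
import re

# Sample text assumed from the question.
s = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# A capture group replaces the lookbehind/lookahead pair.
pattern = r'This is a fairy tale written by\n(.*?)\n\s*This story is awesome'
m = re.search(pattern, s, re.DOTALL)

# re.search returns None when nothing matches, so guard before calling .group().
between = m.group(1) if m else ''

# Split the captured text into lines and drop the blank ones.
lines = [line.strip() for line in between.splitlines() if line.strip()]
print(lines)
```

Checking the match object first avoids the AttributeError that calling .group() directly on a failed search would raise.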

answered Aug 21, 2020 at 4:18

adrtam


thisisjaymehta · Accepted Answer · 2020-08-21 04:31:45Z

If you can create a list of strings from your doc file, there is no need for a regex. This simple program does it:

fileContent = ['This is a fairy tale written by','John Doe and Mary Smith','Auckland,somewhere','This story is awesome',
               'Some other things', 'story texts', 'Not Important data',
               'This is a fairy tale written by','Kem Cho?','Majama?','This story is awesome', 'Not important data']
               
authorsList = []
for i in range(len(fileContent)-3):
    if fileContent[i] == 'This is a fairy tale written by' and fileContent[i+3] == 'This story is awesome':
        authorsList.append([fileContent[i+1], fileContent[i+2]])

print(authorsList)

Here I simply check for 'This is a fairy tale written by' and, three entries later, 'This story is awesome'; if both are found, the two entries between them are appended to the list.

Output:

[['John Doe and Mary Smith', 'Auckland,somewhere'], ['Kem Cho?', 'Majama?']]
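One way to build such a list from the raw text is to split it into lines and drop the blank ones. A minimal sketch, assuming the document has already been read into a string; the snake_case variable names and the sample text are illustrative, not taken from the answer.

```python
# Sample text assumed from the question.
raw_text = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# Strip whitespace from every line and discard lines that end up empty.
file_content = [line.strip() for line in raw_text.splitlines() if line.strip()]

authors_list = []
for i in range(len(file_content) - 3):
    # The two wanted lines sit between the opening and closing marker lines.
    if (file_content[i] == 'This is a fairy tale written by'
            and file_content[i + 3] == 'This story is awesome'):
        authors_list.append([file_content[i + 1], file_content[i + 2]])

print(authors_list)
```

Stripping each line first matters here, because the equality checks would otherwise fail on the indented closing line.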

Arundeep Chohan · Accepted Answer · 2020-08-21 05:13:30Z

0

Try using this instead. It should match anything between these two strings; re.DOTALL is needed so that . also matches the newlines in between.

re.search(r'(?<=This is a fairy tale written by).*?(?=This story is awesome)', text, re.DOTALL)
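Note that . does not match newline characters unless re.DOTALL is passed, so for text that spans several lines the flag makes the difference. A small check, using sample text assumed from the question and the full first line as the lookbehind:

```python
import re

# Sample text assumed from the question.
text = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

pattern = r'(?<=This is a fairy tale written by).*?(?=This story is awesome)'

# Without re.DOTALL, . stops at the first newline and the search fails.
without_flag = re.search(pattern, text)

# With re.DOTALL, . also matches newlines, so the lazy match can reach the end marker.
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
if with_flag:
    # Drop the blank lines and surrounding whitespace from the matched text.
    print([line.strip() for line in with_flag.group(0).splitlines() if line.strip()])
```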

answered Aug 21, 2020 at 5:13

Arundeep Chohan
