I have a doc file that has the following structure:

This is a fairy tale written by

    John Doe and Mary Smith
    
    Auckland,somewhere
    
 This story is awesome

I would like to extract the two lines of text which are:

        John Doe and Mary Smith
        
        Auckland,somewhere

and append those values to a list using regex. The two lines that I want to extract are always between the lines This is a fairy tale written by and This story is awesome. How can I do that? I have tried some combinations with before_keyword, keyword, after_keyword = text.partition(regex), but no luck at all.

  • Will extracting 2nd and 3rd lines work for every scenario? Commented Aug 21, 2020 at 4:09
  • @thisisjaymehta exactly, I want to extract those two lines that are between the other two strings Commented Aug 21, 2020 at 4:10
  • No, I mean regardless of what is above and below the 2nd and 3rd lines. Just extract the 2nd and 3rd lines, without checking what is on the 1st and 4th lines. Will that work? Commented Aug 21, 2020 at 4:11
  • You will want to have a look at the re (regex) standard library. Specifically re.search(): docs.python.org/3/library/re.html#re.search Give that a read and if you still have questions, please advise. Commented Aug 21, 2020 at 4:11
  • @thisisjaymehta not really, I want exactly the two lines between those strings Commented Aug 21, 2020 at 4:17

4 Answers

You can use a regex with re.DOTALL, which lets . match any character, including newlines. Once you have the text between the two delimiters, a second regex without re.DOTALL extracts the lines that contain at least one non-whitespace character (\S).

import re

lst = []

with open('input.txt') as f:
    text = f.read()

match = re.search(r'This is a fairy tale written by(.*?)This story is awesome',
                  text, re.DOTALL)

if match:
    # Keep only lines with at least one non-whitespace character.
    lst.extend(re.findall(r'.*\S.*', match.group(1)))

print(lst)

Gives:

['    John Doe and Mary Smith', '    Auckland,somewhere']
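
If the leading indentation in the result is unwanted, each line can be stripped after the match. The following is only a sketch of that idea: extract_between is a hypothetical helper name, and the sample string stands in for the contents of the file.

```python
import re

def extract_between(text, start, end):
    """Return the stripped, non-blank lines found between start and end."""
    # re.escape guards against regex metacharacters in the delimiters.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]

# Sample text standing in for the file contents (an assumption).
sample = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

result = extract_between(sample,
                         "This is a fairy tale written by",
                         "This story is awesome")
print(result)
```

Returning an empty list when nothing matches keeps the caller from having to check for None.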

2 Comments

Thanks, but I don't know why I get the name and after that I get None
What exactly do you get? re.findall should only return a list of strings, not None.

You may start with this:

re.search(r'(?<=This is a fairy tale written by\n).*?(?=\n\s*This story is awesome)', s, re.MULTILINE|re.DOTALL).group(0)

and fine-tune this regex. re.MULTILINE may be omitted as you do not have ^ or $ anyway, but re.DOTALL is required to let . match newlines as well. The regex above uses a lookbehind (?<=...) and a lookahead (?=...). If you do not like that, you can use a capturing group in parentheses instead.
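
A capturing-group version might look like the sketch below. The sample string s is an assumption standing in for the file contents, and the match is checked before .group() is called, because re.search returns None when the delimiters are missing.

```python
import re

# Sample text standing in for the file contents (an assumption).
s = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# The parentheses capture the text between the two delimiter lines.
pattern = r'This is a fairy tale written by\n(.*?)\n\s*This story is awesome'
match = re.search(pattern, s, re.DOTALL)

# Guard against a missing match instead of calling .group() on None.
between = match.group(1) if match else ''

# Drop blank lines and surrounding whitespace.
lines = [line.strip() for line in between.splitlines() if line.strip()]
print(lines)
```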

If you can create a list of strings from your doc file, there is no need for a regex. This simple program is enough:

fileContent = ['This is a fairy tale written by','John Doe and Mary Smith','Auckland,somewhere','This story is awesome',
               'Some other things', 'story texts', 'Not Important data',
               'This is a fairy tale written by','Kem Cho?','Majama?','This story is awesome', 'Not important data']
               
authorsList = []
for i in range(len(fileContent)-3):
    if fileContent[i] == 'This is a fairy tale written by' and fileContent[i+3] == 'This story is awesome':
        authorsList.append([fileContent[i+1], fileContent[i+2]])

print(authorsList)

Here I simply check for 'This is a fairy tale written by' and, three lines later, 'This story is awesome'. If both are found, the two lines between them are appended to the list.

Output:

[['John Doe and Mary Smith', 'Auckland,somewhere'], ['Kem Cho?', 'Majama?']]
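
To get such a list from the raw text, and to avoid relying on exactly two lines between the markers, something like the following sketch could be used. blocks_between is a hypothetical helper, and the text string stands in for the file contents; blank lines are dropped before the search.

```python
def blocks_between(lines, start, end):
    """Collect every run of lines that sits between a start and an end marker."""
    blocks = []
    current = None  # None means we are outside a block
    for line in lines:
        if line == start:
            current = []
        elif line == end and current is not None:
            blocks.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return blocks

# Sample text standing in for the file contents (an assumption).
text = (
    "This is a fairy tale written by\n"
    "\n"
    "    John Doe and Mary Smith\n"
    "    \n"
    "    Auckland,somewhere\n"
    "    \n"
    " This story is awesome\n"
)

# Strip whitespace and drop blank lines so the markers compare exactly.
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
authors = blocks_between(lines,
                         'This is a fairy tale written by',
                         'This story is awesome')
print(authors)
```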

Try this instead. It matches everything between the two strings; re.DOTALL is required so that . also matches the newlines between them.

re.search(r'(?<=This is a fairy tale written by).*?(?=This story is awesome)', text, re.DOTALL)
