I am using Python to extract the lines that lie between two marker strings. The lines come from a list that is itself generated dynamically by a separate Python function. The list looks like this:-

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

The output I want is similar to this:-

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output

As you can see, I want to extract the lines that start with line1 and end with line3 (up to the end of that line). The final output should include the matching words themselves (i.e. line1 and line3).

The code I have tried is:-

import re

# Convert list to string first
list_to_str = '\n'.join(sample_list)
# Get desired output
print(re.findall('\nline1(.*?)\nline2(.*?)\nline3($)', list_to_str, re.DOTALL))

This is the output I get:-

[]
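
A likely explanation for the empty list, sketched here on a made-up, shortened string named `text` rather than on the real data: without `re.MULTILINE`, `$` matches only at the very end of the string (or just before a trailing newline), and the attempted pattern also leaves no room for any text between `line3` and `$`. The snippet below only illustrates the `$` behaviour.

```python
import re

# Hypothetical, shortened input in the same shape as list_to_str.
text = "line1 a\nline2 b\nline3 c\nline1 d"

# Without re.MULTILINE, `$` only matches at the end of the whole string,
# so a "line3 ..." line in the middle of the text is never found.
no_flag = re.findall(r"line3.*$", text)

# With re.MULTILINE, `$` also matches just before each newline.
with_flag = re.findall(r"line3.*$", text, re.MULTILINE)

print(no_flag)    # []
print(with_flag)  # ['line3 c']
```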

Any help is appreciated.

Edit1:- I have done some work and found this nearest solution:-

matches = (re.findall(r"^line1(.*)\nline2(.*)\nline3(.*)$", list_to_str, re.MULTILINE))

for match in matches:
    print('\n'.join(match))

It gives me this output:-

 this line is the first line
 this line is second line to be included in output
 this should also be included in output
 this may contain other strings as well
 this line is second line to be included in output
 this should also be included in output

The output is almost correct, but it does not include the matched text itself (line1, line2 and line3).

  • You should just iterate over the list and check if each value .startswith('line1'), or 'line2', etc. Commented Mar 30, 2017 at 17:50
  • Correct. But you can't capture 'line1', 'line2' and 'line3' in one go. Commented Mar 30, 2017 at 17:53
  • By 'the match text', if you mean that findall() does not include group 0 in the output array, just add a capture group around the whole regex: (<your regex>). Example: (^line1(.*)\nline2(.*)\nline3(.*)$) Commented Mar 30, 2017 at 18:47
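
The suggestion in the last comment can be sketched as follows. This is an illustration only: `sample_list` is the list from the question, while `pattern` and `blocks` are names chosen here. With an outer group around the whole regex, each tuple returned by `findall()` starts with the complete three-line block.

```python
import re

sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...']
list_to_str = '\n'.join(sample_list)

# Outer group = whole block; inner groups = text after each prefix.
pattern = re.compile(r"(^line1(.*)\nline2(.*)\nline3(.*)$)", re.MULTILINE)

# findall() returns one tuple per match; element 0 is the outer group.
blocks = [groups[0] for groups in pattern.findall(list_to_str)]
print('\n'.join(blocks))
```

If only the full block is needed, the inner groups could be dropped, in which case `findall()` returns plain strings instead of tuples.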

2 Answers

2

If you're looking for a sequence of line1, line2 and line3 with no duplicates, this regex does it:

line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*

Explained

 line1 .* \s*             # line 1 plus newline(s)
 (?! \s | line [13] )     # Next cannot be line 1 or 3 (or whitespace)
 line2 .* \s*             # line 2 plus newline(s)
 (?! \s | line [12] )     # Next cannot be line 1 or 2 (or whitespace)
 line3 .*                 # line 3 

If you want to capture the line content, just put capture groups around (.*)
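
As a rough sketch of how this regex might be used from Python: the code below assumes the joined string from the question and default flags (in particular no `re.DOTALL`, so `.*` stays on one line). The names `pattern` and `matches` are chosen here for illustration. Because the pattern has no capture groups, `findall()` returns each whole match.

```python
import re

sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...']
list_to_str = '\n'.join(sample_list)

# No re.DOTALL on purpose: `.*` must stop at the end of each line.
pattern = re.compile(
    r"line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*"
)

# No capture groups, so each element is the full three-line match.
matches = pattern.findall(list_to_str)
print('\n'.join(matches))
```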

3 Comments

Your example doesn't seem to work: it matches all the lines. The closest one I have got is posted in the edit section of the original post.
Read the last line: "If you want to capture the line content, just put capture groups around (.*)". To me, it was more important to show the assertion without the capture group clutter.
You are correct. I added your regex to the edited code in OP and it works now. Thank you.
1

This may not be the most elegant way (you may want to use regular expressions), but it does output what you want:

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
output = []
# Lines collected for the block currently being assembled.
line1 = ""
line2 = ""
line3 = ""
# Prefix of the last line accepted into the current block ("" = none yet).
prevStart = ""
for text in sample_list:
    if prevStart == "":
        if text.startswith("line1"):
            prevStart = "line1"
            line1 = text
    elif prevStart == "line1":
        if text.startswith("line2"):
            prevStart = "line2"
            line2 = text
        elif text.startswith("line1"):
            # A repeated line1 replaces the previous candidate.
            line1 = text
        else:
            prevStart = ""
    elif prevStart == "line2":
        if text.startswith("line3"):
            prevStart = ""
            line3 = text
        elif text.startswith("line1"):
            # A line1 right after line2 starts a new candidate block.
            prevStart = "line1"
            line1 = text
        else:
            prevStart = ""
    # Once all three lines are present, emit the block and reset.
    if line1 != "" and line2 != "" and line3 != "":
        output.extend([line1, line2, line3])
        line1 = line2 = line3 = ""

for line in output:
    print(line)

Output for this code is:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
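
A more compact sketch of the same idea, assuming a block only counts when the three lines sit directly next to each other in the list. The helper name `extract_blocks` and its `prefixes` parameter are made up here for illustration.

```python
sample_list = ['line1 this line a first line',
               'line1 this line is also considered as line one...',
               'line1 this line is the first line',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 this contain other strings',
               'line1 this may contain other strings as well',
               'line2 this line is second line to be included in output',
               'line3 this should also be included in output',
               'line1 what the heck is it...']


def extract_blocks(lines, prefixes=("line1", "line2", "line3")):
    """Return every run of three adjacent lines matching the prefixes in order."""
    output = []
    # Slide a window of three consecutive lines over the list.
    for window in zip(lines, lines[1:], lines[2:]):
        if all(line.startswith(p) for line, p in zip(window, prefixes)):
            output.extend(window)
    return output


for line in extract_blocks(sample_list):
    print(line)
```

Since the three prefixes differ, two matching windows can never overlap, so no extra bookkeeping is needed to skip ahead after a match.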
