
Team,

I want to extract the lines that start with the string tg_ from a file. The regex below produces output, but I have two questions:

  1. I am not sure how to extract a record that runs over two or more lines, where each continued line ends with \, as in the sample below.

  2. I don't know how to remove the special characters with the existing regex below.

*****from a file*******

tg_cr_counters dghbvcvgfv

tg_kk_bb a group1 bye bye bye hi hi hi 1 \ <<<<
patch mac hdfh f dgf asadasf \
dgfgmnhnjgfg

tg_cr_counters gthghtrhgh }} ] <<<<<

tg_cr_counters fkgnfkmngvd

import re

file = open("C:\\Users\\input.tcl", "r")
f1 = file.readlines()

output = open("extract.txt", "a+")

match_list = [ ]   

for item in f1:

    match_list = re.findall(r'[t][g][_]+\w+.*', item)
    if(len(match_list)>0):
        output.write(match_list[0]+"\r\n")
        print(match_list)
  • Can we assume that when there is a single newline, then the line is continuing, and that if there are two consecutive newlines it is not? I'm not clear on what you want to extract. Commented Nov 28, 2018 at 17:07
  • What is the end condition of a multiline match? Commented Nov 28, 2018 at 17:12
  • @JETM The record I want to grep spans multiple lines, each ending with "\". When I use the regex, it extracts only the first line ending with \, not the second and third lines. Commented Nov 30, 2018 at 5:45

1 Answer

You can use a regex with the flags re.MULTILINE and re.DOTALL.

With re.DOTALL, a . also matches \n, and with re.MULTILINE, ^ matches at the start of every line. You can then look for anything that starts with tg_ (no need to put each character in []) and ends with a blank line (\n\n) or the end of the text (\Z):

import re

fn = "t.txt"
with open(fn, "w") as f:
    # Raw string (r"""..."""): every backslash is kept literally. Without the
    # r prefix, a backslash directly before a newline would silently join the
    # two lines, so the test file would not contain the continuation lines.
    f.write(r"""*****from a file*******

tg_cr_counters dghbvcvgfv

tg_kk_bb a group1 bye bye bye hi hi hi 1 \ <<<<
patch mac hdfh f dgf asadasf \
dgfgmnhnjgfg

tg_cr_counters gthghtrhgh }} ] <<<<<

tg_cr_counters fkgnfkmngvd
""")

with open("extract.txt", "a+") as o, open(fn) as f:
    # re.M: ^ matches at each line start; re.S: . also matches newlines.
    for m in re.findall(r'^tg_.*?(?:\n\n|\Z)', f.read(), flags=re.M | re.S):
        # Write "\n", not "\r\n": text mode already translates the newline
        # for the platform, so "\r\n" would become "\r\r\n" on Windows.
        o.write("-" * 40 + "\n")
        o.write(m)
        o.write("-" * 40 + "\n")

with open("extract.txt") as f:
    print(f.read())

Output (each match is between a line of ----------------------------------------):

----------------------------------------
tg_cr_counters dghbvcvgfv

----------------------------------------
----------------------------------------
tg_kk_bb a group1 bye bye bye hi hi hi 1 \ <<<<
patch mac hdfh f dgf asadasf \
dgfgmnhnjgfg

----------------------------------------
----------------------------------------
tg_cr_counters gthghtrhgh }} ] <<<<<

----------------------------------------
----------------------------------------
tg_cr_counters fkgnfkmngvd
----------------------------------------

The re.findall() result looks like:

['tg_cr_counters dghbvcvgfv\n\n', 
 'tg_kk_bb a group1 bye bye bye hi hi hi 1 \\ <<<<\npatch mac hdfh f dgf asadasf \\\ndgfgmnhnjgfg\n\n', 
 'tg_cr_counters gthghtrhgh }} ] <<<<<\n\n', 
 'tg_cr_counters fkgnfkmngvd\n']
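
The second part of the question (removing special characters) can be handled as a post-processing step on each match. The following is only a sketch: clean_match is a hypothetical helper name, and it assumes that "special characters" means anything that is neither a word character nor whitespace, and that a backslash at the end of a line (possibly followed by symbols such as <<<<) marks a continuation. Adjust the character class if you need to keep other symbols.

```python
import re

def clean_match(block: str) -> str:
    """Join backslash-continued lines and drop non-word symbols (a sketch)."""
    # A backslash, optionally followed by non-word clutter such as " <<<<",
    # right before a newline is treated as a line continuation.
    joined = re.sub(r'\\[^\w\n]*\n', ' ', block)
    # Assumption: "special characters" = not a word character, not whitespace.
    stripped = re.sub(r'[^\w\s]', '', joined)
    # Collapse whitespace runs (including newlines) and trim both ends.
    return re.sub(r'\s+', ' ', stripped).strip()

print(clean_match('tg_cr_counters gthghtrhgh }} ] <<<<<\n\n'))
# tg_cr_counters gthghtrhgh
print(clean_match('tg_kk_bb a 1 \\ <<<<\npatch mac \\\ndgfgmnhnjgfg\n\n'))
# tg_kk_bb a 1 patch mac dgfgmnhnjgfg
```

Note that the underscore counts as a word character, so the tg_ prefix survives the clean-up.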

To enable multiline searches you need to read in more than one line at a time. If your file is huge, reading all of it with f.read() can lead to memory problems.
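
If memory is a concern, one possible alternative is to walk the file line by line and collect the blocks yourself instead of using a regex on the whole text. The sketch below applies the same rule as the regex (a block starts at a line beginning with tg_ and ends at a blank line or at the end of the input). iter_tg_blocks is a hypothetical helper name, and unlike the regex it treats whitespace-only lines as blank and does not include the terminating blank line in the block.

```python
import io
from typing import Iterable, Iterator

def iter_tg_blocks(lines: Iterable[str], prefix: str = "tg_") -> Iterator[str]:
    """Yield each block that starts with `prefix` and ends at a blank line."""
    block = []  # lines of the block currently being collected
    for line in lines:
        if block:
            if line.strip() == "":
                # A blank line closes the current block.
                yield "".join(block)
                block = []
            else:
                block.append(line)
        elif line.startswith(prefix):
            # Start of a new block; other lines outside a block are skipped.
            block = [line]
    if block:
        # End of input also closes an open block.
        yield "".join(block)

# In-memory stand-in for the file; with a real file you would pass the open
# file object instead, which is read lazily one line at a time.
SAMPLE = r"""*****from a file*******

tg_cr_counters dghbvcvgfv

tg_kk_bb a group1 bye bye bye hi hi hi 1 \ <<<<
patch mac hdfh f dgf asadasf \
dgfgmnhnjgfg

tg_cr_counters fkgnfkmngvd
"""

for block in iter_tg_blocks(io.StringIO(SAMPLE)):
    print(repr(block))
```

Only one block is held in memory at a time, so the size of the file no longer matters, only the size of the largest block.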

2 Comments

Thanks, but you are trying to extract from the output. The task is to extract from the input file, where the multiline records starting with tg_ are present.
@CharlesDaniel I am creating a file that holds all the lines at the top of the code. Then I extract from the whole file (f.read()) the blocks you see between the -------------------- lines; they are the multiline matches. I don't quite get what you mean. t.txt holds 4 blocks starting with tg_ and a leading line '*****from a file*******' that is not captured because it does not start with tg_.
