
I have a very large text file and I want to filter out some of its lines. Each block starts with an identifier line, which is followed by many lines that each hold a single number, as in this example:

fixedStep ch=GL000219.1 start=52818 step=1
1.000000
1.000000
1.000000
1.000000
1.000000
1.000000
1.000000
fixedStep ch=GL000320.1 start=52959 step=1
1.000000
1.000000
1.000000
fixedStep ch=M start=52959 step=1
1.000000
1.000000

This line is an identifier: fixedStep ch=GL000219.1 start=52818 step=1. I want to remove every identifier line containing ch=GL000219.1 or ch=GL000320.1, together with the number lines that follow it, and keep all other identifiers with the number lines below them. Each identifier is repeated multiple times in the file. The expected output looks like this:

fixedStep ch=M start=52959 step=1
1.000000
1.000000

I have tried this code:

l = ["ch=GL000219.1", "ch=GL000320.1"] # since I have more identifiers that should be removed 

with open('file.txt', 'r') as f:
    with open('outfile.txt', 'w') as outfile:
        good_data = True
        for line in f:
            if line.startswith('fixedStep'):
                for i in l:
                    good_data = i not in line
            if good_data:
                outfile.write(line)

My code does not return what I want. Do you know how to modify it?

5
  • Add a break under good_data = i not in line if it ever becomes False. good_data can take multiple values for a single line because it's overwriting itself, so it only has to be True for the last value of i Commented Jul 26, 2017 at 12:33
  • Also, good_data needs to reset for every line, no? Commented Jul 26, 2017 at 12:35
  • I tried, but it does not make a difference. Commented Jul 26, 2017 at 12:39
  • There are a few changes you need to make, if I understand your question correctly. What did you try? Commented Jul 26, 2017 at 12:39
  • If I skip the list and try the identifiers one by one, it works perfectly for one of them each time, but doing that for all of them takes a lot of time. I would like to handle all identifiers at once. Commented Jul 26, 2017 at 12:42
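
The suggestion in the first comment (stop checking as soon as one identifier matches) can be sketched as follows. This is only a minimal sketch that keeps the shape of the question's loop; the function name filter_blocks and the in-memory sample list are assumptions introduced here so the logic can be run without any files.

```python
def filter_blocks(lines, unwanted):
    """Yield the lines whose block is not headed by an unwanted identifier."""
    good_data = True
    for line in lines:
        if line.startswith('fixedStep'):
            # Re-evaluate the flag on every identifier line.
            good_data = True
            for i in unwanted:
                if i in line:
                    good_data = False
                    break  # one match is enough; do not let later entries overwrite it
        if good_data:
            yield line


l = ["ch=GL000219.1", "ch=GL000320.1"]
# Hypothetical in-memory stand-in for the lines of file.txt.
sample = [
    "fixedStep ch=GL000219.1 start=52818 step=1\n",
    "1.000000\n",
    "fixedStep ch=M start=52959 step=1\n",
    "1.000000\n",
    "1.000000\n",
]
kept = list(filter_blocks(sample, l))
```

With real files, the same generator could be passed to outfile.writelines(filter_blocks(f, l)) inside the with block, which still processes the input one line at a time.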

2 Answers

1

The flag is not handled correctly:

good_data = True

It is set once before the loop, and the inner for loop then overwrites it for every entry in l, so only the last entry decides whether a block is kept.

You can write it like this:

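l = ["ch=GL000219.1", "ch=GL000320.1"]  # identifiers whose blocks should be dropped
flag = False  # True while the current block should be written

with open('file.txt', 'r') as f, open('outfile.txt', 'w') as outfile:
    for line in f:
        # Re-evaluate the flag only on identifier lines.
        if line.strip().startswith("fixedStep"):
            # Keep the block only if no unwanted identifier occurs in the line.
            flag = all(i not in line for i in l)
        if flag:
            outfile.write(line)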
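
One caveat with a substring test such as i not in line is that it would also match a longer name that merely starts with the same text. A possible variant, sketched here under the assumption that every identifier line has the form "fixedStep key=value ..." as in the question, compares the exact value of the ch= field instead. The function name keep_wanted_blocks and the sample list are hypothetical.

```python
def keep_wanted_blocks(lines, unwanted_chroms):
    """Yield lines of blocks whose ch= value is not in unwanted_chroms."""
    unwanted = set(unwanted_chroms)
    keep = False  # lines before the first identifier are dropped
    for line in lines:
        if line.lstrip().startswith("fixedStep"):
            # Parse "key=value" tokens after the leading "fixedStep" word.
            fields = dict(
                tok.split("=", 1) for tok in line.split()[1:] if "=" in tok
            )
            # Exact comparison of the ch= value, not a substring test.
            keep = fields.get("ch") not in unwanted
        if keep:
            yield line


# Hypothetical in-memory stand-in for the lines of file.txt.
sample = [
    "fixedStep ch=GL000219.1 start=52818 step=1\n",
    "1.000000\n",
    "fixedStep ch=GL000320.1 start=52959 step=1\n",
    "1.000000\n",
    "fixedStep ch=M start=52959 step=1\n",
    "1.000000\n",
    "1.000000\n",
]
result = list(keep_wanted_blocks(sample, ["GL000219.1", "GL000320.1"]))
```

Note that the list here holds the bare names without the ch= prefix, and a set makes each lookup cheap even when many identifiers have to be removed.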

4 Comments

It removes every line below the identifiers, even the ones that I am interested in.
@john What do you mean by "removes every line"? I didn't understand.
Every identifier has some lines below it (as in the example). I would like to remove the identifiers that I am not interested in, together with the lines that follow them. There are also identifiers that I am interested in, and I want to keep them and the corresponding lines below them, as in the example.
@john I understand. I updated the code, is that what you want ?
0

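If you read the whole file into a single string (for example with f = f.read()), you need to split that string into lines before looping over it. Using

print(f)

after the read shows that f is one string, not a list of lines.

If the file uses Unix line endings, use

f = f.split("\n")

to convert the string into a list, which you can then loop over line by line. Iterating directly over the file object, as the question's code does, already yields one line at a time, so this step is only needed after read().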
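
As a small illustration of the splitting step, the sketch below compares split("\n") with str.splitlines() on a short sample string. The variable names and the sample text are assumptions made for this example; splitlines() may be the more convenient choice because it does not leave a trailing empty string and also handles Windows line endings.

```python
# Hypothetical sample standing in for the result of reading a whole file.
text = "fixedStep ch=M start=52959 step=1\n1.000000\n1.000000\n"

# split("\n") leaves an empty string after the final newline.
by_split = text.split("\n")

# splitlines() drops the line endings without a trailing empty entry.
by_splitlines = text.splitlines()

# The same content with Windows line endings splits into the same lines.
windows_lines = text.replace("\n", "\r\n").splitlines()
```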
