write specific lines from a file python

Question

I have two files: one contains a list of loci (Loci.txt, about 16 million lines) and the other contains a list of line numbers (Pos.txt). I want to write only the lines of Loci.txt that are specified in Pos.txt to a new file. Below is a truncated version of the two files:

Loci.txt

Pos.txt

Here is the code I have written for the task

#!/usr/bin/env python

import os
import sys

F1 = sys.argv[1]
F2 = sys.argv[2]
F3 = sys.argv[3]

File1 = open(F1).readlines()
File2 = open(F2).readlines()
File3 = open(F3, 'w')
Lines = []

for line in File1:
    Lines.append(int(line))

for i, line in enumerate(File2):
    if i+1 in Lines:
        File3.write(line)

The code works exactly like I want it to and the output looks like this

OUT.txt

The problem is that when I apply this to my whole data set, where I have to pull some 13 million lines from a file containing 16 million lines, it takes forever to complete. Is there any way I can write this code so that it will run faster?

Well... you don't have to read the whole of File2 into memory at once, which would save memory. Also, you could buffer the output in memory instead of calling .write() for each matching line.
– Maxime Lorant, Commented Jun 4, 2014 at 7:51
One question: how long is the Pos file in your real data?
– sundar nataraj, Commented Jun 4, 2014 at 7:58

Nicolas Defranoux · Accepted Answer · 2014-06-04 08:28:31Z

1

Your code is slow mostly because you search a list to decide whether each line has to be written: if i+1 in Lines. Each time, the program scans the list until it finds the line number, or to the very end if the number is not there. A dictionary lookup takes roughly constant time instead.
You can replace:

Lines = []

for line in File1:
    Lines.append(int(line))

with:

Lines = {}

for line in File1:
    Lines[int(line)] = True
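
A minimal sketch of the same idea using a set, which gives the same constant-time membership test as the dictionary without the dummy True values. It also iterates over the large file lazily instead of calling readlines(). The function name `write_selected_lines` and the sample data are assumptions made for illustration, not part of the original script.

```python
import os
import tempfile


def write_selected_lines(pos_path, loci_path, out_path):
    """Copy the lines of loci_path whose 1-based numbers appear in pos_path."""
    # A set gives average O(1) membership tests, unlike a list's O(n) scan.
    with open(pos_path) as pos_file:
        wanted = {int(line) for line in pos_file if line.strip()}

    written = 0
    # Read the large file one line at a time rather than with readlines().
    with open(loci_path) as loci_file, open(out_path, "w") as out_file:
        for number, line in enumerate(loci_file, start=1):
            if number in wanted:
                out_file.write(line)
                written += 1
    return written


# Tiny demonstration with made-up data (hypothetical file contents).
workdir = tempfile.mkdtemp()
pos_path = os.path.join(workdir, "Pos.txt")
loci_path = os.path.join(workdir, "Loci.txt")
out_path = os.path.join(workdir, "OUT.txt")

with open(pos_path, "w") as f:
    f.write("1\n3\n4\n")
with open(loci_path, "w") as f:
    f.write("locus_a\nlocus_b\nlocus_c\nlocus_d\nlocus_e\n")

count = write_selected_lines(pos_path, loci_path, out_path)
with open(out_path) as f:
    result = f.read()
print(count, repr(result))
```

Because the output is written while scanning Loci.txt, the selected lines keep the order they have in that file, regardless of the order of the numbers in Pos.txt.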


Zhewriix · Accepted Answer · 2014-06-04 08:09:22Z

0

You could try something like this:

import sys

F1 = sys.argv[1]
F2 = sys.argv[2]
F3 = sys.argv[3]

File1 = open(F1)
File2 = open(F2)
File3 = open(F3, 'w')

for linenumber in File2:
    for line in File1:
        if linenumber in line:
            File3.write(line)
            break

This might look terrible due to the nested for-loops, but since we are iterating over the lines of a file, the script simply continues from where it left off when the last line was found. This is because of how reading files works: a pointer keeps track of your location in the file. To read from the beginning of the file again, you would have to use the seek function to move the pointer back to the file's start.
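
Note that `linenumber in line` tests whether the text of the position occurs inside the locus line, not whether the line sits at that position. Below is a sketch of the same pointer idea that compares line numbers instead. It assumes the positions are 1-based, sorted in ascending order, and free of duplicates; the name `copy_sorted_positions` and the sample data are hypothetical.

```python
import io


def copy_sorted_positions(pos_file, loci_file, out_file):
    """Single pass over both files; pos_file must be sorted ascending."""
    # enumerate() returns an iterator, so it remembers where it stopped.
    numbered = enumerate(loci_file, start=1)
    written = 0
    for raw in pos_file:
        if not raw.strip():
            continue  # ignore blank lines in the position file
        target = int(raw)
        # Resume reading where the previous search ended.
        for number, line in numbered:
            if number == target:
                out_file.write(line)
                written += 1
                break
    return written


# Made-up in-memory data standing in for Pos.txt and Loci.txt.
pos = io.StringIO("2\n3\n5\n")
loci = io.StringIO("a\nb\nc\nd\ne\nf\n")
out = io.StringIO()
n = copy_sorted_positions(pos, loci, out)
print(n, repr(out.getvalue()))
```

If the positions are not sorted, the inner loop runs past the wanted line and exhausts the file, so the output ends up with far fewer lines than expected.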


2 Comments

iksaglam Over a year ago

the code is fast but when I apply it to the true data I end up with only 589 lines instead of 13398648

Zhewriix Over a year ago

Is it possible that either the Loci or Pos files' line numbers aren't in order?

Burhan Khalid · Accepted Answer · 2014-06-04 08:35:04Z

As others have mentioned, reading the entire file in memory first is what is causing the problem(s). Here is an alternative approach, which scans the large file and writes out only those lines that match.

with open('search_keys.txt', 'r') as f:
    filtered_keys = [line.rstrip() for line in f]

with open('large_file.txt', 'r') as haystack, open('output.txt', 'w') as results:
    for line in haystack:
        if line.strip():  # skip blank lines
            if line.split()[1] in filtered_keys:
                results.write(line)  # line already ends with a newline

This way you only read the big file one line at a time and write out the results at the same time.

Keep in mind that this won't sort the output.

If your search_keys.txt file is very large, converting filtered_keys to a set will improve lookup times, because membership tests on a set do not scan every element the way they do on a list.
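
A sketch of that set conversion, keeping this answer's assumption that the key is the second whitespace-separated column of each line (rather than the line number). The function name `filter_by_key` and the sample data are made up for illustration.

```python
import io


def filter_by_key(key_lines, haystack, results):
    """Write haystack lines whose second column is one of the keys."""
    # Build the set once; each later lookup is then average O(1).
    keys = {line.strip() for line in key_lines if line.strip()}
    kept = 0
    for line in haystack:
        fields = line.split()
        # Skip blank lines and lines that have no second column.
        if len(fields) > 1 and fields[1] in keys:
            results.write(line)  # line keeps its own trailing newline
            kept += 1
    return kept


# Made-up in-memory data standing in for the two input files.
keys = io.StringIO("101\n103\n")
data = io.StringIO("chr1 101\nchr1 102\n\nchr2 103\n")
out = io.StringIO()
kept = filter_by_key(keys, data, out)
print(kept, repr(out.getvalue()))
```

The function accepts any iterable of lines, so open file objects can be passed in place of the StringIO buffers.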

Salvatore Avanzo · Accepted Answer · 2014-06-04 08:35:53Z

0

You can try this code:

#!/usr/bin/env python

with open("loci.txt") as File1:
    lociDic = {int(line.split()[1]): line.split()[0] for line in File1}

with open("pos.txt") as File2:
    with open("result.txt", 'w') as File3:
        for line in File2:
            if int(line) in lociDic:
                File3.write(' '.join([lociDic[int(line)], line]))

Key points in this solution are:

Build a lookup table in the first step (a dictionary is used)
Avoid reading the entire File2 at once (iterate over the file object inside the with statement)

I also match on the integer codes contained in File1 and File2 rather than on line position, because I assume there may be gaps in the File1 sequence. If there are none, other solutions are possible.

edited Jun 4, 2014 at 8:35

answered Jun 4, 2014 at 8:16


4 Comments

iksaglam Over a year ago

the code works real fast but for some reason I only get 589 lines instead of 13398648

Salvatore Avanzo Over a year ago

@iksaglam: Maybe some rows in your file don't match the pattern you showed. I'm sorry.

Nicolas Defranoux Over a year ago

This code does not output the lines whose numbers are in pos.txt; it outputs lines that contain a number after a space when that number is in pos.txt.

Salvatore Avanzo Over a year ago

@NicolasDefranoux: Exactly, as I wrote at the bottom. I did not assume that pos.txt contains the line numbers of loci.txt to retain, but rather codes that have to match the codes contained in loci.txt. That's all.
