I have two files: Loci.txt, which contains a list of loci (about 16 million lines, to be exact), and Pos.txt, which contains a list of line numbers. I want to write only the lines of Loci.txt whose line numbers are listed in Pos.txt to a new file. Below is a truncated version of the two files:
Loci.txt
R000001 1
R000001 2
R000001 3
R000001 4
R000001 5
R000001 6
R000001 7
R000001 8
R000001 9
R000001 10
Pos.txt
1
3
5
9
10
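Here is the code I have written for the task:
#!/usr/bin/env python
import os
import sys

F1 = sys.argv[1]  # file of line numbers (Pos.txt)
F2 = sys.argv[2]  # file of loci (Loci.txt)
F3 = sys.argv[3]  # output file

File1 = open(F1).readlines()
File2 = open(F2).readlines()
File3 = open(F3, 'w')

# Collect the wanted line numbers as integers
Lines = []
for line in File1:
    Lines.append(int(line))

# Write each line whose 1-based number is in the list
for i, line in enumerate(File2):
    if i+1 in Lines:
        File3.write(line)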
The code works exactly as I want it to, and the output looks like this:
OUT.txt
R000001 1
R000001 3
R000001 5
R000001 9
R000001 10
The problem is that when I apply this to my whole data set, where I have to pull some 13 million lines from a file containing 16 million lines, it takes forever to complete. Is there any way I can write this code so that it will run faster?
Don't read File2 all at once first; iterating over it line by line saves memory. Plus, you should maybe write with a memory buffer instead of calling .write() for each line found.
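A minimal sketch of how these suggestions might look in Python, assuming the same argument order as the question's script (line numbers file first, loci file second, output file third). The function name extract_lines is hypothetical. Besides streaming the loci file and handing the output to writelines() so Python's own file buffering batches the writes, it stores the line numbers in a set rather than a list. That part is an addition not stated above: testing membership in a list scans it element by element, so with 13 million entries the `in` check is very likely the main cost.

```python
#!/usr/bin/env python
import sys


def extract_lines(pos_path, loci_path, out_path):
    """Copy the lines of loci_path whose 1-based numbers appear in pos_path."""
    # A set gives average constant-time membership tests,
    # whereas "in" on a list scans the list element by element.
    with open(pos_path) as pos_file:
        wanted = {int(line) for line in pos_file if line.strip()}

    with open(loci_path) as loci_file, open(out_path, "w") as out_file:
        # Iterating over the file object streams it line by line instead of
        # loading all lines with readlines(); writelines() consumes the
        # generator and relies on the file object's internal write buffer.
        out_file.writelines(
            line
            for number, line in enumerate(loci_file, start=1)
            if number in wanted
        )


if __name__ == "__main__":
    # Expected arguments: Pos.txt Loci.txt OUT.txt
    if len(sys.argv) == 4:
        extract_lines(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved under a name of your choosing, it would be run the same way as the original script, for example with Pos.txt, Loci.txt and OUT.txt as the three arguments. Output lines appear in the order they occur in Loci.txt, as in the original code.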