2

I have two files: Loci.txt, which holds a list of loci (about 16 million, to be exact), and Pos.txt, which holds a list of line numbers. I want to write only the lines of Loci.txt whose line numbers are listed in Pos.txt to a new file. Below is a truncated version of the two files:

Loci.txt

R000001 1
R000001 2
R000001 3
R000001 4
R000001 5
R000001 6
R000001 7
R000001 8
R000001 9
R000001 10

Pos.txt

1
3
5
9
10

Here is the code I have written for the task

#!/usr/bin/env python

import os
import sys

F1 = sys.argv[1]
F2 = sys.argv[2]
F3 = sys.argv[3]

File1 = open(F1).readlines()
File2 = open(F2).readlines()
File3 = open(F3, 'w')
Lines = []

for line in File1:
    Lines.append(int(line))

for i, line in enumerate(File2):
    if i+1 in Lines:
        File3.write(line)

The code works exactly as I want it to, and the output looks like this:

OUT.txt

R000001 1
R000001 3
R000001 5
R000001 9
R000001 10

The problem is that when I apply this to my whole data set, where I have to pull some 13 million lines from a file containing 16 million lines, it takes forever to complete. Is there any way I can write this code so that it will run faster?

3
  • Well... you don't have to read all of File2 at once; reading it line by line would save memory. You could also write through a memory buffer instead of calling .write() for every matching line. Commented Jun 4, 2014 at 7:51
  • One question: how long is the pos file in your real data? Commented Jun 4, 2014 at 7:58
  • In my real data the pos file is 13398648 lines long. Commented Jun 4, 2014 at 8:09

4 Answers

1

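Your code is slow mostly because of the test if i+1 in Lines, which searches a list to decide whether the current line has to be written. Each time, your program scans the list element by element until it finds the line number or reaches the end, so with about 13 million entries the total work grows roughly with the product of the two file lengths. A dictionary lookup, in contrast, takes about the same time no matter how many keys it holds.
You can replace: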

Lines = []

for line in File1:
    Lines.append(int(line))

with:

Lines = {}

for line in File1:
    Lines[int(line)] = True
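
Since only the keys of that dictionary are ever tested, a set expresses the same idea a little more directly. Below is a minimal sketch, assuming the positions file holds one 1-based line number per line. The function name filter_lines and the in-memory io.StringIO stand-ins for the two files are illustrative and not part of the original script:

```python
import io


def filter_lines(pos_file, loci_file, out_file):
    """Copy the lines of loci_file whose 1-based number appears in pos_file."""
    # A set gives average constant-time membership tests,
    # whereas "in" on a list scans the elements one by one.
    wanted = {int(line) for line in pos_file if line.strip()}
    for number, line in enumerate(loci_file, start=1):
        if number in wanted:
            out_file.write(line)


# Small in-memory stand-ins for Pos.txt and Loci.txt (hypothetical data).
pos = io.StringIO("1\n3\n5\n")
loci = io.StringIO("".join("R000001 {}\n".format(i) for i in range(1, 7)))
out = io.StringIO()
filter_lines(pos, loci, out)
print(out.getvalue(), end="")
```

With real files, the same function can be called with objects returned by open(); iterating over the file object directly also avoids loading the whole loci file with readlines().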

0

You could try something like this:

import sys

F1 = sys.argv[1]
F2 = sys.argv[2]
F3 = sys.argv[3]

File1 = open(F1)
File2 = open(F2)
File3 = open(F3, 'w')

for linenumber in File2:
    for line in File1:
        if linenumber in line:
            File3.write(line)
            break

This might look terrible because of the nested for loops, but since both loops iterate over file objects, the inner loop simply continues from where it left off when the previous match was found. This follows from how file reading works: a pointer keeps track of your current location in the file. To read from the beginning of the file again, you would have to call seek to move the pointer back to the start. Note that the approach only works if both files are in the same order.
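
The same single-pass idea can be written so that it compares actual line numbers rather than testing whether one line's text occurs inside another. This is only a sketch under two assumptions: the positions are 1-based line numbers, and they are sorted in ascending order without duplicates. The name filter_sorted and the io.StringIO stand-ins are hypothetical:

```python
import io


def filter_sorted(pos_file, loci_file, out_file):
    """Single pass over both files; assumes pos_file is sorted ascending."""
    # One shared iterator, so each inner loop resumes where the last one stopped.
    numbered = enumerate(loci_file, start=1)
    for raw in pos_file:
        if not raw.strip():
            continue  # skip blank lines in the positions file
        target = int(raw)
        for number, line in numbered:
            if number == target:
                out_file.write(line)
                break


# Small in-memory stand-ins for Pos.txt and Loci.txt (hypothetical data).
pos = io.StringIO("1\n3\n5\n9\n10\n")
loci = io.StringIO("".join("R000001 {}\n".format(i) for i in range(1, 11)))
out = io.StringIO()
filter_sorted(pos, loci, out)
print(out.getvalue(), end="")
```

If the positions are not sorted, the inner loop runs past the wanted line and silently drops every later match, so a set-based lookup is the safer choice in that case.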

2 Comments

The code is fast, but when I apply it to the real data I end up with only 589 lines instead of 13398648.
Is it possible that either the Loci or Pos files' line numbers aren't in order?
0

As others have mentioned, reading the entire file into memory first is part of the problem. Here is an alternative approach, which scans the large file and writes out only those lines that match.

with open('search_keys.txt', 'r') as f:
    filtered_keys = [line.rstrip() for line in f]

with open('large_file.txt', 'r') as haystack, open('output.txt', 'w') as results:
    for line in haystack:
        if line.strip():  # Skip blank lines
            # Match on the second whitespace-separated column
            if line.split()[1] in filtered_keys:
                # line already ends with a newline, so write it unchanged
                results.write(line)

This way you only read the big file one line at a time and write out the results at the same time.

Keep in mind that this won't sort the output.

If your search_keys.txt file is very large, converting filtered_keys to a set will improve look up times.
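
To get a feel for the difference, here is a rough timing sketch. The key values are made up as a stand-in for the contents of search_keys.txt, and the absolute numbers will vary from machine to machine:

```python
import timeit

# Made-up keys standing in for the lines of search_keys.txt.
keys = [str(n) for n in range(20000)]
key_set = set(keys)  # same keys, hashed for fast lookup
probe = "19999"      # worst case for the list: the last element

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in keys, number=200)
set_time = timeit.timeit(lambda: probe in key_set, number=200)

print("list: {:.6f}s  set: {:.6f}s".format(list_time, set_time))
```

In the snippet above, the change amounts to one extra line after the keys are read, for example filtered_keys = set(filtered_keys).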


0

You can try this code:

#!/usr/bin/env python

with open("loci.txt") as File1:
    lociDic = {int(line.split()[1]): line.split()[0] for line in File1}

with open("pos.txt") as File2:
    with open("result.txt", 'w') as File3:
        for line in File2:
            if int(line) in lociDic:
                File3.write(' '.join([lociDic[int(line)], line]))

Key points in this solution are:

  1. Build the lookup in the first step (a dictionary keyed by the number in the second column of File1)
  2. Avoid reading the entire File2 at once (by iterating over the file object inside a with statement)

I also match on the integer codes contained in File1 and File2, rather than on line positions, because I assume the File1 sequence may contain gaps. If it does not, other solutions are possible.

4 Comments

The code runs really fast, but for some reason I only get 589 lines instead of 13398648.
@iksaglam: Maybe some rows in your file don't match the pattern you showed. I'm sorry.
This code does not output the lines whose line numbers are in pos.txt; it outputs the lines whose number after the space appears in pos.txt.
@NicolasDefranoux: Exactly, as I wrote at the bottom. I did not assume that pos.txt contains the line numbers of loci.txt to keep, but rather a code that has to match the one contained in loci.txt. That's all.
