3

I have two moderately large ASCII files containing data in a fixed format. I need to test whether six given fields in a line of the first file match (within a given tolerance) six fields on any line of the second file, and then output the common line for further processing.

I am currently splitting each line in a file using a Fortran-style line reader and generating a list of lists, with the correct type for each element in each list. I am storing the lists of lists from both files in memory whilst I operate on them.
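For context, a minimal sketch of the kind of fixed-width reader described above; the column layout, widths, types, and file names here are made up purely for illustration:

# hypothetical fixed-width layout: (start, end, type) for each field
widths = [(0, 10, str), (10, 20, float), (20, 30, float), (30, 40, float),
          (40, 50, float), (50, 60, float), (60, 70, float)]

def parse_line(line):
    # slice out each field and convert it to its proper type
    return [typ(line[a:b]) for a, b, typ in widths]

file1List = [parse_line(line) for line in open('file1.txt')]
file2List = [parse_line(line) for line in open('file2.txt')]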

The fields I need to compare are all floats and I am currently using the following type of flow:

tol = 0.01
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2

This executes fine over a file containing only a small number of lines, but over 30,000 lines this part alone takes more than a minute!

I am fairly certain there must be a much faster comparison method, but I am struggling to find it; any help would be appreciated.

2 Comments
  • I'm confused. You say you only need to look at the first line of File1, but then it looks like you look at all the lines of file1 ... Also, do you have numpy? That could probably speed things up a bit. Commented Jul 23, 2012 at 17:58
  • Sorry not to have been clearer @mgilson - I need to loop through the whole of file1 and check to see if there are common data with file2. I do have numpy installed but am unclear how I can use it to speed this up - numpy is a bit of a new adventure, and up until now I have never needed to optimise my code at all, having only used Python for simple file manipulations in the past. Commented Jul 23, 2012 at 18:00

4 Answers 4

2

If you can store the elements in file1list and file2list as numpy arrays, your comparison becomes a good bit simpler (and probably faster too):

for entry1 in file1list:
    for entry2 in file2list:
        if np.all(np.abs(entry1[1:7] - entry2[1:7]) < tol):
            print entry1, entry2

Converting to numpy arrays is painless:

file1list = [ np.array(entry) for entry in file1list ]

This gets even a little better with a sorted file2list:

import operator

file2list = sorted(file2list, key=operator.itemgetter(1, 2, 3, 4, 5, 6))
for entry1 in file1list:
    for entry2 in file2list:
        d = np.abs(entry1[1:7] - entry2[1:7])
        if np.all(d < tol):
            print entry1, entry2
        elif entry2[1] - entry1[1] > tol:
            # sorted on field 1, so no later entry2 can match this entry1
            break

5 Comments

Probably better to make the whole file2list a numpy structured array and use np.ndarray.sort while you're at it, just for memory efficiency.
@Dougal -- I thought about making the whole thing a pair of 2D arrays, but I don't think the memory is a very big issue here and I figured that I would try to preserve the structure of the original code as much as possible. But you're absolutely correct that it is possible (and possibly a better way to store the data -- depending on what happens to the data later). -- also, based on my experience, I'm not very confident I would get it right if I attempted that without having a good test to try it on ;)
@mgilson this looks like a route I will investigate further - thanks a lot. I have never used a forum like this before; it is great that help is so forthcoming. Can I put the question into a different state whilst I try the various points that have been raised by you and others?
@mike_pro -- people will just keep on adding answers until you accept one (and sometimes even afterward if they have a much better solution). But don't worry about it. Take your time, figure out which answer suits you best. Part of the fun is learning from other people's answers. (I personally think the ideas behind Dougal's solution are brilliant -- and bisecting as per DSM's solution is a great idea too.)
@mgilson don't get me wrong - I will try those methods too - they just require me to do a bit more reading first to understand what is happening :-) I will get there, though - even if it takes a while, and it normally does - thanks again.
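Following up on the comment thread above, a minimal sketch of the fully vectorized variant Dougal suggests, assuming the six compared fields of file2list are packed into a single 2D numpy array (untested):

import numpy as np

# pack the six compared fields of every line of file2 into one (m, 6) array
arr2 = np.array([entry[1:7] for entry in file2list])

for entry1 in file1list:
    # one vectorized comparison of entry1 against every row of file2 at once
    hits = np.all(np.abs(arr2 - np.asarray(entry1[1:7])) < tol, axis=1)
    for i in np.flatnonzero(hits):
        print entry1, file2list[i]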
1

The issue with your current method is that you compare every line in the first file against every line in the second file. To cut down the running time you need a way to short-circuit the inner loop, with logic like: "if this entry satisfies some condition, then I can break out of this loop, because it is impossible for any subsequent entry to match".

To support this logic, you will first need to sort both lists, for example:

from operator import itemgetter
comp_fields = itemgetter(1, 2, 3, 4, 5, 6)
file1List.sort(key=comp_fields)
file2List.sort(key=comp_fields)
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2
        elif entry2[1] - entry1[1] >= tol:
            # we know subsequent lines in file2List can't match due to ordering
            break

This is a pretty naive solution. One possible improvement would be to keep track of starting points: if on the previous iteration over file1List we went 2000 lines into file2List before finding a match, then on the next iteration we can start scanning file2List at line 2000 (a sketch of this idea follows below). It also only short-circuits on the first column; you may be able to add logic to short-circuit on the subsequent columns as well, although this gets tricky because of the tolerance, as opposed to requiring exact matches.
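A sketch of that starting-point idea, under the same sorting as above (untested); the invariant is that since both lists are sorted on field 1, the earliest possible match in file2List can only move forward as entry1[1] grows:

start = 0
for entry1 in file1List:
    # entries this far behind entry1[1] can never match this or any later entry1
    while start < len(file2List) and entry1[1] - file2List[start][1] >= tol:
        start += 1
    for entry2 in file2List[start:]:
        if entry2[1] - entry1[1] >= tol:
            break  # ordering: everything later is too large on field 1
        if all(abs(entry1[i] - entry2[i]) < tol for i in range(1, 7)):
            print entry1, entry2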


1

Your current processing is O(n m), where n is the number of lines in file 1 and m the number in file 2. When n ~ m ~ 30,000, it's doing 900,000,000 comparisons: no surprise it takes a minute!

This is made a little more complicated by the fact that you're allowed to be off by a tolerance: if it were exact matching, you could just make a dictionary of the values from one of the files and then have O(m) time to build the dictionary and O(n) time to do the lookups, since hash-based dictionaries have constant-time lookups.
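For contrast, the exact-match version might look like the following sketch, using tuples of the six fields as dictionary keys (hypothetical, untested):

from collections import defaultdict

# index every line of file2 by the exact tuple of its six compared fields
index = defaultdict(list)
for entry2 in file2List:
    index[tuple(entry2[1:7])].append(entry2)

# constant-time lookups instead of an O(n m) double loop
for entry1 in file1List:
    for entry2 in index.get(tuple(entry1[1:7]), []):
        print entry1, entry2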

One possibility that's somewhat similar would be to make a dictionary of rounded values, so that things within tol are keyed by the same value, and then make sure that everything's within tol when you find a possible match. You could do that by rounding each of the entries to something slightly bigger than tol: that is, if tol is 1e-3, key based on the entries rounded to 1e-2. How effective this is depends on the distribution of your values versus tol, but it should be pretty good.

That is, do something like this (untested, but it should probably work):

import collections
import math
import operator

cmp_fields = operator.itemgetter(1, 2, 3, 4, 5, 6)

# higher-order function to get a key-getting function for a given tolerance
def key_getter(tol):
    # round to one decimal place coarser than tol, e.g. tol = 1e-3 -> prec = 2
    prec = int(-math.log10(tol)) - 1
    def inner(entry):
        return tuple(round(x, prec) for x in cmp_fields(entry))
    return inner

# make the key function we want here (tol as defined in the question)
key_fn = key_getter(tol)

# build the dictionary of entries from file2: key maps to a list of entries
file_2_entries = collections.defaultdict(list)
for entry in file2List:
    file_2_entries[key_fn(entry)].append(entry)

for entry1 in file1List:  # for each entry from file1...
    # check only the potential matches based on rounding
    for entry2 in file_2_entries[key_fn(entry1)]:
        if all(abs(x1 - x2) < tol for x1, x2
               in zip(cmp_fields(entry1), cmp_fields(entry2))):
            print entry1, entry2

2 Comments

yeah I understand the reason for my current club-handed approach taking a long time to evaluate, and yes, if it was an exact comparison it would be easier, I think. I am not sure how to implement a dictionary requiring multiple keys to the final value, unless it was a dictionary of dictionaries (several levels deep).
@mike_pro See the code in my edit: in Python you can just use tuples as dictionary keys.
0

You can sort both file1List and file2List according to the value of

entry1[1]^2 + entry1[2]^2 + ... + entry1[6]^2

These are two lists of vector norms in six-space. Given a vector in file1List, you can find the position of vectors of similar length in file2List using the Newton-Raphson method (or bisection, which Python's bisect module already supports).

Scanning file2List for values matching the next value of file1List, you will only need to scan forward, and the first value in file2List whose norm exceeds the tolerance window marks the end of the search for that particular value of file1List.

I believe the algorithm for the original JOIN utility was more or less similar.
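A sketch of this idea using Python's bisect module (untested); norm_window is a hypothetical, data-dependent bound, since two entries whose fields all agree within tol can still differ in squared norm by roughly 2*tol*sum(|x_i|):

import bisect

def norm2(entry):
    # squared Euclidean length of the six compared fields
    return sum(entry[i] ** 2 for i in range(1, 7))

file2List.sort(key=norm2)
norms = [norm2(e) for e in file2List]  # parallel list of norms for bisect

norm_window = 1.0  # hypothetical bound; must be chosen from the data's magnitudes

for entry1 in file1List:
    n1 = norm2(entry1)
    start = bisect.bisect_left(norms, n1 - norm_window)
    for entry2 in file2List[start:]:
        if norm2(entry2) > n1 + norm_window:
            break  # sorted by norm, so nothing later can be close enough
        if all(abs(entry1[i] - entry2[i]) < tol for i in range(1, 7)):
            print entry1, entry2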

