3

I have two moderately large ASCII files containing data in a fixed format. I need to test whether six given fields in a line of the first file match (within a given tolerance) six fields on any line of the second file, and then output the common line for further processing.

I am currently splitting each line in a file using a Fortran-style line reader and generating a list of lists, with the correct type for each element in each list. I am storing the lists of lists from both files in memory whilst I operate on them.
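For context, a minimal sketch of the kind of fixed-width reader described above; the column layout, widths, types, and file names here are made up purely for illustration:

# hypothetical fixed-width layout: (start, end, type) for each field
widths = [(0, 10, str), (10, 20, float), (20, 30, float), (30, 40, float),
          (40, 50, float), (50, 60, float), (60, 70, float)]

def parse_line(line):
    # slice out each field and convert it to its proper type
    return [typ(line[a:b]) for a, b, typ in widths]

file1List = [parse_line(line) for line in open('file1.txt')]
file2List = [parse_line(line) for line in open('file2.txt')]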

The fields I need to compare are all floats and I am currently using the following type of flow:

tol = 0.01
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2

This executes fine over a file containing only a small number of lines, but over 30,000 lines this part alone takes more than a minute!

I am fairly certain there must be a much faster comparison method, but I am struggling to find it; any help would be appreciated.

2 Comments
  • I'm confused. You say you only need to look at the first line of File1, but then it looks like you look at all the lines of file1 ... Also, do you have numpy? That could probably speed things up a bit. Commented Jul 23, 2012 at 17:58
  • Sorry not to have been clearer @mgilson - I need to loop through the whole of file1 and check to see if there are common data with file2. I do have numpy installed but am unclear how I can use it to speed this up - numpy is a bit of a new adventure, and up until now I have never needed to optimise my code at all, having only used Python for simple file manipulations in the past. Commented Jul 23, 2012 at 18:00

4 Answers 4

2

If you can store the elements in file1list and file2list as numpy arrays, your comparison becomes a good bit simpler (and probably faster too):

for entry1 in file1list:
    for entry2 in file2list:
        if np.all(np.abs(entry1[1:7] - entry2[1:7]) < tol):
            print entry1, entry2

Converting to numpy arrays is painless:

file1list = [ np.array(entry) for entry in file1list ]

This gets even a little better with a sorted file2list:

import operator

file2list = sorted(file2list, key=operator.itemgetter(1, 2, 3, 4, 5, 6))
for entry1 in file1list:
    for entry2 in file2list:
        d = np.abs(entry1[1:7] - entry2[1:7])
        if np.all(d < tol):
            print entry1, entry2
        elif entry2[1] - entry1[1] > tol:
            # sorted on field 1, so no later entry2 can match this entry1
            break

5 Comments

Probably better to make the whole file2list a numpy structured array and use np.ndarray.sort while you're at it, just for memory efficiency.
@Dougal -- I thought about making the whole thing a pair of 2D arrays, but I don't think the memory is a very big issue here and I figured that I would try to preserve the structure of the original code as much as possible. But you're absolutely correct that it is possible (and possibly a better way to store the data -- depending on what happens to the data later). -- also, based on my experience, I'm not very confident I would get it right if I attempted that without having a good test to try it on ;)
@mgilson this looks like a route I will investigate further - thanks a lot. I have never used a forum like this before; it is great that help is so forthcoming. Can I put the question into a different state whilst I try the various points that have been raised by you and others?
@mike_pro -- people will just keep on adding answers until you accept one (and sometimes even afterward if they have a much better solution). But don't worry about it. Take your time, figure out which answer suits you best. Part of the fun is learning from other people's answers. (I personally think the ideas behind Dougal's solution are brilliant -- and bisecting as per DSM's solution is a great idea too.)
@mgilson don't get me wrong - I will try those methods too - they just require me to do a bit more reading first to understand what is happening :-) I will get there, though - even if it takes a while, and it normally does - thanks again.
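Following up on the comment thread above, a minimal sketch of the fully vectorized variant Dougal suggests, assuming the six compared fields of file2list are packed into a single 2D numpy array (untested):

import numpy as np

# pack the six compared fields of every line of file2 into one (m, 6) array
arr2 = np.array([entry[1:7] for entry in file2list])

for entry1 in file1list:
    # one vectorized comparison of entry1 against every row of file2 at once
    hits = np.all(np.abs(arr2 - np.asarray(entry1[1:7])) < tol, axis=1)
    for i in np.flatnonzero(hits):
        print entry1, file2list[i]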
1

The issue with your current method is that you compare every line in the first file against every line in the second file. To cut down the running time you need a way to short-circuit the inner loop, with logic like: "if this entry satisfies some condition, then I can break out of this loop, because it is impossible for any subsequent entry to match".

To support this logic, you will first need to sort both lists, for example:

from operator import itemgetter
comp_fields = itemgetter(1, 2, 3, 4, 5, 6)
file1List.sort(key=comp_fields)
file2List.sort(key=comp_fields)
for entry1 in file1List:
    for entry2 in file2List:
        if (abs(entry1[1] - entry2[1]) < tol and abs(entry1[2] - entry2[2]) < tol 
            and abs(entry1[3] - entry2[3]) < tol and abs(entry1[4] - entry2[4]) < tol
            and abs(entry1[5] - entry2[5]) < tol and abs(entry1[6] - entry2[6]) < tol):
            print entry1,entry2
        elif entry2[1] - entry1[1] >= tol:
            # we know subsequent lines in file2List can't match due to ordering
            break

This is a pretty naive solution. One possible improvement would be to keep track of starting points: if on the previous iteration over file1List we went 2000 lines into file2List before finding a match, then on the next iteration we can start scanning file2List at line 2000 (a sketch of this idea follows below). It also only short-circuits on the first column; you may be able to add logic to short-circuit on the subsequent columns as well, although this gets tricky because of the tolerance, as opposed to requiring exact matches.
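A sketch of that starting-point idea, under the same sorting as above (untested); the invariant is that since both lists are sorted on field 1, the earliest possible match in file2List can only move forward as entry1[1] grows:

start = 0
for entry1 in file1List:
    # entries this far behind entry1[1] can never match this or any later entry1
    while start < len(file2List) and entry1[1] - file2List[start][1] >= tol:
        start += 1
    for entry2 in file2List[start:]:
        if entry2[1] - entry1[1] >= tol:
            break  # ordering: everything later is too large on field 1
        if all(abs(entry1[i] - entry2[i]) < tol for i in range(1, 7)):
            print entry1, entry2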


1

Your current processing is O(n m), where n is the number of lines in file 1 and m the number in file 2. When n ~ m ~ 30,000, it's doing 900,000,000 comparisons: no surprise it takes a minute!

This is made a little more complicated by the fact that you're allowed to be off by a tolerance: if it were exact matching, you could just make a dictionary of the values from one of the files and then have O(m) time to build the dictionary and O(n) time to do the lookups, since hash-based dictionaries have constant-time lookups.
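For contrast, the exact-match version might look like the following sketch, using tuples of the six fields as dictionary keys (hypothetical, untested):

from collections import defaultdict

# index every line of file2 by the exact tuple of its six compared fields
index = defaultdict(list)
for entry2 in file2List:
    index[tuple(entry2[1:7])].append(entry2)

# constant-time lookups instead of an O(n m) double loop
for entry1 in file1List:
    for entry2 in index.get(tuple(entry1[1:7]), []):
        print entry1, entry2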

One possibility that's somewhat similar would be to make a dictionary of rounded values, so that things within tol are keyed by the same value, and then make sure that everything's within tol when you find a possible match. You could do that by rounding each of the entries to something slightly bigger than tol: that is, if tol is 1e-3, key based on the entries rounded to 1e-2. How effective this is depends on the distribution of your values versus tol, but it should be pretty good.

That is, do something like this (untested, but it should probably work):

import collections
import math
import operator

cmp_fields = operator.itemgetter(1, 2, 3, 4, 5, 6)

# higher-order function to get a key-getting function for a given tolerance
def key_getter(tol):
    # round to one decimal place coarser than tol, e.g. tol = 1e-3 -> prec = 2
    prec = int(-math.log10(tol)) - 1
    def inner(entry):
        return tuple(round(x, prec) for x in cmp_fields(entry))
    return inner

# make the key function we want here (tol as defined in the question)
key_fn = key_getter(tol)

# build the dictionary of entries from file2: key maps to a list of entries
file_2_entries = collections.defaultdict(list)
for entry in file2List:
    file_2_entries[key_fn(entry)].append(entry)

for entry1 in file1List:  # for each entry from file1...
    # check only the potential matches based on rounding
    for entry2 in file_2_entries[key_fn(entry1)]:
        if all(abs(x1 - x2) < tol for x1, x2
               in zip(cmp_fields(entry1), cmp_fields(entry2))):
            print entry1, entry2

2 Comments

yeah I understand the reason for my current club-handed approach taking a long time to evaluate, and yes, if it was an exact comparison it would be easier, I think. I am not sure how to implement a dictionary requiring multiple keys to the final value, unless it was a dictionary of dictionaries (several levels deep).
@mike_pro See the code in my edit: in Python you can just use tuples as dictionary keys.
0

You can sort both file1List and file2List according to the value of

entry1[1]^2 + entry1[2]^2 + ... + entry1[6]^2

These are two lists of vector norms in six-space. Given a vector in file1List, you can find the position of vectors of similar length in file2List using the Newton-Raphson method (or bisection, which Python's bisect module already supports).

Scanning file2List for values matching the next value of file1List, you will only need to scan forward, and the first value in file2List whose norm exceeds the tolerance window marks the end of the search for that particular value of file1List.

I believe the algorithm for the original JOIN utility was more or less similar.
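A sketch of this idea using Python's bisect module (untested); norm_window is a hypothetical, data-dependent bound, since two entries whose fields all agree within tol can still differ in squared norm by roughly 2*tol*sum(|x_i|):

import bisect

def norm2(entry):
    # squared Euclidean length of the six compared fields
    return sum(entry[i] ** 2 for i in range(1, 7))

file2List.sort(key=norm2)
norms = [norm2(e) for e in file2List]  # parallel list of norms for bisect

norm_window = 1.0  # hypothetical bound; must be chosen from the data's magnitudes

for entry1 in file1List:
    n1 = norm2(entry1)
    start = bisect.bisect_left(norms, n1 - norm_window)
    for entry2 in file2List[start:]:
        if norm2(entry2) > n1 + norm_window:
            break  # sorted by norm, so nothing later can be close enough
        if all(abs(entry1[i] - entry2[i]) < tol for i in range(1, 7)):
            print entry1, entry2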

