3

I want to loop through a database of documents and calculate a pairwise comparison score.

A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.

Is there a name for the algorithm for doing this task efficiently? Is there a name for this approach?

Thanks.

4 Answers 4

3

Assume all items have a number ItemNumber

Simple solution -- always have the 2nd element's ItemNumber greater than the first item.

eg

for (firstitem = 1 to maxitemnumber)
  for (seconditem = firstitemnumber+1 to maxitemnumber)
    compare(firstitem, seconditem)

visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.

........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
Sign up to request clarification or add additional context in comments.

4 Comments

I thought about that too, but maybe the OP does not want to change model classes.
@Felix : Cool... tell me what you mean by that? Model class? What is that exactly?
Well I assume that the OP somehow use a class to represent documents (the model) (maybe he uses an ORM). But I just realized that this is easy to do with list indexes. First I thought you mean to make the documents comparable. I still think you could provide a better example ;). But nevermind.
@Felix : I get it, you were implementing a solution for sets as op. to arrays / collections with a unique item number. You solution is of course the way to do it with sets. Rare that you don't have an underlying unique index to an array, but it could happen -- mostly with functional languages and non-relational DBs.
2

I don't think its complicated enough to qualify for a name.

You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.

Unique pairings:

SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id

Potentially there are a lot of ways in which the the comparison operation can be used to generate data summmaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.

C.

Comments

0

You can keep track of which documents you have already compared, e.g. (with numbers ;))

compared = set()

for i in [1,2,3]:
    for j in [1,2,3]:
        pair =  frozenset((i,j))
        if i != k and pair not in compared:
            compare.add(pair)
            compare(i,j)

Another idea would be to create the combination of documents first and iterate over them. But in order to generate this, you have to iterate over both lists and the you iterate over the result list again so I don't think that it has any advantage.

Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:

docs = [1,2,3]
l = len(docs)
for i in range(l):
    for j in range(i+1,l):
        compare(l[i],l[j])

3 Comments

My solution is much faster. No overhead of the pair stuff.
Doesn't this implementation turn the complexity from an n^2 to an n^3? Not only do you have the nested loop, but the "pair not in compared" is an O(n) operation. There are easier/more efficient implementations. But it seems like @seanieb is satisfied.
@Traveling Tech Guy: True, I changed from lists to set and frozenset. I think, a set has access of O(1) as it is implemented as dictionary keys.
0

Something like this?

src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i:]:
        compare(x, y)

Or you might wish to generate a list of pairs instead:

pairs = [(x, y) for i, x in enumerate(src) for y in src[i:]]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.