Compare DB row values efficiently

Question

I want to loop through a database of documents and calculate a pairwise comparison score.

A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.

Is there a name for the algorithm for doing this task efficiently? Is there a name for this approach?

Thanks.

Hogan · Accepted Answer · 2010-03-11 15:44:27Z

3

Assume all items have a number ItemNumber

Simple solution -- always have the 2nd element's ItemNumber greater than the first item.

eg

for (firstitem = 1 to maxitemnumber)
  for (seconditem = firstitemnumber+1 to maxitemnumber)
    compare(firstitem, seconditem)

visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.

........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.

answered Mar 11, 2010 at 15:44

Hogan

70.7k10 gold badges83 silver badges119 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Felix Kling Over a year ago

I thought about that too, but maybe the OP does not want to change model classes.

Hogan Over a year ago

@Felix : Cool... tell me what you mean by that? Model class? What is that exactly?

Felix Kling Over a year ago

Well I assume that the OP somehow use a class to represent documents (the model) (maybe he uses an ORM). But I just realized that this is easy to do with list indexes. First I thought you mean to make the documents comparable. I still think you could provide a better example ;). But nevermind.

Hogan Over a year ago

@Felix : I get it, you were implementing a solution for sets as op. to arrays / collections with a unique item number. You solution is of course the way to do it with sets. Rare that you don't have an underlying unique index to an array, but it could happen -- mostly with functional languages and non-relational DBs.

symcbean · Accepted Answer · 2010-03-11 15:41:03Z

2

I don't think its complicated enough to qualify for a name.

You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.

Unique pairings:

SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id

Potentially there are a lot of ways in which the the comparison operation can be used to generate data summmaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.

C.

answered Mar 11, 2010 at 15:41

symcbean

48.4k6 gold badges63 silver badges99 bronze badges

Comments

Felix Kling · Accepted Answer · 2010-03-11 16:03:27Z

0

You can keep track of which documents you have already compared, e.g. (with numbers ;))

compared = set()

for i in [1,2,3]:
    for j in [1,2,3]:
        pair =  frozenset((i,j))
        if i != k and pair not in compared:
            compare.add(pair)
            compare(i,j)

Another idea would be to create the combination of documents first and iterate over them. But in order to generate this, you have to iterate over both lists and the you iterate over the result list again so I don't think that it has any advantage.

Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:

docs = [1,2,3]
l = len(docs)
for i in range(l):
    for j in range(i+1,l):
        compare(l[i],l[j])

edited Mar 11, 2010 at 16:03

answered Mar 11, 2010 at 15:37

Felix Kling

820k181 gold badges1.1k silver badges1.2k bronze badges

3 Comments

Hogan Over a year ago

My solution is much faster. No overhead of the pair stuff.

Traveling Tech Guy Over a year ago

Doesn't this implementation turn the complexity from an n^2 to an n^3? Not only do you have the nested loop, but the "pair not in compared" is an O(n) operation. There are easier/more efficient implementations. But it seems like @seanieb is satisfied.

Felix Kling Over a year ago

@Traveling Tech Guy: True, I changed from lists to set and frozenset. I think, a set has access of O(1) as it is implemented as dictionary keys.

medina · Accepted Answer · 2010-03-11 17:32:47Z

0

Something like this?

src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i:]:
        compare(x, y)

Or you might wish to generate a list of pairs instead:

pairs = [(x, y) for i, x in enumerate(src) for y in src[i:]]

answered Mar 11, 2010 at 17:32

medina

261 silver badge4 bronze badges

Collectives™ on Stack Overflow

Compare DB row values efficiently

4 Answers 4

4 Comments

Comments

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related