Efficient algorithm for comparing two lists

Question

I'm building a similarity matrix for a list of items. The naive approach is to iterate over the list twice, but this needlessly computes both A:B and B:A even though they are the same.

for A in items:
   for B in items:
      if A==B: continue
      sim[A][B] = calc_sim(A, B)

Is there a simple way to calculate only half of the values? I could put a skip in there like

if sim[B][A]: continue # already calculated in other direction

But the iteration still happens. Effectively, I just want to iterate through the top or bottom half of the grid.

There are some similar questions, but none with a canonical answer. This seems like a basic CS algorithm question!

The above is pseudo-code. I am writing Python in this case, but the question is more about a language-independent algorithm. I know I can use libraries like sklearn for NearestNeighbors, but I'm more interested in the raw algorithm for myself. Added a python tag anyway.
– dcsan, Commented Jan 2, 2021 at 22:33
@superbrain I think you could be right! Simplest is best, haha.
– dcsan, Commented Jan 3, 2021 at 6:08

abc · Accepted Answer · 2021-01-02 22:40:22Z

3

You could use itertools.combinations.

import itertools

for a, b in itertools.combinations(items, 2):
    sim[a][b] = sim[b][a] = calc_sim(a, b)
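
For reference, a self-contained sketch of the same idea. It assumes the items are hashable and that calc_sim is symmetric; the items list and the body of calc_sim below are made-up stand-ins, not part of the question.

```python
import itertools
from collections import defaultdict

def calc_sim(a, b):
    # Hypothetical symmetric similarity: number of distinct shared characters.
    return len(set(a) & set(b))

items = ["apple", "pear", "plum"]

# Nested dict so sim[a][b] can be assigned without pre-building rows.
sim = defaultdict(dict)

# combinations(items, 2) yields each unordered pair exactly once,
# so calc_sim runs n*(n-1)/2 times instead of n*(n-1).
for a, b in itertools.combinations(items, 2):
    sim[a][b] = sim[b][a] = calc_sim(a, b)
```

Because combinations never pairs an item with itself, the diagonal entries are simply absent here; fill them separately if they are needed.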

answered Jan 2, 2021 at 22:40

abc


1 Comment

dcsan Over a year ago

Oh, thanks for pointing this out. It's a bit Python-specific, but itertools is a great toolbox.

fdermishin · Accepted Answer · 2021-01-02 22:57:31Z

0

If you just need a general algorithm that reduces the number of iterations, you can limit the range of the inner loop:

for i, A in enumerate(items):
   for B in items[:i]:
      sim[A][B] = calc_sim(A, B)

But if you are looking for a Python-specific optimization, it would be much better to use NumPy vectorization. For example, if calc_sim(a, b) computes the squared difference between a and b, it can be vectorized in the following way:

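import numpy as np

values = [1, 2, 3]  # avoid shadowing the built-in `list`
arr = np.array(values)
# Broadcasting a column against a row yields every pairwise difference.
sim = np.square(arr[:, np.newaxis] - arr)
print(sim)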

[[0 1 4]
 [1 0 1]
 [4 1 0]]
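
If the comparison cannot be expressed with NumPy operations, one possible middle ground is to let NumPy generate only the upper-triangle index pairs and call the function on those. The sketch below assumes a symmetric calc_sim; its body is a hypothetical stand-in chosen so the result matches the matrix above.

```python
import numpy as np

def calc_sim(a, b):
    # Hypothetical symmetric function standing in for the real comparison.
    return (a - b) ** 2

values = [1, 2, 3]
n = len(values)
sim = np.zeros((n, n), dtype=int)

# Index pairs (i, j) with i < j: the strict upper triangle.
rows, cols = np.triu_indices(n, k=1)
for i, j in zip(rows, cols):
    sim[i, j] = calc_sim(values[i], values[j])

# Mirror the upper triangle into the lower one; the diagonal stays 0.
sim = sim + sim.T
```

The loop still runs in Python, so this halves the number of calc_sim calls but does not give the speedup of a fully vectorized expression.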

edited Jan 2, 2021 at 22:57

answered Jan 2, 2021 at 22:42

fdermishin


2 Comments

dcsan Over a year ago

It's actually an array of word vectors I'm using to do the comparison. Can np be made to take any comparison function and apply it to a grid in that way, or just np built-ins like np.square?

fdermishin Over a year ago

@dcsan It can be done if the vectors have the same length. I think that this question may help: stackoverflow.com/questions/35215161/… But if the vectors have different lengths, then NumPy is unlikely to be able to handle them unless they are preprocessed in some way.
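
As an illustration of the equal-length case, here is a minimal sketch that assumes cosine similarity is the comparison wanted (the thread does not say which function is used) and uses made-up vectors:

```python
import numpy as np

# Three made-up 4-dimensional "word vectors", one per row.
vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
])

# Scale each row to unit length, then a single matrix product gives
# every pairwise dot product, i.e. the cosine similarity matrix.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms
sim = unit @ unit.T
```

This computes the full symmetric matrix rather than half of it. The sketch also assumes no vector is all zeros, since that would cause a division by zero.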

mandulaj · Accepted Answer · 2021-01-02 23:01:05Z

0

Assuming calc_sim(A, B) == calc_sim(B, A), you could try this:

for A in range(0, len(items)):
   for B in range(A, len(items)): # Replace with A+1 if you don't want the case A == B
      # Remember A and B are indexes, so change code accordingly
      result = calc_sim(items[A], items[B])
      sim[A][B] = result # Copy result to both A,B and B,A as they are equal
      sim[B][A] = result

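However, both algorithms are still O(n²): the triangular loop makes n(n-1)/2 calls to calc_sim instead of n(n-1), which halves the work but does not change the order of growth.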
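
To make the complexity point concrete, this small sketch counts how many pairs each approach visits. The helper name count_calls is made up for illustration.

```python
import itertools

def count_calls(n):
    # Count how many times a pairwise function would be called
    # by the full double loop versus the triangular loop.
    items = range(n)
    full = sum(1 for a in items for b in items if a != b)
    half = sum(1 for _ in itertools.combinations(items, 2))
    return full, half

for n in (10, 100, 1000):
    full, half = count_calls(n)
    print(n, full, half)
```

The second count is always exactly half of the first, and both grow by roughly a factor of 100 when n grows by a factor of 10.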

edited Jan 2, 2021 at 23:01

answered Jan 2, 2021 at 22:39

mandulaj


4 Comments

trincot Over a year ago

Now calc_sim will get indices, not the actual values.

mandulaj Over a year ago

Yeah, I made a note about it, but then @abc has a better answer.... stackoverflow.com/a/65544844/2180316

dcsan Over a year ago

for B in range(A, ... nice trick, this will effectively cut off one half of the mirror image.

mandulaj Over a year ago

It's not very Pythonic; you should use the itertools method.
