Nested for-loop element-wise list comparison

Question

As a novel approach to solving my challenge described here, I have put together the following:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

The idea is to loop over the strings and compare each with the others. If a given string is deemed to be similar to another, remove the other; otherwise the given string is deemed to be a unique string in the list.

Here's the output:

"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence


"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence

So it's correctly detecting that the first two sentences are similar, and that the last is unique. The problem is it's then going back and deeming the first sentence to be unique (which it isn't, and it should not be returning to this sentence anyway).

Where's the flaw in my looping logic? Can this be achieved without nested fors and removal of elements?

You can't delete items from diffs while you're still iterating over it; doing so disrupts the iteration. Instead, accumulate a list of diffs to delete and delete them at the end. Also, you will likely speed up your code by using itertools.combinations instead of a nested for loop.
– BrenBarn, Commented Feb 19, 2016 at 21:42
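
The suggestion in this comment could look roughly like the sketch below. It is not code from the commenter: the helper name split_similar, the threshold parameter, and the shortened sample strings are assumptions made for illustration, and it is written to run on Python 3. Matches are recorded by index in a set, and the list is only filtered after all pairs have been visited.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def split_similar(diffs, threshold=0.7):
    """Return (similar_pairs, survivors) without mutating diffs.

    split_similar and threshold are hypothetical names for this sketch.
    """
    to_drop = set()   # indices to leave out once iteration is finished
    pairs = []
    # combinations() yields each unordered pair exactly once, with i < j.
    for (i, s), (j, t) in combinations(enumerate(diffs), 2):
        if i in to_drop or j in to_drop:
            continue  # one side was already matched to an earlier string
        if similar(s, t) > threshold:
            pairs.append((s, t))
            to_drop.add(j)
    survivors = [s for k, s in enumerate(diffs) if k not in to_drop]
    return pairs, survivors


# Shortened stand-ins for the strings in the question (assumed data).
diffs = [
    "- The offset ends for disability beneficiaries at age 65.",
    "+ The offset ends for disability beneficiaries at age 68.",
    "+ Here's a new paragraph I added for testing.",
]

pairs, survivors = split_similar(diffs)
for s, t in pairs:
    print('"{}" and "{}" refer to the same sentence'.format(s, t))
for s in survivors:
    print('kept: "{}"'.format(s))
```

Because each pair is visited once and the original list is never modified inside the loop, no element is skipped or compared twice.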

spicavigo · Accepted Answer · 2016-02-19 22:20:32Z

1

from difflib import SequenceMatcher
from collections import defaultdict

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]


sims = set()
simdict = defaultdict(list)
for i in range(len(diffs)):
    if i in sims:
        continue
    s = diffs[i]

    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)

edited Feb 19, 2016 at 22:20

answered Feb 19, 2016 at 21:44

spicavigo


4 Comments

Pyderman Over a year ago

Thanks. remaining = diff[:] should read remaining = diffs[:] though. And even with this change, the output suggests the logic isn't doing what it's trying to do: pastebin.com/xQSRjEV5

spicavigo Over a year ago

It should have been if not flag

Pyderman Over a year ago

By making a copy of diffs as you have, I think maintaining .remove() is OK. But you still have the typo (diff / diffs), and your code still doesn't work when that typo is fixed.

Pyderman Over a year ago

Your new answer detects the similar sentences, but does nothing about flagging unique sentences. Both are required. And I'm not sure it's sufficient to infer by process of elimination that a unique sentence is any sentence that doesn't get placed into simdict.

charfellow · Accepted Answer · 2016-02-19 21:59:22Z

0

You can see exactly when it determines the first sentence is unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you to see where exactly your loop fails.
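
With that change the output shows that the else branch fires once per pair, so a string is reported as new as soon as it differs from any single other string. One possible way around this, sketched below under assumptions (the helper name classify, the threshold parameter, and the shortened sample strings are not from the question; Python 3 syntax), is to compare each string against all the others first and only then decide whether it is new.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def classify(diffs, threshold=0.7):
    """Map each index to the indices of the entries similar to it.

    An empty list means the entry matched nothing, i.e. it is new.
    classify and threshold are hypothetical names for this sketch.
    """
    result = {}
    for i, s in enumerate(diffs):
        # Collect every match before drawing any conclusion about s.
        result[i] = [j for j, t in enumerate(diffs)
                     if j != i and similar(s, t) > threshold]
    return result


# Shortened stand-ins for the strings in the question (assumed data).
diffs = [
    "- The offset ends for disability beneficiaries at age 65.",
    "+ The offset ends for disability beneficiaries at age 68.",
    "+ Here's a new paragraph I added for testing.",
]

for i, matches in classify(diffs).items():
    if matches:
        print('"{}" is similar to entries {}'.format(diffs[i], matches))
    else:
        print('"{}" is a new sentence'.format(diffs[i]))
```

This compares every ordered pair, so it does more work than strictly necessary, but it never removes anything from the list and the verdict for each string is made only once.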

answered Feb 19, 2016 at 21:59

charfellow


Pyderman · Accepted Answer · 2016-02-20 04:22:13Z

0

Since modified strings will always appear back-to-back (one preceded by '-', the other by '+'), the following can be done (and I believe it will work in all cases).

When the list has an odd number of elements, the last must necessarily be a new sentence.

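def extract_modified_and_new(diffs):
    # Relies on similar() as defined in the question.
    # Walk the list in consecutive pairs: (0, 1), (2, 3), ...
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, 'and', z2, 'are new'
            print
    # With an odd number of elements, the last one has no partner.
    if len(diffs) % 2 != 0:
        print diffs[-1], 'is new'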

answered Feb 20, 2016 at 4:22

Pyderman
