0

As a novel approach to solving my challenge described here, I have put together the following:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

The idea is to loop over the strings and compare each with the others. If a given string is deemed similar to another, the other is removed; otherwise, the given string is deemed unique in the list.

Here's the output:

"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence


"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence

So it's correctly detecting that the first two sentences are similar, and that the last is unique. The problem is it's then going back and deeming the first sentence to be unique (which it isn't, and it should not be returning to this sentence anyway).

Where's the flaw in my looping logic? Can this be achieved without nested fors and removal of elements?

  • DO NOT modify list while iterating over it. Commented Feb 19, 2016 at 21:41
  • @spicavigo Right. That much is obvious. Hence the question. Commented Feb 19, 2016 at 21:42
  • You can't delete items from diffs while you're still iterating over it; it will screw up the iteration. Instead, accumulate a list of diffs to delete and delete them at the end. Also, you will likely speed up your code by using itertools.combinations instead of a nested for loop. Commented Feb 19, 2016 at 21:42
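
The suggestion in the last comment could be sketched roughly as follows, written for Python 3 (the code in this thread is Python 2). The function name split_similar_and_unique and its threshold parameter are made up for illustration; the 0.7 cutoff is the one used in the question. Instead of deleting from diffs, the sketch records which indices took part in a match and derives the unmatched ones at the end.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def split_similar_and_unique(diffs, threshold=0.7):
    """Return (pairs, unique) as index lists; diffs itself is never modified.

    Hypothetical helper: the name and return shape are assumptions.
    """
    matched = set()   # indices that were similar to at least one other string
    pairs = []        # (i, j) index pairs judged to be the same sentence
    # combinations() yields each unordered pair exactly once, with i < j
    for (i, a), (j, b) in combinations(enumerate(diffs), 2):
        if similar(a, b) > threshold:
            pairs.append((i, j))
            matched.update((i, j))
    # anything that never matched is treated as a new sentence
    unique = [i for i in range(len(diffs)) if i not in matched]
    return pairs, unique
```

Because every pair is compared, a string is only reported as unique once it has failed to match all of the others, at the cost of a quadratic number of comparisons.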

3 Answers

1
from difflib import SequenceMatcher
from collections import defaultdict

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]


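sims = set()                 # indices already matched to an earlier string
simdict = defaultdict(list)  # index -> indices of later strings similar to it
for i in range(len(diffs)):
    if i in sims:
        # already reported as similar to an earlier string, so skip it
        continue
    s = diffs[i]

    # compare only against later strings so each pair is checked once
    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)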


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)

4 Comments

Thanks. remaining = diff[:] should read remaining = diffs[:] though. And even with this change, the output suggests the logic isn't doing what it's trying to do: pastebin.com/xQSRjEV5
It should have been if not flag
By making a copy of diffs as you have, I think maintaining .remove() is OK. But you still have the typo (diff / diffs), and your code still doesn't work when that typo is fixed.
Your new answer detects the similar sentences, but does nothing about flagging unique sentences. Both are required. And I'm not sure it's sufficient to infer by process of elimination that a unique sentence is any sentence that doesn't get placed into simdict.
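
As a rough sketch of the process-of-elimination idea questioned in the last comment (Python 3; group_diffs and its threshold parameter are hypothetical names, not part of the answer), the unique strings can be read off as the indices that neither head a group nor were attached to one. One caveat, in line with the comment's doubt: a string that has already been attached to an earlier one is never compared with later strings, so elimination is only reliable if similarity behaves transitively for the data at hand.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def group_diffs(diffs, threshold=0.7):
    """Return (groups, unique); both hold indices into diffs."""
    matched = set()             # indices attached to an earlier string
    groups = defaultdict(list)  # representative index -> similar later indices
    for i, s in enumerate(diffs):
        if i in matched:
            continue  # already belongs to an earlier group
        for j in range(i + 1, len(diffs)):
            # assumption: each string joins at most one group
            if j not in matched and similar(s, diffs[j]) > threshold:
                matched.add(j)
                groups[i].append(j)
    # unique = never attached to a group and never the head of one
    unique = [i for i in range(len(diffs))
              if i not in matched and i not in groups]
    return dict(groups), unique
```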
0

You can see exactly when it determines the first sentence is unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you to see where exactly your loop fails.
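
With that change, the trace suggests that the else branch runs once for every dissimilar pair, so the first sentence is reported as new as soon as it is compared with the third, even though it already matched the second. One way around this, sketched below for Python 3, is to decide only after a string has been compared with all of the others. The name classify and the labels "modified" and "new" are illustrative, not from the question.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def classify(diffs, threshold=0.7):
    """Return one (index, label) tuple per string without mutating diffs."""
    results = []
    for i, s in enumerate(diffs):
        # any() finishes the comparison against all other strings
        # (or stops at the first match) before a label is chosen
        has_match = any(
            similar(s, other) > threshold
            for j, other in enumerate(diffs)
            if j != i
        )
        results.append((i, "modified" if has_match else "new"))
    return results
```

This keeps the nested comparison but drops the removal of elements, so the iteration order is never disturbed.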

0

Since modified strings will always appear back-to-back (one preceded by '-', the other by '+'), the following can be done (and I believe it will work in all cases).

When the list has an odd number of elements, the last must necessarily be a new sentence.

def extract_modified_and_new(diffs):
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, ' and ', z2, 'are new'
            print
    if len(diffs) % 2 != 0:
        print diffs[-1], ' is new'
