I am trying out the difflib library. I have two lists of strings, L_1 and L_2, and I want to know whether they are similar (order is not important).

L_1 = ["Bob", "Mary", "Hans"]
L_2 = ["Bob", "Marie", "Háns"]

should be ok. But

L_1 = ["Nirdosch", "Mary", "Rolf"]
L_2 = ["Bob", "Marie", "Háns"]

should not be ok.

I came up with the idea of iterating over the first list L_1 and matching every element of L_1 with the method

difflib.get_close_matches()

against the second list L_2. If there is a match with a ratio greater than, say, 0.7, remove it from L_2 and continue. But I doubt it is a good plan. Is there a better one?
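For reference, the plan described above could be sketched roughly like this. It is a minimal sketch assuming Python 3; the helper name `lists_similar` and the rule that every element must find a distinct partner are assumptions made for illustration, and 0.7 is the threshold mentioned above.

```python
import difflib


def lists_similar(l_1, l_2, cutoff=0.7):
    """Hypothetical helper: True if every item of l_1 pairs with a
    distinct close match in l_2 (greedy, order-independent)."""
    if len(l_1) != len(l_2):
        return False
    remaining = list(l_2)  # copy so the caller's list is not modified
    for word in l_1:
        # n=1 asks for the single best candidate at or above the cutoff
        matches = difflib.get_close_matches(word, remaining, n=1, cutoff=cutoff)
        if not matches:
            return False
        remaining.remove(matches[0])  # each candidate may be used only once
    return True


good_1, good_2 = ["Bob", "Mary", "Hans"], ["Bob", "Marie", "Háns"]
bad_1, bad_2 = ["Nirdosch", "Mary", "Rolf"], ["Bob", "Marie", "Háns"]

print(lists_similar(good_1, good_2, cutoff=0.6))
print(lists_similar(good_1, good_2, cutoff=0.7))
print(lists_similar(bad_1, bad_2, cutoff=0.6))
```

One thing this sketch makes visible: "Mary" and "Marie" score about 0.67 with difflib, so a threshold of 0.7 rejects the first example, while the default cutoff of 0.6 accepts it. The matching is also greedy, so an early pairing can in principle take a candidate that a later word needed.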

1 Answer

I would do something like:

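import difflib
import sys

L_1 = ["Bob", "Mary", "Hans"]
L_2 = ["Bob", "Marie", "Hans"]

def similarity(L_1, L_2):
    # Deduplicate; sys.intern is the Python 3 replacement for intern()
    L_1 = set(sys.intern(w) for w in L_1)
    L_2 = set(sys.intern(w) for w in L_2)
    if not L_1:
        return 0.0  # avoid dividing by zero on an empty first list

    # Exact matches need no fuzzy comparison, so only the leftovers are compared
    to_match = L_1.difference(L_2)
    against = L_2.difference(L_1)
    for w in to_match:
        # Results are sorted best first; the default cutoff is 0.6
        res = difflib.get_close_matches(w, against)
        if res:
            against.remove(res[0])
    # Number of L_2 entries that found a partner, relative to the size of L_1
    return (len(L_2) - len(against)) / len(L_1)

print(similarity(L_1, L_2))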
4 Comments

Nice solution - however I would in general implement my own comparison based on the Levenshtein distance.
I thought difflib uses Levenshtein under the hood
First: Thanks for your answer! I read up on intern(), but I didn't really get what it means. Would you be so kind and give me a hint?
If you create two strings with the same value, they can be different objects. With intern, the second time you create a string that has already been interned, you get the same string object back. Comparing two interned strings is then just a matter of comparing object addresses, which takes constant time no matter how long the strings are.
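A small illustration of that behaviour, as a sketch assuming Python 3, where the builtin lives at `sys.intern` (in Python 2 it was the plain builtin `intern`). The variable names are only for the example.

```python
import sys

# Build two equal strings at run time so they are not shared automatically
a = "".join(["Ma", "rie"])
b = "".join(["Mar", "ie"])

print(a == b)  # equal by value
print(a is b)  # identity is not guaranteed for strings built at run time

# Interning maps equal strings onto one shared object
a = sys.intern(a)
b = sys.intern(b)
print(a is b)  # both names now refer to the same object
```

Note that `==` on strings already takes the identity shortcut when both operands are the same object, so interning mainly helps when the same values are compared many times.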
