12

I would like to know if there is a library that will tell me approximately how similar two strings are

I am not looking for anything specific, but in this case:

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

we could say that b and a are approximately 90% similar.

Is there a library which can do this?

1

4 Answers 4

23
import difflib

>>> a = 'alex is a buff dude'
>>> b = 'a;exx is a buff dud'
>>> difflib.SequenceMatcher(None, a, b).ratio()

0.89473684210526316
Sign up to request clarification or add additional context in comments.

Comments

8

http://en.wikipedia.org/wiki/Levenshtein_distance

There are a few libraries on pypi, but be aware that this is expensive, especially for longer strings.

You may also want to check out python's difflib: http://docs.python.org/library/difflib.html

2 Comments

expensive? difflib is a monster compared to semi-decent Levenshtein implementations.
It wasn't my intention to suggest that difflib is less expensive -- it just does a similar, albeit a little different, thing.
6

Look for Levenshtein algorithm for comparing strings. Here's a random implementation found via google: http://hetland.org/coding/python/levenshtein.py

Comments

1

Other way is to use longest common substring. Here a implementation in Daniweb with my lcs implementation (this is also defined in difflib)

Here is simple length only version with list as data structure:

def longest_common_sequence(a,b):

    n1=len(a)
    n2=len(b)

    previous=[]
    for i in range(n2):
        previous.append(0)

    over = 0
    for ch1 in a:
        left = corner = 0
        for ch2 in b:
            over = previous.pop(0)
            if ch1 == ch2:
                this = corner + 1
            else:
                this = over if over >= left else left
            previous.append(this)
            left, corner = this, over
    return 200.0*previous.pop()/(n1+n2)

Here is my second version which actualy gives the common string with deque data structure (also with the example data use case):

from collections import deque

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

def lcs_tuple(a,b):

    n1=len(a)
    n2=len(b)

    previous=deque()
    for i in range(n2):
        previous.append((0,''))

    over = (0,'')
    for i in range(n1):
        left = corner = (0,'')
        for j in range(n2):
            over = previous.popleft()
            if a[i] == b[j]:
                this = corner[0] + 1, corner[1]+a[i]
            else:
                this = max(over,left)
            previous.append(this)
            left, corner = this, over
    return 200.0*this[0]/(n1+n2),this[1]
print lcs_tuple(a,b)

""" Output:
(89.47368421052632, 'aex is a buff dud')
"""

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.