python: comparing two strings

Question

I would like to know if there is a library that will tell me approximately how similar two strings are.

I am not looking for anything specific, but in this case:

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

we could say that b and a are approximately 90% similar.

Is there a library which can do this?

possible duplicate of Text difference algorithm

– tzot, Commented Sep 20, 2010 at 14:09

killown · Accepted Answer · 2010-08-23 21:12:32Z

23

>>> import difflib
>>> a = 'alex is a buff dude'
>>> b = 'a;exx is a buff dud'
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.89473684210526316
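For context, the difflib documentation describes `ratio()` as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. Here is a minimal sketch that recomputes the value from `get_matching_blocks()`. The helper name `ratio_from_blocks` is made up for illustration and is not part of difflib.

```python
import difflib

def ratio_from_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each block is (a_start, b_start, size); the last one is a size-0 sentinel.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
print(ratio_from_blocks(a, b))
print(difflib.SequenceMatcher(None, a, b).ratio())
```

For the example strings, 17 characters match out of 38 in total, which is where the figure of roughly 0.89 comes from.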

edited Aug 23, 2010 at 21:12

answered Aug 23, 2010 at 21:06


Radomir Dopieralski · Accepted Answer · 2010-08-23 20:35:39Z

8

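See Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance

There are a few libraries on PyPI, but be aware that computing it is expensive, especially for longer strings: the standard dynamic-programming algorithm takes time proportional to the product of the two string lengths.

You may also want to check out Python's difflib: http://docs.python.org/library/difflib.html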
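To make the cost concrete, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance. It is not taken from any of the PyPI libraries mentioned. The names `levenshtein` and `similarity`, and the choice to normalize by the length of the longer string, are assumptions made for illustration. The nested loops are what make it O(len(a) * len(b)).

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / float(longest)

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'
print(levenshtein(a, b))  # 3 edits
print(similarity(a, b))
```

A compiled implementation would likely be much faster on long strings; this version only keeps two rows in memory at a time.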

answered Aug 23, 2010 at 20:35


2 Comments

John Machin Over a year ago

expensive? difflib is a monster compared to semi-decent Levenshtein implementations.

Radomir Dopieralski Over a year ago

It wasn't my intention to suggest that difflib is less expensive -- it just does a similar, albeit a little different, thing.

viraptor · Accepted Answer · 2010-08-23 20:34:24Z

6

Look into the Levenshtein algorithm for comparing strings. Here's a random implementation found via Google: http://hetland.org/coding/python/levenshtein.py

answered Aug 23, 2010 at 20:34


Tony Veijalainen · Accepted Answer · 2010-08-24 07:23:46Z

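Another way is to use the longest common subsequence (LCS): the functions below match characters in order, but not necessarily contiguously. Here is an implementation on Daniweb with my LCS implementation (similar matching is also available in difflib).

Here is a simple length-only version that uses a list as the data structure: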

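def longest_common_sequence(a, b):
    """Return the LCS-based similarity of a and b as a percentage."""
    n1 = len(a)
    n2 = len(b)

    # previous holds one row of LCS lengths, one entry per character of b
    previous = [0] * n2

    over = 0
    for ch1 in a:
        left = corner = 0
        for ch2 in b:
            # over: cell above; left: cell to the left; corner: diagonal cell
            over = previous.pop(0)
            if ch1 == ch2:
                this = corner + 1
            else:
                this = max(over, left)
            previous.append(this)
            left, corner = this, over
    # 2 * LCS length / combined length, scaled to a percentage
    return 200.0 * previous.pop() / (n1 + n2)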

Here is my second version, which actually returns the common string and uses a deque as the data structure (also with the example data as a use case):

from collections import deque

a = 'alex is a buff dude'
b = 'a;exx is a buff dud'

def lcs_tuple(a, b):
    """Return (similarity percentage, longest common subsequence)."""
    n1 = len(a)
    n2 = len(b)

    # Each cell is (LCS length, LCS string); a deque pops from the left in O(1)
    previous = deque((0, '') for _ in range(n2))

    over = (0, '')
    for i in range(n1):
        left = corner = (0, '')
        for j in range(n2):
            over = previous.popleft()
            if a[i] == b[j]:
                this = corner[0] + 1, corner[1] + a[i]
            else:
                # tuples compare by length first, then by string
                this = max(over, left)
            previous.append(this)
            left, corner = this, over
    return 200.0 * this[0] / (n1 + n2), this[1]

print(lcs_tuple(a, b))

""" Output:
(89.47368421052632, 'aex is a buff dud')
"""
