Python string comparison similarity

Question

I am trying to compare two lists of data that contain free text denoting the same objects. For example:

List 1 ['abc LLC','xyz, LLC']
List 2 ['abc , LLC','xyz LLC']

This is a simple example, but in practice there can be many variations, such as changes in case or an added "." in between. Is there a Python package that can do the comparison and give a measure of similarity?

@OliCharlesworth I think the author wants to find a percentage of similarity between two strings, e.g. that the strings are 85% similar.
– bezmax, Commented Apr 4, 2012 at 7:52
You don't want "probability", you want "similarity". stackoverflow.com/questions/682367/…
– Joe, Commented Apr 4, 2012 at 7:52
I believe you need to define your problem more precisely: what kind of similarity are you detecting? What is the mathematical definition of your similarity? Otherwise people can only guess what you want. Or maybe that is actually your question: you want people to suggest a similarity definition (like Levenshtein distance)?
– HongboZhu, Commented Apr 4, 2012 at 9:31

AKX · Accepted Answer · 2012-04-04 07:54:38Z

7

You could use an implementation of the Levenshtein distance algorithm for inexact string matching, for instance the implementation on Wikibooks.
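As a rough illustration of the idea, here is a minimal pure-Python sketch of Levenshtein distance. It is not the Wikibooks code, and the `similarity` helper (which scales the distance by the longer string's length) is a hypothetical convenience, one of several reasonable ways to turn a distance into a score.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: scale the distance to a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With this sketch, `levenshtein('abc LLC', 'abc , LLC')` is 2, since two characters have to be inserted.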

Another option is to normalize the strings before a raw comparison, for instance by folding everything to lower case and removing spaces and punctuation. Whether this is appropriate depends on your use case:

import string, unicodedata
allowed = string.ascii_letters + string.digits
def fold(s):
  # Decompose accented characters, lower-case, and drop anything non-ASCII
  s = unicodedata.normalize("NFKD", s.lower()).encode("ascii", "ignore").decode("ascii")
  # Keep only letters and digits (removes spaces and punctuation)
  s = "".join(c for c in s if c in allowed)
  return s

for example in ['abc LLC','xyz, LLC', 'abc , LLC','xyz LLC']:
  print("%r -> %r" % (example, fold(example)))

would print

'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'
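To connect this back to the two lists in the question, one possible sketch is to index one list by its folded key and look up the entries of the other. A Python 3 version of `fold` is repeated here so the snippet is self-contained; the names `by_key` and `matches` are only illustrative.

```python
import string
import unicodedata

ALLOWED = set(string.ascii_letters + string.digits)

def fold(s):
    # Lower-case, strip accents and non-ASCII, keep only letters and digits
    s = unicodedata.normalize("NFKD", s.lower())
    s = s.encode("ascii", "ignore").decode("ascii")
    return "".join(c for c in s if c in ALLOWED)

list1 = ['abc LLC', 'xyz, LLC']
list2 = ['abc , LLC', 'xyz LLC']

# Index list2 by folded key, then look up each entry of list1.
by_key = {fold(name): name for name in list2}
matches = {name: by_key.get(fold(name)) for name in list1}

for name, match in matches.items():
    print("%r matches %r" % (name, match))
```

Entries with no counterpart map to `None`. Note that two different names in `list2` that fold to the same key would overwrite each other in `by_key`.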


Not_a_Golfer · Accepted Answer · 2012-04-04 08:21:34Z

3

There's an excellent binary library that uses Levenshtein distance (edit distance) between strings to estimate similarity. Give it a try:

https://github.com/miohtama/python-Levenshtein
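If installing a compiled extension is not an option, the standard library's `difflib.SequenceMatcher` also produces a similarity score between 0 and 1. This is a different technique from Levenshtein distance (it is based on matching blocks, not edit operations), so the numbers will not agree with the library above. The `ratio` wrapper below, including its lower-casing step, is an assumption made for this sketch.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Case-insensitive similarity in [0, 1]; 1.0 means identical after lower-casing
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for left, right in [('abc LLC', 'abc , LLC'), ('abc LLC', 'xyz LLC')]:
    print("%r vs %r: %.3f" % (left, right, ratio(left, right)))
```

A threshold on this score would have to be chosen empirically for your data.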
