String similarity in Python

Question

I'm trying to do a comparison of strings in Python. My strings contain titles which can be structured a number of different ways:

'Title'
'Title: Subtitle'
'Title - Subtitle'
'Title, Subtitle'
'Title Subtitle'

Is it possible to do similarity comparison in Python so that it can determine that match('Title: Subtitle', 'Title - Subtitle') = True? (or however it would be constructed)

Basically I'm trying to determine if they're the same title even if the splitting is different.

if 'Title: Subtitle' == 'Title - Subtitle':
    match = 'True'
else:
    match = 'False'

There are also some that might be stored as The Title: The Subtitle or Title, The: Subtitle, The although I think that may add a bit of complexity I could probably get around by reconstructing the string.

Why not just remove all punctuation and compare then? See stackoverflow.com/questions/265960/… — Akshat Mahajan
– Akshat Mahajan, Commented Mar 27, 2016 at 21:33
@Liongold that had occurred to me while I was typing up the question, thanks I'll take a look at the link — Midavalo
– Midavalo, Commented Mar 27, 2016 at 21:36
So even The Title: The Subtitle and Title, The: Subtitle, The should be considered equal as well? — Iron Fist
– Iron Fist, Commented Mar 27, 2016 at 21:38
@IronFist yes, although I think dealing with the The may be better than doing a comparison on them as-is — Midavalo
– Midavalo, Commented Mar 27, 2016 at 22:54

DevShark · Accepted Answer · 2016-03-27 21:45:31Z

17

What you're trying to do has already been implemented very well in the jellyfish package.

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2

answered Mar 27, 2016 at 21:45

DevShark

9,18810 gold badges36 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Midavalo Over a year ago

Thank you DevShark, I think this coupled with parts of the other answers will get me close to where I was hoping.

DevShark Over a year ago

Great, glad it helps.

Todd Owen · Accepted Answer · 2016-09-26 12:42:33Z

12

The standard library's difflib module provides a function get_close_matches which does fuzzy string matching.

>>> import difflib
>>> difflib.get_close_matches('python', ['snakes', 'thon.py', 'pythin'])
['pythin', 'thon.py']  # ordered by similarity score

answered Sep 26, 2016 at 12:42

Todd Owen

16.3k8 gold badges57 silver badges54 bronze badges

1 Comment

wordsforthewise Over a year ago

difflib is using Ratcliff/Obershelp xlinux.nist.gov/dads/HTML/ratcliffObershelp.html which may not be as good as Levenshtein in some cases

xdola · Accepted Answer · 2016-03-27 21:31:06Z

4

You can use in keyword. It isn't a similarity comparison, but does what you want:

s = "Title: Subtitle"

if "Title" in s or "Subtitle" in s:
    match = 'True'
else:
    match = 'False'

answered Mar 27, 2016 at 21:31

xdola

5742 gold badges8 silver badges26 bronze badges

Comments

A.J. Uppal · Accepted Answer · 2016-03-27 21:34:42Z

2

Try replacing the characters and then checking the equality:

def match(str1, str2):
    str1 = str1.replace(' -', '').replace(',', '').replace(':', '')
    str2 = str2.replace(' -', '').replace(',', '').replace(':', '')
    return str1 == str2

>>> match('Title: Subtitle', 'Title - Subtitle')
True
>>> match('Title: Subtitle', 'Title, Subtitle')
True
>>>

answered Mar 27, 2016 at 21:34

A.J. Uppal

19.3k7 gold badges47 silver badges82 bronze badges

Comments

alexis · Accepted Answer · 2016-03-27 22:23:02Z

2

If the only obstacle is punctuation, the problem is trivial: Just discard non-word characters and compare the remaining lists of words.

s1 = 'Title - Subtitle'
toks1 = re.split(r"^\W+", s1)  # keep just the words
toks1 = [ w.lower() for w in toks1 ]

I threw in lowercasing since that could differ too. Apply the same to each input and compare the lists.

But as you point out, there can be other differences. If your data really consists of titles (books, movies, scientific articles), you can start by removing articles and common connectives (so-called "stopwords"), like libraries do. E.g., "The title of the article" gets stripped down to ["title", "article"]. To deal with other possible differences in word order, you could use the so-called "bag of words" approach, common in information retrieval. Convert the list of tokens to a set, or to a dictionary of word counts for cases where some words occur multiple times. Here's an example, using word counts and the nltk's "stopword" list as a filter.

import nltk
from collections import Counter
stopwords = set(nltk.corpus.stopwords.words("english"))

toks1 = [ t for t in toks1 if t not in stopwords ]
cnt1 = Counter(toks1)
cnt2 = Counter(toks2)  # Another title string, processed the same way
if cnt1 == cnt2:
    print("The two strings have exactly the same content words")

If there's still more variation, the sky is the limit. Approximate text matching is a topic of active research with applications in information retrieval, plagiarism detection, genetics, etc. You could check if one title is a subset of the other (maybe someone left out the subtitle). You could try matching by "edit distance" (e.g. the "Levenshtein distance" mentioned by a couple of other answers), applying it either to letters or to whole words. You could try information retrieval algorithms like TF-IDF score. These are just a few of the things you could try, so look for the simplest solution that will do the job for you. Google is your friend.

answered Mar 27, 2016 at 22:23

alexis

50.4k18 gold badges107 silver badges172 bronze badges

2 Comments

Midavalo Over a year ago

Very informative answer, thank you. Obviously this is a lot more complex that I first imagined.

alexis Over a year ago

It is complex, but if you think Levenshtein distance is a good fit for your task you can just leave it at that...

peter · Accepted Answer · 2016-03-27 21:30:35Z

1

I'm a Ruby programmer so no experience with Python, but in Ruby such a problem would quickly be solved by using the gem Levensthein. It calculates the number of edits you need to make at a string to get to the same string.

I see there's a Python equivalent also, so take a look at https://pypi.python.org/pypi/python-Levenshtein

answered Mar 27, 2016 at 21:30

peter

42.3k7 gold badges67 silver badges110 bronze badges

Comments

Ethan Peters · Accepted Answer · 2016-03-28 01:28:52Z

1

This should work. Python translate can be used to take out any different characters.

titles = ['Title: Sub', 'Title Sub', 'Title - Sub']
s = ': -'

if titles[1].translate(None, s) == titles[2].translate(None, s):
    match = 'True'
else 
    match = 'False'

edited Mar 28, 2016 at 1:28

answered Mar 27, 2016 at 21:45

Ethan Peters

543 bronze badges

Comments

Iron Fist · Accepted Answer · 2016-03-27 22:35:40Z

0

fnmatch.fnmatch can also be handy here though designed for Unix Filename matching, consider the following examples:

>>> from fnmatch import fnmatch
>>> l
['Title: Subtitle', 'Title - Subtitle', 'Title, Subtitle', 'Title Subtitle']
>>>
>>> all(fnmatch(x, 'Title*Subtitle') for x in l)
True

Another method, would be checking if they all match an re pattern:

>>> import re
>>> l
['Title: Subtitle', 'Title - Subtitle', 'Title, Subtitle', 'Title Subtitle']
>>> all(re.search(r'^Title.*?Subtitle$', x) for x in l)
True

edited Mar 27, 2016 at 22:35

answered Mar 27, 2016 at 22:12

Iron Fist

11k2 gold badges21 silver badges36 bronze badges

2 Comments

alexis Over a year ago

Google "regular expression" and you'll find much better ways to implement this approach.

Iron Fist Over a year ago

@alexis...Of course I know about re solution but I'm just trying to leave it as last resort to not complicate things...anyway...in my intention I was working on an alternative way using re which I already post it

Collectives™ on Stack Overflow

String similarity in Python

8 Answers 8

2 Comments

1 Comment

Comments

Comments

2 Comments

Comments

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

8 Answers 8

2 Comments

1 Comment

Comments

Comments

2 Comments

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related