Text similarity algorithm

Question

I have two subtitle files. I need a function that tells whether they represent the same text, or similar text.

Sometimes one file contains comments like "The wind is blowing... the music is playing" that the other lacks, but 80% of the content will be the same. In that case the function must return TRUE (the files represent the same text). Sometimes there are also misspellings such as 1 instead of l (the digit one for the letter L), as in: She 1eft the baggage. Of course, the function must still return TRUE.

My comments:
The function should return percentage of the similarity of texts - AGREE

"all the people were happy" and "all the people were not happy" - here the difference would be treated as a misspelling, so the two would count as the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein on a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.

The function should return percentage of the similarity of texts and you decide the threshold for TRUE or FALSE.
– YOU, Commented Feb 24, 2010 at 11:37
You're going to need to be very thoughtful about your similarity criteria and I think this may be the toughest part of what you are trying to do. For example "all the people were happy" and "all the people were not happy" are similar textually but entirely opposite in terms of meaning. Some examples of similar and dissimilar text may be helpful.
– glenatron, Commented Feb 24, 2010 at 11:46
Check out Soundex (en.wikipedia.org/wiki/Soundex) and see if that's something you're looking for.
– Buhake Sindi, Commented Feb 24, 2010 at 11:59
Do consider whether you want to apply Levenshtein on a whole file or just a search string
– bcosca, Commented Feb 24, 2010 at 12:03

bcosca · Accepted Answer · 2010-02-24 11:42:51Z

13

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

A result of zero means the texts are identical; anything else means they are not. "Similar" is a measure of how near or far apart they are: the result is an integer counting the single-character edits needed to turn one text into the other.
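To make the integer result concrete, here is a minimal Python sketch of the distance using the standard two-row dynamic programming approach. The `similarity` helper is a hypothetical addition, not part of the answer: it assumes you want to scale the distance by the longer text's length to get a value between 0 and 1, and any TRUE/FALSE threshold is left to you.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical texts, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this sketch, "She left the baggage" and "She 1eft the baggage" are one edit apart, and the two "all the people were (not) happy" phrases are four edits apart. The running time grows with the product of the two lengths, so whole files will be slow.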

answered Feb 24, 2010 at 11:42

bcosca

4 Comments

Adamski Over a year ago

+1: The integer result would need to be normalised to determine the similarity of the whole file, e.g. normalised distance = Levenshtein distance / number of characters, where a lower value means more similar. I would also suggest preprocessing the file to correct spelling mistakes before applying this algorithm.

Fabian Steeg Over a year ago

There is an implementation of the Levenshtein distance in Apache Commons StringUtils: commons.apache.org/lang/api-2.4/org/apache/commons/lang/…, java.lang.String)

soulmerge Over a year ago

@Fabian: It is a builtin function in PHP: php.net/manual/en/function.levenshtein.php

Yonatan Over a year ago

Levenshtein distance is not practical for long strings. Using the StringUtils implementation, for instance, would take a few minutes per file if each file is ~300 KB.

Community · Accepted Answer · 2017-05-23 12:25:06Z

6

For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.

You might want to look at several implementations that are described here: Cosine Similarity
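As a rough illustration of the idea, the following Python sketch builds term frequency vectors with `collections.Counter` and computes their cosine. The tokenisation (lowercased `\w+` words) is an assumption made for this example; a real subtitle comparison might also want to strip timestamps or normalise punctuation first.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens; the tokenisation rule is an assumption."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no words
    return dot / (norm_a * norm_b)
```

Because only word counts are compared, word order is ignored and the cost is roughly linear in the text length. A misspelled word such as "1eft" counts as a different term from "left", so a file full of such typos will score lower.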

edited May 23, 2017 at 12:25

CommunityBot

answered Nov 6, 2011 at 14:06

Yonatan

Dominic Rodger · Accepted Answer · 2010-02-24 11:40:50Z

2

You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to provide good results for your input.
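One way to experiment with a diff-style comparison without leaving your own code is Python's standard `difflib` module. This is only a sketch of the general idea, not the diff tool itself: the function name `line_similarity` and the choice to compare stripped, non-blank lines are assumptions for illustration.

```python
import difflib


def line_similarity(text_a: str, text_b: str) -> float:
    """Return difflib's ratio (0.0 to 1.0) over the non-blank lines of each text."""
    lines_a = [line.strip() for line in text_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in text_b.splitlines() if line.strip()]
    # autojunk=False disables difflib's heuristic that ignores very frequent items.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()
```

Comparing whole lines keeps the sequences short, so extra comment lines in one file only lower the ratio slightly. The trade-off is that a line with a single typo counts as a completely different line; a finer comparison would need a per-line fuzzy match on top of this.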

edited Feb 24, 2010 at 11:40

Dominic Rodger

answered Feb 24, 2010 at 11:37

soulmerge

3 Comments

Chii Over a year ago

Or render the text with a known font size (and face), and then compare pixels. That way, symbols with similar shapes will look similar, and it's easier to detect that.

Jens Schauder Over a year ago

@Chii But one larger symbol shifting the rest of the page would throw everything off.

bcosca Over a year ago

I don't think the question has anything to do with OCR, but just plain text

Chinmay Kanchi · Accepted Answer · 2010-02-24 11:50:18Z

2

Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.

EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep

edited Feb 24, 2010 at 11:50

answered Feb 24, 2010 at 11:36

Chinmay Kanchi

Community · Accepted Answer · 2017-05-23 12:25:06Z

1

There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.

The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and many other factors.

Here you find a helpful implementation of several algorithms within one library
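For reference, a compact Python sketch of Jaro-Winkler similarity follows. It assumes the commonly used parameters (a common-prefix bonus over at most 4 characters, with a scaling factor of 0.1) and is not taken from the library mentioned above. Note that this measure is usually applied to short strings such as names or single words, so for subtitle files it would more plausibly be applied per word or per line than to a whole file.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if equal and no further apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```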

edited May 23, 2017 at 12:25

CommunityBot

answered May 20, 2014 at 6:32

Philipp

balu datascience · Accepted Answer · 2022-03-23 02:04:52Z

0

If you are still looking for a solution, try S-BERT (Sentence-BERT), a lightweight model whose sentence embeddings are compared using cosine similarity.

answered Mar 23, 2022 at 2:04

balu datascience

2 Comments

charlie-map Over a year ago

Along with this answer, adding additional supporting information will help others confirm that your answer is correct. Could you provide citations or documentation about how the similarity algorithm works? You also mentioned cosine similarity, which you may also want to cite. You can find more information on how to write good answers in the help center.

balu datascience Over a year ago

Here is a question that gives more details: stackoverflow.com/questions/57882417/…
