Dictionary based string matching algorithm - Java

Question

Is there a dictionary-based string matching algorithm in Java?

Something that gives the percentage of similarity between two strings, based on a dictionary?

For example, a method like

 public double getSimilarity(String str1, String str2);

for which a call like:

 getSimilarity("Professor", "Teacher")

will give a very high percentage?

Thanks in advance :)

Relating a "Professor" to a "Teacher" is something that we humans can do quite easily. To a computer, these are just two different sequences of characters. You are going to have to do all the heavy lifting for the computer for a problem like this.
– Sanchit, Commented Jan 13, 2013 at 15:05
You may need an ontology for your topic, which can be analysed to get a measure of 'similarity'.
– MrSmith42, Commented Jan 13, 2013 at 15:07
@Sanchit Thank god for Artificial Intelligence, Natural Language Processing and statistical approaches. This problem is actually addressed in a lot of research, so don't give up too early. Though they are only "a sequence of characters", given the correct context you can learn a LOT about what each one means and how they relate to one another.
– amit, Commented Jan 13, 2013 at 15:30
Downvoters: please explain the downvote. I find the question very helpful, and it is very clear what the OP is asking.
– amit, Commented Jan 13, 2013 at 15:32

amit · Accepted Answer · 2013-01-13 15:27:35Z

There is great work by Shaul Markovitch and Evgeniy Gabrilovich, described in their article "Wikipedia-based Semantic Interpretation for Natural Language Processing".

The idea is as follows: index Wikipedia (or another context source), creating a mapping for each term (word): term -> articles in which the term appears.
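To make the mapping concrete, here is a minimal sketch of such an index in Java. The class name `TermIndex`, the use of list positions as document ids, and the naive tokenizer are all my own assumptions for illustration; a real Wikipedia index would need proper tokenization, stemming and storage.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermIndex {

    // Builds the mapping term -> ids of the documents in which the term appears.
    // The document id is simply the position in the list (an assumption).
    static Map<String, Set<Integer>> build(List<String> documents) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String text = documents.get(docId).toLowerCase(Locale.ROOT);
            // Naive tokenization: split on anything that is not a-z (an assumption).
            for (String token : text.split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "The professor teaches students",
                "A teacher teaches students",
                "Cats sleep");
        System.out.println(build(docs));
    }
}
```

Each value in the map is the set of documents for one term, which is exactly the information the term vectors below are built from.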

Each term is basically represented by a vector. For simplicity, let's say it is a binary vector, so for the term t the entry d will be '1' if and only if the term t appears in the document d.

Now, given these vectors, to find out whether two terms t1 and t2 are similar, all you have to do is take the vector similarity (for example, cosine similarity) of the two vectors that represent t1 and t2.
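For binary vectors this is easy to sketch. A binary vector can be stored as the set of document ids where its entry is '1', and the cosine similarity then becomes |A ∩ B| / sqrt(|A| · |B|). Cosine is my choice of vector similarity here, and the class name `BinaryTermSimilarity` is hypothetical; this is only an illustration of the simplified binary case, not the article's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class BinaryTermSimilarity {

    // Cosine similarity of two binary term vectors, each given as the set of
    // document ids whose entry is 1. The result is in [0, 1].
    static double cosine(Set<Integer> docsOfT1, Set<Integer> docsOfT2) {
        if (docsOfT1.isEmpty() || docsOfT2.isEmpty()) {
            return 0.0; // a term that appears nowhere is similar to nothing
        }
        // The dot product of two binary vectors is the size of the intersection.
        Set<Integer> shared = new HashSet<>(docsOfT1);
        shared.retainAll(docsOfT2);
        // The norm of a binary vector is sqrt(number of ones).
        return shared.size() / Math.sqrt((double) docsOfT1.size() * docsOfT2.size());
    }

    public static void main(String[] args) {
        // Made-up document ids, purely for illustration.
        Set<Integer> professor = new HashSet<>(Arrays.asList(0, 1, 2));
        Set<Integer> teacher = new HashSet<>(Arrays.asList(1, 2, 3));
        System.out.println(cosine(professor, teacher));
    }
}
```

A `getSimilarity(String str1, String str2)` like the one in the question could look up both terms in the index and pass their document sets to `cosine`; multiplying by 100 would give a percentage.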

Note: the binary vector is a simplification. In fact, the article uses the tf-idf score that the term t has in a document d.
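A weighted version can be sketched the same way, with sparse vectors stored as maps from document id to weight. The weighting below is one common textbook variant, tf(t, d) · log(N / df(t)); I am assuming it for illustration, and the article's exact weighting and normalization may differ. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfTermSimilarity {

    // Builds the weighted vector for one term: docId -> tf(t, d) * log(N / df(t)).
    // termCounts maps docId -> number of occurrences of the term in that document,
    // so df(t) is the number of entries in the map.
    static Map<Integer, Double> tfIdfVector(Map<Integer, Integer> termCounts, int totalDocs) {
        Map<Integer, Double> vector = new HashMap<>();
        if (termCounts.isEmpty()) {
            return vector;
        }
        double idf = Math.log((double) totalDocs / termCounts.size());
        for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    // Cosine similarity of two sparse vectors; 0 if either vector has zero length.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up occurrence counts in a corpus of 10 documents.
        Map<Integer, Integer> professor = new HashMap<>();
        professor.put(0, 3);
        professor.put(1, 1);
        Map<Integer, Integer> teacher = new HashMap<>();
        teacher.put(0, 1);
        teacher.put(1, 3);
        teacher.put(2, 1);
        System.out.println(cosine(tfIdfVector(professor, 10), tfIdfVector(teacher, 10)));
    }
}
```

Note that in this plain variant the idf factor is the same for every entry of one term's vector, so it cancels out in the cosine; what changes compared with the binary case is that documents where the term occurs more often count for more.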
