Is there a dictionary-based string matching algorithm in Java?

Something that gives the percentage of similarity between two strings based on a dictionary?

Like

 public double getSimilarity(String str1, String str2);

so that a call like:

 getSimilarity("Professor", "Teacher")

would give a very high percentage?

Thanks in advance :)

  • Relating a "Professor" to a "Teacher" is something that we humans can do quite easily. To a computer, these are just two different sequences of characters. You will have to do all the heavy lifting for the computer on a problem like this. Commented Jan 13, 2013 at 15:05
  • Just run it through a thesaurus ;) Commented Jan 13, 2013 at 15:07
  • You may need an ontology for your topic, which can be analysed to get a measure of 'similarity'. Commented Jan 13, 2013 at 15:07
  • @Sanchit Thank god for Artificial Intelligence, Natural Language Processing and statistical approaches. This problem is actually addressed in a lot of research, so don't give up too early. Though they are only "a sequence of characters", given the correct context you can learn a LOT about what each one means and how they relate to one another. Commented Jan 13, 2013 at 15:30
  • Downvoters: Please explain the downvote. I find the question very helpful, and it is very clear what the OP is asking. Commented Jan 13, 2013 at 15:32

1 Answer

Shaul Markovitch and Evgeniy Gabrilovich did great work on this, described in their article "Wikipedia-based Semantic Interpretation for Natural Language Processing".

The idea is as follows: index Wikipedia (or another context source), creating a mapping for each term (word): term -> articles in which the term appears.

Each term is basically represented by a vector - for simplicity, let's say it is a binary vector - so for the term t the entry d will be '1' if and only if the term t appears in the document d.

Now, given these vectors, to find out whether two terms t1 and t2 are similar, all you have to do is take the vector similarity (for example, the cosine similarity) of the two vectors that represent t1 and t2.
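
The binary-vector version of this idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the implementation from the article: the class name `TermSimilarity`, the four-sentence toy corpus (standing in for Wikipedia), the simple `\W+` tokenizer, and the choice of cosine similarity as the vector similarity are all assumptions made for the example.

```java
import java.util.*;

public class TermSimilarity {
    // term -> ids of the documents in which the term appears
    // (the set is a sparse form of the binary vector described above)
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public TermSimilarity(List<String> documents) {
        for (int d = 0; d < documents.size(); d++) {
            // naive tokenizer: lower-case, split on non-word characters
            for (String token : documents.get(d).toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new HashSet<>()).add(d);
                }
            }
        }
    }

    // Cosine similarity of two binary vectors: |A ∩ B| / sqrt(|A| * |B|)
    public double getSimilarity(String t1, String t2) {
        Set<Integer> a = index.getOrDefault(t1.toLowerCase(), Collections.emptySet());
        Set<Integer> b = index.getOrDefault(t2.toLowerCase(), Collections.emptySet());
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // unknown term: nothing to compare
        }
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        // toy corpus, purely hypothetical
        List<String> docs = Arrays.asList(
            "The professor gave a lecture at the university",
            "A teacher gave a lesson at the school",
            "The professor and the teacher discussed the lecture",
            "Bananas are a yellow fruit");
        TermSimilarity sim = new TermSimilarity(docs);
        System.out.println(sim.getSimilarity("Professor", "Teacher")); // 0.5
        System.out.println(sim.getSimilarity("Professor", "fruit"));   // 0.0
    }
}
```

With such a tiny corpus the numbers mean little; the quality of the score depends on the size and coverage of the indexed source.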

Note: The binary vector is a simplification; the article actually uses the tf-idf score that the term t has in a document d.
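
A weighted variant might look like the sketch below. It assumes raw term counts for tf, `log(N / df)` for idf, and cosine similarity; the article's exact weighting and normalization may differ, and the class name `TfIdfTermSimilarity` is hypothetical.

```java
import java.util.*;

public class TfIdfTermSimilarity {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> counts = new HashMap<>();
    private final int numDocs;

    public TfIdfTermSimilarity(List<String> documents) {
        numDocs = documents.size();
        for (int d = 0; d < numDocs; d++) {
            for (String token : documents.get(d).toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.computeIfAbsent(token, k -> new HashMap<>())
                          .merge(d, 1, Integer::sum);
                }
            }
        }
    }

    // Sparse tf-idf vector of a term: document id -> tf(t, d) * idf(t).
    // Assumption: tf is the raw count and idf is log(N / df).
    private Map<Integer, Double> vector(String term) {
        Map<Integer, Integer> tf =
            counts.getOrDefault(term.toLowerCase(), Collections.emptyMap());
        Map<Integer, Double> v = new HashMap<>();
        if (tf.isEmpty()) {
            return v;
        }
        double idf = Math.log((double) numDocs / tf.size());
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            v.put(e.getKey(), e.getValue() * idf);
        }
        return v;
    }

    // Cosine similarity of the two weighted vectors
    public double getSimilarity(String t1, String t2) {
        Map<Integer, Double> a = vector(t1);
        Map<Integer, Double> b = vector(t2);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // unknown term, or a term that occurs in every document
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

One caveat about this sketch: idf(t) is the same factor for every entry of t's vector, so it cancels in the cosine when two single terms are compared, and only the term frequencies change the result relative to the binary version. The idf weight starts to matter once the vectors of several words are combined into one vector for a longer text.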
