2

I'm dealing with a large database table that has two columns. The first column, id, is a long, and the second column, name, is a String holding the name of the person with the corresponding id. I want to compare the name in each row with the names in the other rows.

John Carter
john Carter
Carter
jo car
Willam Carter
C William
Carter j.

All of the names in these rows should match one another. If possible, it would be great to get a match percentage or ratio. Is there a Java library or snippet that can do this? I'm open to all suggestions.

  • How could "John" ever match "William Carter"? Commented Jun 10, 2012 at 17:48
  • @OliCharlesworth Sorry, my bad. Commented Jun 10, 2012 at 17:51
  • And similarly, "C. William" and "john carter"? Commented Jun 10, 2012 at 17:53
  • @OliCharlesworth Corrected that, but you get the idea, no? Commented Jun 10, 2012 at 17:59

3 Answers

4

This library might be of interest to you: http://sourceforge.net/projects/simmetrics/

It provides different similarity measures for Strings.

From their SourceForge page:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).

4

Looks like you'll be interested in the Levenshtein algorithm for computing string distances. You can find a Java implementation here.
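Since the question also asks for a match ratio, here is a minimal sketch of how Levenshtein distance could be turned into a similarity score between 0 and 1. The class and method names (LevenshteinSketch, distance, similarity) are made up for illustration and do not come from any particular library. Lower-casing before comparing and dividing by the longer string's length are assumptions about how the ratio should behave.

```java
// Illustrative sketch only; names are hypothetical, not a library API.
final class LevenshteinSketch {

    // Classic dynamic-programming edit distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: ignore case and normalize by the longer string's length,
    // so 1.0 means identical and 0.0 means nothing in common.
    static double similarity(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(x, y) / max;
    }
}
```

Plain edit distance works on whole strings, so reordered names such as "Carter j." versus "John Carter" will likely score low unless the names are split into tokens and compared piece by piece.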

0

Have a look at the paper 'A Comparison of String Distance Metrics for Name-Matching Tasks' by William W. Cohen et al. The paper compares several string distance metrics.

They also implemented most of them in the SecondString project. It is an "open-source Java-based package of approximate string-matching techniques", so you could easily compare the different metrics to see which of them fits your requirements.

If you just need to match names, Jaro-Winkler is a good choice; it is also implemented in the SecondString package.
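To give an idea of what Jaro-Winkler computes, here is a minimal standalone sketch. It is not the SecondString implementation, and the class and method names are hypothetical. It assumes the commonly used parameters: a prefix scaling factor of 0.1 and a common prefix capped at 4 characters.

```java
// Illustrative sketch only; not SecondString's code, names are hypothetical.
final class JaroWinklerSketch {

    // Jaro similarity: based on matching characters within a sliding window
    // and the number of those matches that appear in a different order.
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;

        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;

        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that are out of order between the strings.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }

        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / s1.length() + m / s2.length() + (m - transpositions) / m) / 3.0;
    }

    // Winkler adjustment: boost the score for a shared prefix.
    // Assumed parameters: scaling factor 0.1, prefix capped at 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

The comparison here is case-sensitive, so for data like "John Carter" and "john Carter" you would probably want to lower-case both names before calling it.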

If you have all your names in a database, it may make sense to implement the similarity measure as a stored procedure, so that you avoid fetching all the data just to compare it in Java. You could then use queries like this:

-- t1.id < t2.id compares each pair of rows once and skips self-matches
SELECT t1.name, t2.name, sim(t1.name, t2.name)
FROM table t1, table t2
WHERE t1.id < t2.id AND sim(t1.name, t2.name) > 0.8
