Comparing/matching Strings in Java

Question

I'm dealing with a large database table that has two columns. The first column, id, is a long, and the second column, name, is a String holding the name of the person with that id. I want to compare the name in each row with the names in the other rows.

John Carter
john Carter
Carter
jo car
Willam Carter
C William
Carter j.

All of these names should match one another. If possible, it would be great to get a match percentage/ratio as well. Is there any Java library or snippet that can do this? I'm open to all suggestions.

How could "John" ever match "William Carter"?

– Oliver Charlesworth, Commented Jun 10, 2012 at 17:48
@OliCharlesworth Sorry, my bad.

– Binoy Babu, Commented Jun 10, 2012 at 17:51
And similarly, "C. William" and "john carter"?

– Oliver Charlesworth, Commented Jun 10, 2012 at 17:53
@OliCharlesworth Corrected that, but you get the idea, no?

– Binoy Babu, Commented Jun 10, 2012 at 17:59

Apfelsaft · Accepted Answer · 2012-06-10 18:02:09Z

4

This library could be interesting for you: http://sourceforge.net/projects/simmetrics/

It provides different similarity measures for Strings.

From their SourceForge page:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).


Dunes · Accepted Answer · 2012-06-10 17:57:18Z

4

Looks like you'll be interested in the Levenshtein algorithm for computing string distances. You can find a Java implementation here.
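In case the link is unavailable, here is a minimal sketch of the Levenshtein distance in plain Java, plus one way to turn it into the match ratio the question asks for. The class name `LevenshteinSketch`, the `similarity` helper, and the choice to trim and lowercase before comparing are my own assumptions, not part of any particular library.

```java
import java.util.Locale;

/** Minimal sketch: Levenshtein edit distance plus a normalized similarity ratio. */
final class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Ratio in [0, 1], where 1 means identical.
     * Assumption: names are trimmed and compared case-insensitively.
     */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(x, y) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(similarity("John Carter", "john Carter"));
        System.out.println(similarity("John Carter", "Carter j."));
    }
}
```

Note that plain edit distance is sensitive to word order, so pairs like "John Carter" and "Carter j." will likely score low unless the names are split into tokens and compared token by token first.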


aiolos · Accepted Answer · 2012-06-11 13:44:13Z

Have a look at the paper 'A Comparison of String Distance Metrics for Name-Matching Tasks' by William W. Cohen et al. The paper compares several string distance metrics.

They also implemented most of them within the SecondString project. It is an "open-source Java-based package of approximate string-matching techniques", so you could easily compare the different metrics to evaluate which of them fits your requirements.

If you just need to match names - Jaro-Winkler is a good choice, which is also implemented within the SecondString package.
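To give an idea of what Jaro-Winkler computes, here is a rough standalone sketch. It is not the SecondString code; the class name `JaroWinklerSketch` is hypothetical, and I am assuming the conventional prefix scaling factor of 0.1 with a prefix cap of 4 characters, applied without any minimum Jaro threshold.

```java
/** Minimal sketch of Jaro and Jaro-Winkler similarity; not taken from any library. */
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) {
            return 1.0;
        }
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) {
            return 0.0;
        }
        // Characters count as matching only if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk both strings' matched characters in order; each mismatch is half a transposition.
        int k = 0;
        int halfTranspositions = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                halfTranspositions++;
            }
            k++;
        }
        double m = matches;
        double t = halfTranspositions / 2.0;
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a shared prefix (assumed cap 4, scaling 0.1). */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("john carter", "jo car"));
    }
}
```

The comparison here is case-sensitive, so you would probably want to lowercase both names before calling it. The prefix bonus is what makes it attractive for names: abbreviations such as "jo car" keep a shared prefix with "john carter".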

If you have all your names in a database, it may make sense to implement the similarity measure as a stored procedure, so you avoid fetching all the data just to compare it in Java. You could then use queries like this:

SELECT t1.name, t2.name, sim(t1.name, t2.name) FROM table t1, table t2 WHERE sim(t1.name, t2.name) > 0.8
