
I have some strings and I want a measure of their similarity. Unlike string edit distance, for example, it should be based on structural similarity rather than on which letters the strings share.

For example: 312164 and 48479 should get a very high score, since both consist only of digits and have similar length. The same should hold for Bla blubb and bla bloob blo, because both contain only letters with gaps in between. A pair like apple and app3 f should get a lower score: the two share some letters, but their structure differs.

Something like that... Does anybody have a clue? In Java, if possible.

Thank you!

  • That is a very specific requirement. You will need to remember which characters are used, in what order, and what type they are, alphabetical, numerical, other ($, !, #, _, etc.). Commented Aug 22, 2013 at 16:05
  • What about something like StringUtils.getLevenshteinDistance() - commons.apache.org/proper/commons-lang/apidocs/org/apache/…, java.lang.CharSequence)? Commented Aug 22, 2013 at 19:33
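The two comments can be combined into a rough sketch: reduce every string to a signature of character types, then compare the signatures with an edit distance. Everything here is an assumption made for illustration and not something proposed in the thread: the class name `StructuralSimilarity`, the four character classes (`d` digit, `a` letter, space for whitespace, `s` for anything else), the decision to collapse runs of the same class into one symbol, and the normalisation to a value between 0 and 1. The Levenshtein distance is written out by hand so that the example needs no library.

```java
class StructuralSimilarity {

    /** Maps a character to a coarse class: 'd' digit, 'a' letter, ' ' whitespace, 's' other. */
    static char classify(char c) {
        if (Character.isDigit(c)) return 'd';
        if (Character.isLetter(c)) return 'a';
        if (Character.isWhitespace(c)) return ' ';
        return 's';
    }

    /** Replaces each run of same-class characters by one class symbol, e.g. "app3 f" -> "ad a". */
    static String signature(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char cls = classify(s.charAt(i));
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != cls) {
                sb.append(cls);
            }
        }
        return sb.toString();
    }

    /** Dynamic-programming Levenshtein distance that keeps only two rows in memory. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity of the two signatures in [0, 1]; 1.0 means the signatures are identical. */
    static double similarity(String a, String b) {
        String sa = signature(a);
        String sb = signature(b);
        int longest = Math.max(sa.length(), sb.length());
        if (longest == 0) {
            return 1.0; // two empty signatures are treated as identical
        }
        return 1.0 - (double) levenshtein(sa, sb) / longest;
    }
}
```

With this sketch, 312164 and 48479 both reduce to the signature `d`, Bla blubb and bla bloob blo reduce to `a a` and `a a a`, and apple and app3 f reduce to `a` and `ad a`, so the first pair scores highest and the last pair lowest. Collapsing runs throws away length information; if length should count, one option is to skip the collapsing step and compare the full per-character class strings instead.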

1 Answer

Define the kinds of similarity you care about and assign a score to each.

Example strings:

Banana

Orange

Orange 123

Banana 234

Same length = x points, where x is the shared length

Same character in both strings = 1 point per shared occurrence (case-sensitive: A != a)

Same position for a shared character = 2 more points

Deduct 1 point for each character occurrence that one string has and the other lacks

For example, compare Banana with Orange:

Length = 6 points (both are 6 characters long)

For 'a' = 1 point (both have an 'a'). If both had two a's, we would give 2 points. We would give another 2 points if 'a' were in the same position in both strings.

For 'n' = 1 point

Total positive points: 8

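Negative points:

1 for 'B', since Orange doesn't have a 'B'

2 for 'a', since Banana has three a's and Orange only one

1 for 'n', since Banana has two n's and Orange only one

1 for 'O', which Banana doesn't have

1 for 'r'

1 for 'g'

1 for 'e'

Total negative points: 8

Total positive points - total negative points = 8 - 8 = 0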

This is only rough logic, but you can derive something from it.
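A minimal Java sketch of the scoring above, to make the arithmetic concrete. The class name `RoughSimilarity` and the method `score` are made up for illustration. One point is an assumption because the rules don't cover it: strings of different length get no length points at all.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RoughSimilarity {

    /** Counts how often each character occurs (case-sensitive, so 'A' != 'a'). */
    private static Map<Character, Integer> countChars(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns positive points minus negative points for the two strings. */
    static int score(String a, String b) {
        int plus = 0;
        int minus = 0;

        // Length rule: x points when both strings have length x.
        // Assumption: strings of different length get no length points.
        if (a.length() == b.length()) {
            plus += a.length();
        }

        Map<Character, Integer> countsA = countChars(a);
        Map<Character, Integer> countsB = countChars(b);
        Set<Character> allChars = new HashSet<>(countsA.keySet());
        allChars.addAll(countsB.keySet());

        for (char c : allChars) {
            int inA = countsA.getOrDefault(c, 0);
            int inB = countsB.getOrDefault(c, 0);
            plus += Math.min(inA, inB);   // 1 point per occurrence both strings share
            minus += Math.abs(inA - inB); // 1 point off per occurrence only one string has
        }

        // Position rule: 2 more points wherever both strings have the same character.
        int overlap = Math.min(a.length(), b.length());
        for (int i = 0; i < overlap; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                plus += 2;
            }
        }
        return plus - minus;
    }

    public static void main(String[] args) {
        System.out.println(score("Banana", "Orange")); // prints 0, as in the worked example
    }
}
```

Note that this scores shared letters, not shared structure, so it would still need the character-type idea from the question (digits vs. letters vs. gaps) layered on top.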


2 Comments

Thanks! I was thinking along similar lines, but maybe there is a more general approach? Thanks again
@maggu Your situation is a specific situation and I don't know of any general approach.
