This is a rather abstract question as I yet have no idea how to solve it and haven't found any suitable solutions.

Let's start with the current situation. You'll have a list of byte[] (e.g. ArrayList<byte[]>) which behind the scenes are actually Strings, but at the current state byte[] is preferred. They can be very long (1024+ bytes per byte[] array, and the ArrayList may contain up to 1024 of them) and may have different lengths. Furthermore, they share a lot of the same bytes at the "same" locations (this is relative; for a = {0x41, 0x41, 0x61} and b = {0x41, 0x41, 0x42, 0x61}, the first 0x41 bytes and the last 0x61 are the same).

I'm looking now for an algorithm that compares all of those arrays with each other. The result should be the array that differs the most, together with some metric of how much the arrays differ. Furthermore, the task should complete within a short time.

If possible without using any third-party libraries (but I doubt this is feasible in a reasonable time without one).

Any suggestions are very welcome.

Edit:

Made some adjustments.

EDIT / SOLUTION:

I'm using the Levenshtein distance now, with some slight adjustments to improve the runtime. This is very specific to the data I'm handling, as I know that all of the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein algorithm directly on two unfiltered Strings (test data).

Thanks for your input / answers; they were a great help.
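For reference, the plain Levenshtein distance over byte arrays can be sketched as the standard two-row dynamic program. This is a hedged illustration, not the OP's actual code; the class and method names are made up for the example:

```java
public class LevenshteinDemo {
    // Classic two-row dynamic-programming Levenshtein distance over byte arrays.
    // Space is O(min-row) = O(|b|); time is O(|a| * |b|).
    static int levenshtein(byte[] a, byte[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution / match
            }
            int[] tmp = prev; prev = curr; curr = tmp; // swap rows
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        byte[] a = {0x41, 0x41, 0x61};       // "AAa"
        byte[] b = {0x41, 0x41, 0x42, 0x61}; // "AABa"
        System.out.println(levenshtein(a, b)); // one insertion -> 1
    }
}
```

On the question's own example the distance is 1 (a single inserted 0x42), which matches the intuition that the two arrays are nearly identical.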

  • Unclear. "You'll have an array of byte[]" => 1 array. "They can be very long (~1024+ bytes each)" => at least 2 arrays. How many are there? Regardless, the answer is probably all vs. all Levenshtein distance; Google it. Commented Apr 14, 2016 at 11:46
  • @j_random_hacker - thanks for your answer. I was already looking at the Levenshtein distance, but read that it doesn't perform well with long strings (which might be the case here? I haven't found a definition of the exact length). Furthermore, it compares 2 strings and not a bunch of strings, which makes me wonder which strings you'd have to compare (you don't have a "baseline"). Regarding the "unclear" part: I've adjusted the question. It is an ArrayList<byte[]>, where the size of the ArrayList is up to 1024 and the size of each byte[] array is 1024 - undefined (very long... >_<) Commented Apr 14, 2016 at 12:00
  • Since you have to handle some size between 1024 and undefined, an array is a very bad choice. If possible you should use some structure that can grow indefinitely, e.g. a List implementation of your choice. Even if the Levenshtein distance is expensive to calculate, it appears to be the relevant metric here. Comparing all arrays against all arrays will also have runtime characteristics of O(n²), which arguably isn't fast. Commented Apr 14, 2016 at 12:03
  • I see I misread "You'll have an array of byte[]", sorry about that, and thanks for clarifying. Call the alphabet A (so e.g. |A| = 256 if any byte value can appear). If the strings are different lengths, and you would be happy with an approximate distance, you could instead pick some small k (e.g. k = 3), and for each of the input strings simply count how many times each of the |A|^k possible length-k substrings appears in that string, to get a frequency vector of length |A|^k per string. (Most entries will be 0, so use a sparse representation.) You can then compare all vector pairs. Commented Apr 14, 2016 at 12:22
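The k-gram frequency-vector idea from the last comment can be sketched roughly as follows, using k = 3, a HashMap as the sparse representation, and an L1 distance between the frequency maps. All names are illustrative, and L1 is just one of several reasonable choices (cosine similarity is another common one):

```java
import java.util.HashMap;
import java.util.Map;

public class KGramDistance {
    // Sparse frequency map of all length-k substrings (k-grams) of the input.
    static Map<String, Integer> kgrams(byte[] s, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i + k <= s.length; i++) {
            String gram = new String(s, i, k); // substring [i, i+k)
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }

    // L1 distance between two sparse frequency vectors:
    // sum over all grams of |count_a - count_b|.
    static int l1(Map<String, Integer> a, Map<String, Integer> b) {
        int d = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            d += Math.abs(e.getValue() - b.getOrDefault(e.getKey(), 0));
        for (Map.Entry<String, Integer> e : b.entrySet())
            if (!a.containsKey(e.getKey())) d += e.getValue();
        return d;
    }

    public static void main(String[] args) {
        byte[] a = "AAaXYZ".getBytes();
        byte[] b = "AABaXYZ".getBytes();
        // Shared grams "aXY" and "XYZ" cancel; the rest differ.
        System.out.println(l1(kgrams(a, 3), kgrams(b, 3))); // 5
    }
}
```

Building all maps is O(total input length), and each pairwise comparison touches only the non-zero entries, so this is much cheaper than an all-pairs Levenshtein, at the cost of being an approximation.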

2 Answers


The result should be the array that differs the most, together with some metric of how much the arrays differ. Furthermore, the task should complete within a short time.

You will not be able to find a solution where the metric and the running time are independent; they go hand in hand.

For example: if your metric is like the example from your post, i.e. d(str1, str2) = d(str1.first, str2.first) + d(str1.last, str2.last), then the solution is very easy: sort your array by first and last character (maybe separately), then take the first and last elements of the sorted array. The sort gives you O(n log n).

But if your metric is something like "two sentences are close if they contain many equal words", then this does not work at all, and you end up with O(n²). Or you may be able to come up with a nifty way to re-order the words within the sentences before sorting the sentences, etc.

So unless you have a known metric with exploitable structure, it's O(n²) with the trivial (naive) implementation: compare every pair while keeping track of the maximum delta.
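The trivial all-pairs approach can be sketched like this. Note the assumptions: the byte-wise placeholder metric below is a stand-in (swap in Levenshtein or any other distance), and "differs the most" is interpreted here as the array with the largest total distance to all the others, which is one of several reasonable readings:

```java
import java.util.Arrays;
import java.util.List;

public class AllPairs {
    // Placeholder metric: count position-by-position byte mismatches,
    // plus the length difference. Replace with Levenshtein for real use.
    static int distance(byte[] a, byte[] b) {
        int min = Math.min(a.length, b.length);
        int d = Math.abs(a.length - b.length);
        for (int i = 0; i < min; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    // Naive O(n^2) scan: for each array, sum its distance to every other
    // array, and report the index of the biggest outlier.
    static int mostDifferent(List<byte[]> arrays) {
        int bestIdx = -1;
        long bestSum = -1;
        for (int i = 0; i < arrays.size(); i++) {
            long sum = 0;
            for (int j = 0; j < arrays.size(); j++)
                if (i != j) sum += distance(arrays.get(i), arrays.get(j));
            if (sum > bestSum) { bestSum = sum; bestIdx = i; }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        List<byte[]> data = Arrays.asList(
            "AAa".getBytes(), "AABa".getBytes(), "zzzz".getBytes());
        System.out.println(mostDifferent(data)); // index 2 is the outlier
    }
}
```

With up to 1024 arrays this is about half a million pairwise comparisons, which is usually tolerable as long as each single comparison is cheap.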



I'm using the Levenshtein distance now, with some slight adjustments to improve the runtime. This is very specific to the data I'm handling, as I know that all of the Strings have a lot in common (and I know approximately where). Filtering out that shared content improves the speed by a factor of 400 compared to running the Levenshtein algorithm directly on two unfiltered Strings (test data).

Thanks for your input / answers; they were a great help.
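The OP's data-specific filtering isn't shown, but the general idea of stripping shared content before running the DP can be sketched as trimming the common prefix and suffix, which is a standard Levenshtein optimization that leaves the distance unchanged. Class and method names here are illustrative:

```java
import java.util.Arrays;

public class TrimmedLevenshtein {
    // Trim the shared prefix and suffix first: removing a common prefix or
    // suffix does not change the Levenshtein distance, and the DP then runs
    // only on the (much shorter) differing middle sections.
    static int trimmedDistance(byte[] a, byte[] b) {
        int start = 0, endA = a.length, endB = b.length;
        while (start < endA && start < endB && a[start] == b[start]) start++;
        while (endA > start && endB > start && a[endA - 1] == b[endB - 1]) {
            endA--; endB--;
        }
        return levenshtein(Arrays.copyOfRange(a, start, endA),
                           Arrays.copyOfRange(b, start, endB));
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(byte[] a, byte[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args) {
        byte[] a = {0x41, 0x41, 0x61};       // "AAa"
        byte[] b = {0x41, 0x41, 0x42, 0x61}; // "AABa"
        // Prefix "AA" and suffix "a" are trimmed; DP runs on "" vs "B".
        System.out.println(trimmedDistance(a, b)); // 1
    }
}
```

When the arrays really do share long common runs, the quadratic DP shrinks from O(|a|·|b|) to the product of the trimmed lengths, which is consistent with the large speedup the OP reports for their data.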

