0

I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.

The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because of the way overflow is addressed, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.

private short sumHash(String s) {
    short hash=0;
    for (char c:s.toCharArray()) {
        hash+=c;
    }
    return hash;
}

Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.

private short productHash(String s) {
    short hash=1;
    for (char c:s.toCharArray()) {
        hash*=c;
    }
    return hash;
}

Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?

10
  • 1
    You say the hash function becomes "inaccurate" on 0. Can you be more specific what accuracy you're talking about here? In particular, are you expecting equal hashes to imply equal anagrams? Because that's impossible since there are more anagrams than integers ... Commented Apr 7, 2022 at 19:34
  • 1
    One solution is to use modulo multiplication. In multiplication mod p, cummutativity is preserved if p is prime. And modulo will allow you to establish an upper bound to prevent overflow. Commented Apr 7, 2022 at 19:37
  • @menton Yes. Much like the use of cryptographic hashes as digital signatures, I'm happy enough with a highly probable outcome, knowing that "impossible" isn't ruled out. Commented Apr 7, 2022 at 20:20
  • 1
    codereview.stackexchange.com may be a good venue for this kind of question Commented Apr 8, 2022 at 0:18
  • 1
    @Code-Apprentice I'll just remove it then. It doesn't specifically answer my question about overflows. Commented Apr 8, 2022 at 0:43

3 Answers 3

3

Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write

hash *= primes[c];

where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of 2^32, and if you're multiplying by odd primes, that's impossible.

(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)

Sign up to request clarification or add additional context in comments.

1 Comment

If one hard-coded the list of primes, it would certainly look bulky, but is arguably the most performant way to do the job.
2

You will only hit zero if

a * b = 0 mod 2^64

which is equivalent to there being an integer k such that

a * b = k * 2^64 

That is, we get in trouble if factors divide 2^64, i.e. if factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:

for (char ch : chars) {
  hash *= (ch << 1) | 1;
}

This allows you to keep 63 bits of information.

Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.

1 Comment

Combined with addition and xor, this technique produced no non-anagram collisions when tested against the first billion integers mapped to the alphabet as base-26 numbers. By comparison, using modular arithmetic found 68 non-anagram pairs (and their permutations) with the same hash value. Nicely done!
1

The naive way to avoid overflow, is to use a larger type such as int or long. However, for your purposes, modulo arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here called Group Theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick a p so that (p - 1)^2 can still be represented in a short or whatever data type you are using.

8 Comments

I agree. There are even valid English words for which the product of ASCII values for each letter would overflow a long long. Suppose I want to fit the result in an int. Would it be appropriate to use (1<<31)-1 for p and perform intermediate calculations using a long?
@phatfingers That might work. You could also do (a % p) * (b % p) %p if a and b can ever be larger than p individually.
@phatfingers Are there any limitations on the allowed characters? If you are only allowing alphabetic characters, for example, you could map them to 1 through 26 which greatly reduces the magnitude of repeated products compared to using the ascii values.
It's common to see anagram problems limited to 26 characters, but even so, it would still be easy to overflow a long at 5 bits per character vs. 7 bits per character.
If a is a character and b is the running total % p, then each individual a and b are safe.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.