1

I want to "implement" a hash function from Strings to shorts, using the java standard hashCode() function of String object. I came up with the following simple implementation:

static short shortHashCode(String str)
{
   int strHashCode = str.hashCode();
   short shorterHashCode = (short) (strHashCode % Short.MAX_VALUE);
   return shorterHashCode;
}
  1. Is my shortHashCode function a good hash function? Meaning is the chance of collisions small (chance that two different Strings will have the same hash code close to 1/Short.MAX_VALUE) ?
  2. Is there a better way to implement hash function from Strings to shorts?
6
  • 2
    Yes! a hash function must return the same hash code for identical inputs Commented Aug 4, 2014 at 14:52
  • As far as chances are concerned, there are a possibility of 32768 different short values returned. The chances are completely depended on the input set. If you use a random generator, then it depends on what statistic model that generator is based off (does each value have an equal chance or do some values have higher chances over others, etc.). Commented Aug 4, 2014 at 14:55
  • Lets just say equal chance Commented Aug 4, 2014 at 14:56
  • This question appears to be off-topic because it is asking for a code review and opinion on working code. It is off-topic here as ( too broad and opinion based ) and should probably be on codereview.stackexchange.com Commented Aug 4, 2014 at 15:00
  • 1
    @JarrodRoberson, I think the question more relates to convert between two language-specific primitives while preserving the useful semantic properties of the original. That seems on-topic to me. Commented Aug 4, 2014 at 15:03

1 Answer 1

5
(short) (strHashCode % Short.MAX_VALUE);

is losing information unnecessarily.

 (short) (strHashCode % ((Short.MAX_VALUE + 1) << 1));

would not, but would be equivalent anyway to

 (short) strHashCode

since casting an integral type to a smaller integral type just truncates the most significant bits.


It also assumes that all bits have the same entropy, which may not be true. You could try and spread the entropy around:

 (short) (strHashCode ^ (strHashCode >>> 16))

which XORs the high 16 bits with the low 16 bits.


Meaning is the chance of collisions small (chance that two different Strings will have the same hash code close to 1/Short.MAX_VALUE) ?

java.lang.String.hashCode is not a cryptographically strong hash function, so it only has that property if an attacker can't control one or both inputs to force a collision.

If you expose it to strings from an untrusted source, you might see a much higher rate of hash collisions, possibly allowing an attacker to deny service.

Also, it is designed to tradeoff a small increase in collision rate for better performance, and cross-version stability. There are better string hashing functions out there.

Sign up to request clarification or add additional context in comments.

1 Comment

Some examples showing that generating collisions for String.hascCode is pretty easy: david-soroko.blogspot.co.uk/2015/06/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.