Java hashcode() collision for objects containing different but similar Strings

Question

While verifying output data of my program, I identified cases for which hash codes of two different objects were identical. To get these codes, I used the following function:

int getHash( long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate ) {

    int result = 17;
    result = 31 * result + (int) (lID ^ (lID >>> 32));
    long temp;
    temp = Double.doubleToLongBits(dGR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    temp = Double.doubleToLongBits(dSR);
    result = 31 * result + (int) (temp ^ (temp >>> 32));
    result = 31 * result + (sCI != null ? sCI.hashCode() : 0);
    result = 31 * result + (sCO != null ? sCO.hashCode() : 0);
    result = 31 * result + (sSearchDate != null ? sSearchDate.hashCode() : 0);

    return result;
}

These are two example cases:

getHash( 50122,"03/25/2015","03/26/2015",4.0,8.0,"03/24/15 06:01" )
getHash( 51114,"03/24/2015","03/25/2015",4.0,8.0,"03/24/15 06:01" )

I suppose, this issue arises, as I have three very similar strings present in my data, and the difference in the hashcode between String A to B and B to C are of the same size, leading to an identical returned hashcode.

The proposed hashcode() implementation by IntelliJ is using 31 as a multiplier for each variable that contributes to the final hashcode. I was wondering why one is not using different values for each variable (like 33, 37, 41 (which I have seen mentioned in other posts dealing with hashcodes))? In my case, this would lead to a differentiation between my two objects.

But I'm wondering whether this could then lead to issues in other cases?

Any ideas or hints on this? Thank you very much!

33 is not an odd prime number, 31 was chosen because it's so, it's also used because of performance issues. — Maroun
– Maroun, Commented Mar 26, 2015 at 11:40
Hmmm... Maybe you want to watch this video. Just a hunch but it may have the answer you are looking for (there is a section about String's .hashCode() implementation) — fge
– fge, Commented Mar 26, 2015 at 11:40
You haven't shown a problem here... yes, different objects can have the same hash code. That's fine for any sensible use of a hash code. — Jon Skeet
– Jon Skeet, Commented Mar 26, 2015 at 11:41

Anderson Vieira · Accepted Answer · 2015-03-26 12:59:16Z

5

The hashCode() contract allows different objects to have the same hash code. From the documentation:

It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

But, since you've got a bunch of parameters for your hash, you may consider using Objects.hash() instead of doing your own implementation:

@Override
int getHash(long lID, String sCI, String sCO, double dSR, double dGR, String sSearchDate) {
    return Objects.hash(lID, sCI, sCO, dSR, dGR, sSearchDate);
}

For example:

Objects.hash(50122, "03/25/2015", "03/26/2015", 4.0, 8.0, "03/24/15 06:01")
Objects.hash(51114, "03/24/2015", "03/25/2015", 4.0, 8.0, "03/24/15 06:01")

Results in:

-733895022
-394580334

edited Mar 26, 2015 at 12:59

answered Mar 26, 2015 at 11:42

Anderson Vieira

9,0692 gold badges40 silver badges48 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Matthias Over a year ago

Thank you very much for your answer and the proposed function. I was somehow convinced that two different objects will lead to two different hash codes. Well it looks like I learned something here. :) I guess the implementation of HashSet, HashMap etc are using the hashcode then as a primary check of uniqueness and in case of identification of duplicates go on the verify those via the equals function.

Anderson Vieira Over a year ago

Glad to help! Yes, if you put two objects that have the same hash in a HashSet they will end up in the same bucket. Then contains() will use the hash to find the bucket, and equals() to identify the exact object you're looking for. If all objects had the same hash it would severely degrade the performance, but the set methods would still work properly.

Sebastian · Accepted Answer · 2015-03-26 12:05:08Z

1

The code shown by your may add zero for example by

result = 31 * result + (sCI != null ? sCI.hashCode() : 0);

When adding some zeros this may degenerate to a multiplation of

31 * 31 * 31 ...

which could destroy uniqueness.

However the hashCode method is not intended to return unique values. It simply should provide a uniform distribution of values and it should be easy to compute (or cache hashCode as the String class does).

From a more theoretical point of view a hashCode maps from a large set A into a smaller set B. Hence collisions (different elements from A map to the same value in B) are unavoidable. You could choose a set B which is bigger than A but this would violate the purpose of hashCode: performance optimization. Really, you could achieve anything with a linked list and some additional logic what you achieve with hashCode.

Prime numbers are chosen as they result in a better distribution. For example if using none primes 4*3 = 12 = 2*6 result in the same hashCode. The 31 is sometimes chosen as it is a Mersenne prime number 2^n-1 which is said to perform better on processors (I'm not sure about that).

As the hashCode method is specified not return unambiguously identify elements non-unique hashCodes are perfectly fine. Assuming uniqueness of hashCodes is a bug.

However a HashMap can be described as a set of buckets with each bucket holding a single linked list of elements. The buckets are indexed by the hashCode. Hence providing identical hashCodes leads to less buckets with longer lists. In the most extreme case (returning an arbitrary constant as hashCode) the map degenerates to a linked list.

When an object is searched in a hash data structure, the hashCode is used to get the bucket index. For each objetc in this bucket the equals method is invoked -> long lists means a large number of invocations of equals.

Conclusion: Assuming that the hashCode method is correctly used this can not cause a program to malfunction. However it may result in a severe performance penalty.

answered Mar 26, 2015 at 12:05

Sebastian

4153 silver badges8 bronze badges

2 Comments

Matthias Over a year ago

Thank you very much for your detailed answer. The uniqueness in this case is not destroyed because of an empty string, but it is true that it could. My problem was probably more that I didn't get the concept of the hash code correctly; not needing to return unique values. Thanks also for the additional insight.

Sebastian Over a year ago

You are welcome. However if you are not this used to the concept I strongly recommend to really careful read the JavaDoc for Object#equals. Maybe asked one of your colleagues if you do not know what a equivalence relation is (what does reflexivity, transitivity and symmetry) means. This bears some problems with inheritance where you can easily run into a situation where you have to either break Liskov's substituion principle or to violate the symmetry property of equals.

juhist · Accepted Answer · 2015-03-26 12:13:24Z

1

Ash the other answers explain well, it is allowed for hashCode to return same values for different objects. This is not a cryptographic hash value so it's easy to find examples of hashCode collisions.

However, I point out a problem in your code: if you have made the hashCode method yourself, you should definitely be using a better hash algorithm. Take a look at MurmurHash: http://en.wikipedia.org/wiki/MurmurHash. You want to use the 32-bit version. There are also Java implementations.

Yes, hash collisions can lead to performance issues. Therefore it's important to use a good hash algorithm. Additionally, for security MurmurHash allows a seed value to make hash collision denial of service attacks harder. You should generate that seed value you use randomly on the start of the program. Your implementation of the hashCode method is vulnerable to these hash collision DoS attacks.

answered Mar 26, 2015 at 12:13

juhist

4,34420 silver badges35 bronze badges

1 Comment

Matthias Over a year ago

Thank you very much for your answer. The function I'm using is the proposed implementation given by IntelliJ for creating the hashcode. In this case it is not used to decrypt sensitive information, but your comment on that is really interesting. I will dig a little bit deeper into that.

Collectives™ on Stack Overflow

Java hashcode() collision for objects containing different but similar Strings

3 Answers 3

2 Comments

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related