4

I've been reading a bit about the Java String class's hashCode here recently, and I haven't been able to find this information: what happens when a string's length is greater than 32 (I know an overflow then happens, but as a hash key, what happens)? For example, I need to hash strings that are between 20 and 120 characters long to use them as hash keys. Do I need to implement my own algorithm using BigInteger?

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

12
  • A key to what (what data structure/Collections class)? What are you trying to do? Commented Aug 19, 2015 at 22:12
  • I'm trying to make a file archive. There can be thousands or tens of thousands of files in it. It would take too long to do string comparisons on the file names to find a file's location in the archive, so I need a hash table. Commented Aug 19, 2015 at 22:37
  • Look also at the Bloom Filter which can help you to increase the efficiency of queries to your data repository. en.wikipedia.org/wiki/Bloom_filter Commented Aug 19, 2015 at 23:02
  • A file archive as in a jar (or a zip)? There are apis to create and manipulate jars/zips - have you looked at these? Commented Aug 19, 2015 at 23:18
  • @KevinHooke: A compressed data format would be efficient only if queries on the contents of the Strings are rarely done. It sounds to me like a full text index of the sort that would be managed by Lucene (a full text index and search framework provided by the Apache Software Foundation) would be best for archiving a large number of text-based articles. Commented Aug 19, 2015 at 23:25

3 Answers 3

14

(I know an overflow then happens, but as a hash key, what happens)?

In Java, arithmetic overflows and underflows of primitive types do not raise runtime errors or exceptions. The overflowed portion of the result is simply lost.

While this can result in logic errors or other difficulties if the programmer is not aware of this property, it is the specified behavior of the JVM.

You do not need to worry about overflow or underflow of int types when calculating hashcodes. The overflowed bits are simply lost.

This does not affect the correctness of the computed hash value or its ability to distribute to hash buckets well.
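To make the wrap-around concrete, here is a minimal sketch of the polynomial formula documented for String.hashCode(), s[0]*31^(n-1) + ... + s[n-1], evaluated in plain int arithmetic. The class and method names are made up for illustration; this is not the JDK source, only a reimplementation of the documented formula.

```java
// Sketch of the polynomial hash documented for java.lang.String,
// computed in 32-bit int arithmetic. Names here are hypothetical.
final class HashOverflowDemo {

    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // int multiplication and addition wrap silently on overflow;
            // no exception is thrown, the high bits are simply dropped
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ZZZZZ", "ZZZZZZ", "ZZZZZZZ"}) {
            System.out.println(s + " -> " + polynomialHash(s)
                    + " (String.hashCode(): " + s.hashCode() + ")");
        }
    }
}
```

The value goes negative as soon as the running total passes Integer.MAX_VALUE, and it is still a perfectly usable hash code.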

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

A couple things that can be handy to keep in mind:

  • Java Strings are immutable. For this reason, the hash value of a String instance is calculated only once. After that, the result is cached in the instance so that subsequent invocations of hashCode() do not result in repeated computations. This works because Strings are immutable and recomputing the value would be the same every time.

  • The hash code really should be computed from all the meaningful information in an instance. This means that if your String contains 20k of information, the hash code should be computed from all 20k of it (but see above). Of course, there are performance implications, so you should design your program accordingly.

  • Collision 'free'-ness has much, much more to do with the quality of your hashCode() implementation and less to do with the size of your Strings. Algorithms used to generate hash codes should be capable of producing good distributions. What exactly makes a "good hash function" is not precisely defined and remains a subject for mathematical theorists. Fortunately it is not hard to define a hash function that is "good enough" even if it may not be "state of the art" (see Effective Java, 2nd ed.; J. Bloch).
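For a rough feel of what "collision-free enough" means at 30k to 80k keys, here is a back-of-the-envelope birthday estimate. It assumes hash codes are spread uniformly over all 2^32 int values, which is an idealization and not a measured property of String.hashCode(); the class and method names are hypothetical.

```java
// Birthday-style estimate, ASSUMING hash codes are uniformly distributed
// over 2^32 values. Real file names will not match this model exactly.
final class CollisionEstimate {

    private static final double HASH_VALUES = 4294967296.0; // 2^32

    // Expected number of key pairs that share the same 32-bit hash code.
    static double expectedCollidingPairs(long n) {
        return n * (n - 1) / 2.0 / HASH_VALUES;
    }

    // Poisson approximation of the chance that at least one pair collides.
    static double probabilityOfAnyCollision(long n) {
        return 1.0 - Math.exp(-expectedCollidingPairs(n));
    }

    public static void main(String[] args) {
        for (long n : new long[] {30_000L, 80_000L}) {
            System.out.printf("n=%d: expected colliding pairs=%.3f, P(any collision)=%.3f%n",
                    n, expectedCollidingPairs(n), probabilityOfAnyCollision(n));
        }
    }
}
```

Under that assumption the chance of at least one collision is roughly 10% at 30k keys and roughly 50% at 80k keys, so a few collisions should be expected. That is fine for a hash table, which resolves collisions with equals(), but it means the 32-bit hash code on its own is not a unique identifier.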

5

You are misunderstanding what hashCode() does. It calculates a 32-bit number that should be different for different values, but is not guaranteed to be so. How could it be, when there may be more than 2^32 different values to hash?

For a String, the hashCode has nothing to do with the string length. Any hashCode is a valid hashCode for any string, as long as you always get the same hashCode for the same String, i.e. calling hashCode() multiple times for the same sequence of characters must return the same value.

As an example, here are some hash codes for strings.

0x00000000 = "".hashCode()
0x00000061 = "a".hashCode()
0x00000041 = "A".hashCode()
0x042628b2 = "Hello".hashCode()
0x6f8f80f1 = "Goodbye".hashCode()
0xdbacdd53 = "The quick brown fox jumps over the lazy dog".hashCode()
0x99eecd2e = "The quick brown fox jumps over the lazy dog!".hashCode()

Notice that the last two strings are longer than 32 characters.
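Two different strings can also share a hash code, and a HashMap still keeps them apart because it falls back to equals() within a bucket. A small sketch, using the colliding pair "Aa" and "BB" (both hash to 2112); the class name, method name and offset values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: colliding hash codes do not break HashMap lookups.
// The "file name -> offset" mapping and its values are hypothetical.
final class CollisionLookupDemo {

    static Map<String, Long> buildIndex() {
        Map<String, Long> offsets = new HashMap<>();
        offsets.put("Aa", 0L);     // hashCode() == 2112
        offsets.put("BB", 4096L);  // hashCode() == 2112 as well
        return offsets;
    }

    public static void main(String[] args) {
        Map<String, Long> offsets = buildIndex();
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());
        // Both keys land in the same bucket, yet each lookup returns its own value
        System.out.println(offsets.get("Aa") + " " + offsets.get("BB"));
    }
}
```

So for a lookup table keyed by name, the built-in String.hashCode() together with HashMap is normally sufficient; uniqueness of the hash itself is not required.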

1 Comment

@Maxime - Based on your answer to my question about what you are trying to do (a file lookup), I don't think hashCode() is what you are looking for as it's not guaranteed to be unique.
2

There is no overflow on Strings. Strings can be as long as your process' memory can hold. The hashCode of any String is a 32-bit integer. The collision frequency should not have a correlation with the String's length. You don't need to reimplement it.

5 Comments

I mean, the int used to hold the hash code undergoes an overflow if the string's length is higher than 31.
The overflow is not a problem as long as most (or all) possible integer values can be reached (i.e. the values are equally distributed). And of course more than one string (possibly infinitely many) maps to the same hashCode (including negative ones).
BTW1: String.hashCode in Oracle Java overflows much sooner than 31 chars. For example, the first two overflows occur within the first 7 ASCII chars: "ZZZZZ".hashCode()==85887450, "ZZZZZZ".hashCode()==-1632456256, "ZZZZZZZ".hashCode()==933463706
BTW2: in Java 1.2 the hashCode implementation changed; before that it examined at most 16 characters of the string, which caused a lot of collisions for long strings that share most of their content, such as a common prefix (as is typical for file paths). Since then you normally don't have to worry about pathological cases for hashes.
Actually, an array (the char[] array in String) is limited to Integer.MAX_VALUE elements.
