1

I need a hash algorithm that produces a 64-bit hash code (long) with fewer collisions than String.GetHashCode() and that is fast (no expensive calls to cryptographic functions). Here's an implementation of FNV which still shows 3% of collisions after testing 2 million random strings. I need this number to be way lower.

void Main()
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";
    const int n = 2000000;
    var random = new Random();
    var hashes = new HashSet<long>();
    int collisions = 0;
    for(int i = 0; i < n; i++)
    {
        var len = random.Next(chars.Length);
        var str = new char[len];
        for (int j = 0; j < len; j++)
        {
            str[j] = chars[random.Next(chars.Length)];
        }
        var s = new String(str);
        if(!hashes.Add(Get64BitHash( s ))) collisions++;
    }
    Console.WriteLine("Collision Percentage after " + n + " random strings: " + ((double)collisions * 100 / n));
}


public long Get64BitHash(string str)
{
  unchecked
  {
     byte[] data = new byte[str.Length * sizeof(char)];
     System.Buffer.BlockCopy(str.ToCharArray(), 0, data, 0, data.Length);

     const ulong p = 1099511628211UL;
     var hash = 14695981039346656037UL;
     foreach(var d in data)
     {
        hash ^= d;
        hash *= p;
     }
     return (long) hash;
  }
}

SAMPLE OUTPUT OF ABOVE CODE:

Collision Percentage after 2000000 random strings: 3.01485 %

3% is the same collision percentage as just calling String.GetHashCode(). I need something way better.

PS: There's a chance I am doing something terribly long.

EDIT: Solved. Get64BitHash method above is correct. The problem was that my strings weren't random. After making sure strings are unique (see revised code below), I get zero collisions on almost 50 million unique strings, versus ~1% collisions using String.GetHashCode().

void Main()
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";
    const int n = 200000000;
    var random = new Random();
    var hashes = new HashSet<long>();
    var strings = new HashSet<string>();
    int collisions = 0;
    while(strings.Count < n)
    {
        var len = random.Next(chars.Length);
        var str = new char[len];
        for (int j = 0; j < len; j++)
        {
            str[j] = chars[random.Next(chars.Length)];
        }
        var s = new String(str);
        if(!strings.Add(s)) continue;
        if(!hashes.Add(s.GetHashCode())) collisions++;
    }
    Console.WriteLine("Collision Percentage after " + n + " random strings: " + ((double)collisions * 100 / strings.Count));
}
4
  • It might help if you can specify what target you are trying to hit - specifically how much better than 3% does it need to be? Commented Jul 16, 2015 at 21:35
  • "no expensive calls to cryptographic functions" - NEVER. ROLL. YOUR. OWN. SECURITY. Commented Jul 16, 2015 at 21:38
  • 1
    You could look at the 64-bit variant of MurmurHash or simply take the least-significant 64 bits of the 128-bit MurmurHash. Commented Jul 16, 2015 at 21:45
  • Why isn't the calls to your function Get64BitHash not consider expensive? Commented Jul 16, 2015 at 21:59

3 Answers 3

3

The problem is your strings aren't random. Test your string before hashing it a second time.

Sign up to request clarification or add additional context in comments.

4 Comments

Why aren't they random? They seem quite random to me.
Running your string creator for loop will create duplicate strings. Also, it will create an empty string. Hash algorithm collision has to do with the potential to hash an infinitely long string to a fixed length, therefore multiple strings can hash to the same result.
OK, that's true. There can be short strings that are colliding. Good catch.
Yeap, @drooksy is right. Those weren't completely random strings. After checking for duplicates, things improve quite a bit.
1

3% is the same collision percentage as just calling String.GetHashCode()

Maybe that is the theoretical optimum. The built-in hash code is not bad. Try it with SHA2 to confirm that this is the best you can do.

Since your test strings are random the hash codes are probably well distributed as well.

Optimize the function by not creating two temporary buffers that do not seem to serve any purpose. Just access the chars directly (str[0]). That way you save the copy and process two bytes per iteration.

5 Comments

One important issue if you planning on persisting or transmitting the value. String.GetHashCode() is only guaranteed to give you the same value within the same AppDomain, if you save the value and re-run the program or send the value to another computer running the program a GetHashCode() call may not return the same value.
@usr What do you mean by processing the string twice?
@3-14159265358979323846264 run the loop twice over the same data.
If 3% of hashes are colliding, then surely 3% of hashes of hashes will also collide? Or am I still not understanding?! It's late so it's probably me!
OK, you've got a point there :) (I did not mean to hash the hash - I meant two loop passes without resetting hash. But your point stands.) Two passes shuffle the bits better, though, which might give a hash table an advantage.
0

You should count the real Hash collisions, because most of them result from colliding strings.

Declare the following :

var hashesString = new HashSet<string>();
int collisionsString = 0 ;
int testedCollisions = 0 ;

Then modify your code as follow:

   if(hashesString.Add(s))
   { // Count collisions only for new strings
     testedCollisions++ ;
     if (!hashes.Add(Get64BitHash( s ))) collisions++;
   }
 }
 Console.WriteLine("Collision Percentage after " + testedCollisions + " random strings: " + ((double)collisions * 100 / testedCollisions));

I did a run with the updated code and got no real collisions (just 60 000 duplicated strings).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.