36

I want to do some tests which require some strings with the same hash code, but not the same strings. I couldn't find any examples, so I decided to write a simple program to do it for me.

The code below generates two random strings over and over until they generate the same hash code.

    static Random r = new Random();
    static void Main(string[] args)
    {
        string str1, str2;
        do
        {
            str1 = GenerateString();
            str2 = GenerateString();
        } while (str1.GetHashCode() != str2.GetHashCode() && str1 != str2);

        Console.WriteLine("{0}\n{1}", str1, str2);
    }

    static string GenerateString()
    {
        string s = "";
        while (s.Length < 6)
        {
            s += (char)r.Next(char.MaxValue);
        }
        return s;
    }

This code seems to be working (theoretically), but it may take centuries to complete. So I was thinking of doing vice versa and generate two strings from one hash code.

I know it's not possible to retrieve a string from a hash code, but is it possible to generate possible strings from it?

I'm using Visual Studio 2015 Community Edition. Version: 14.0.23107.0D14REL.

.NET Framework: 4.6.00081.

17
  • 2
    If you want just for a test, you can create a class and override GetHashCode with a bad implementation that returns a fixed value, effectively producing collisions everywhere. Commented Aug 15, 2015 at 17:25
  • 12
    It takes just a few seconds, the Birthday Paradox makes it quick. "1241308".GetHashCode() == "699391".GetHashCode() in 32-bit code, "942".GetHashCode() == "9331582".GetHashCode() in 64-bit code. Commented Aug 15, 2015 at 17:34
  • 4
    Taking advantage of the birthday paradox is done by storing all of the strings you've generated so far. Each time you generate a new one you check it against all of the others. It increases the chances of finding a match enormously. If there are N hash codes then your method will have a 50% chance of finding a collision after checking N/2 strings. The birthday method will have a 50% chance after checking √N strings. For 32-bit hash codes where N=~4 billion, this is the difference between checking 2 billion strings or 65 thousand of them. Commented Aug 15, 2015 at 17:45
  • 1
    @ScottChamberlain: That's not true. The math only applies if the hash-values are approximately uniformly-random. If they are not, you can easily need many more hashes to find a collision (ex. hash(x) = x%INT_MAX, you'd iterate over INT_MAX values before finding a collision). Commented Aug 15, 2015 at 23:06
  • 1
    If you are using 64 bits, you can just use "\0Anything" and "\0Really, anything will work."- Why is String.GetHashCode() implemented differently in 32-bit and 64-bit versions of the CLR?. Of course, it is better to look for these strings for you current implementation - \0 will not always work. Commented Aug 16, 2015 at 5:16

2 Answers 2

39

Finding two strings by repeatedly comparing random strings will take practically forever. Instead generate strings and store them in an a dictionary by hashcode. Then look up each. Match found pretty quickly.

MATCH FOUND!! xqzrbn and krumld hash code 80425224

void Main()
{

    var lookup = new Dictionary<int,string>();

    while(true) {
        var s = RandomString();        
        var h = s.GetHashCode();
        string s2;
        if (lookup.TryGetValue(h, out s2) && s2 != s) {
            Console.WriteLine("MATCH FOUND!! {0} and {1} hash code {2}",
                lookup[h],
                s,
                h);
            break;
        }
        lookup[h] = s;

        if (lookup.Count % 1000 == 0) {
            Console.WriteLine(lookup.Count);
        }
    }
}

static Random r = new Random();

// Define other methods and classes here
static string RandomString() {

    var s = ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString();

    return s;
}
Sign up to request clarification or add additional context in comments.

12 Comments

Thanks for the answer. @ScottChamberlain and SamuelNeff. it seems the method generating the string also affects the chances! more characters reduces the chance to get same hashcode.
@M.kazemAkhgary That's not true, You feel that way because the alpha method uses a lot more CPU power dus taking longer to find a match. the number of items should be the same.
@Behrooz The only way a dictionary can "crawl" is if you store lots of values that have the exact same HashCode but Equals( returns false. Using a int it is impossible to have (x.GetHashCode() == y.GetHashCode()) && (x.Equals(y) == false)
@Behrooz See this old answer of mine where I go in to the process in more detail than a comment will allow.
@Behrooz note O(1) does not mean fast, it means constant time. It can be constantly slow and still be O(1). That's not to say a dictionary is slow, just saying in general. Performance of a dictionary depends on the number of buckets and distribution of keys across buckets (randomness of hashcode).
|
31

Take advantage of the Birthday Paradox. Instead of only testing two strings directly, test all strings you have seen before.

using System;
using System.Collections.Generic;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            var words = new Dictionary<int, string>();

            int i = 0;
            string teststring;
            while (true)
            {
                i++;
                teststring = i.ToString();
                try
                {
                    words.Add(teststring.GetHashCode(), teststring);
                }
                catch (Exception)
                {
                    break;
                }
            }

            var collisionHash = teststring.GetHashCode();
            Console.WriteLine("\"{0}\" and \"{1}\" have the same hash code {2}", words[collisionHash], teststring, collisionHash);
            Console.ReadLine();
        }
    }
}

For my machine it produces the output

"699391" and "1241308" have the same hash code -1612916492

almost instantly.

Due to how strings are hashed in .NET you may not get the exact same output as me, but it should be just as fast.

2 Comments

When running this on .NET Core, the hash code changes every time the application runs, so any hash codes posted in this question will be different than your application. This behavior is described here. If you need two strings with the same hash code for testing, as I did, you will need to generate them every time your application runs. Fortunately this takes less than 100 ms. Or you can create a new class with a bad implementation of GetHashCode, as Alejandro suggested.
Yes, that is why I put "Due to how strings are hashed in .NET you may not get the exact same output as me, but it should be just as fast."

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.