7

not sure if this is possible but in python there is a hash() function which takes a string or an integer and generates a [EDIT not-unique] integer representation of that input.

My question is (after searching online), how to reverse the generated integer back into the original String.

Thanks.

7
  • 17
    You can’t, and it’s not unique. That’s what makes it a hash. Commented Jun 30, 2013 at 19:11
  • 1
    You can't. This is why hashing is used in cryptography. Commented Jun 30, 2013 at 19:11
  • 3
    @doukremt: Not all hashes are cryptographically safe. The hash() function in Python is definitely not. Commented Jun 30, 2013 at 19:12
  • 3
    Also, from python3.3 onwards hash randomization in turned on by default, making hashes be different between different invocations of the interperter. Commented Jun 30, 2013 at 19:14
  • 1
    @minitech: Thanks. I guess it has the nice advantage that if someone uses this non-secure hash to hash passwords in a database, he will have a bad surprise. Commented Jun 30, 2013 at 19:23

6 Answers 6

12

The built-in function hash in Python is used mainly for hashing keys in collections like dict and set. The only required property, more or less, is that objects which compare equal must have the same hash value.

In the default CPython implementation of Python, for objects of type str, bytes or datetime, a hard-to-invert hash function is used (SipHash) and moreover hash randomization is used since Python 3.3 to prevent hash-flooding attacks. This means that the hash changes across invocations of Python. So if you're trying to invert hash() to recover a string with that hash, forget about it.

But for numeric values (including int, float, Decimal, Fraction), a simple and elegant hashing function is used since 2010, which has the property that for equal numeric values their hashes are equal even if they are of different types (like 42 and 42.0 and Decimal('42') and Fraction(42, 1) and complex(42, 0)). The function used is as follows (ignore negative numbers for a moment):

  • For an integer n, the hash is given simply by hash(n) = n mod P, where P is sys.hash_info.modulus = 261 - 1, a large prime.

  • This is generalized to all finite rational numbers as follows: for a rational number n = p/q, the hash is hash(p/q) = (p/q) mod P, with the division being interpreted modulo P. In other words, if p/q is in lowest form (or at least, common factors of P removed), then hash(p/q) is got by computing the inverse of q mod P, multiplying that with p, and taking the result modulo P.

  • For negative numbers, hash(-n) = -hash(n).

(There are a couple more details on special values: if n is floating-point infinity, or if q has no inverse, i.e. is a multiple of P, then sys.hash_info.inf is used, and if n is a NaN then sys.hash_info.nan is used. Also the hash value is never -1.)

This makes it a nice exercise to invert this hash function. For a given nonnegative value h, where 0 ≤ h < P,

  • An integer n has hash(n) = h if and only if n mod P is h, i.e. if n is of the form (h + kP) for some integer k.

  • A (IEEE 754 double-precision) floating-point number with 52 mantissa bits m and 11 exponent bits e represents the rational number

    (252 + m)/252 × 2e-1023

    so if its hash is h, then we have the congruence conditions:

    (252 + m) (2e-1023-52) ≡ h mod P

    (252 + m) ≡ ((21023+52-e) × h) mod P

    m ≡ (21023+52-e × h - 252) mod P

    with the only constraint on m being that 0 ≤ m < 252. So for each of the 211 = 2048 possible values of e, we can compute the corresponding m and verify whether it leads to a valid floating-point number.

So here is a (Python 3) function that for a given h yields all int and float values n such that hash(n) is h.

import sys

def invert(h):
    if h == -1: return []  # -1 gets coerced to -2 so no value has hash -1
    if h < 0:
        sign = -1
        h = -h
    else:
        sign = 1
    M = sys.float_info.mant_dig - 1  # = 52 = Bits available for mantissa
    E = (sys.float_info.max_exp - sys.float_info.min_exp + 1)  # = 1023 = bias
    B = sys.float_info.radix  # = 2, base of the floating point values
    P = sys.hash_info.modulus  # = 2^61 - 1 = the prime used as hash modulus
    if not (0 <= h == int(h) < P):
        return []
    for e in range((E + 1) * 2):
        # Want m such that (B^M + m) * B^(e-M-E) = h mod P
        m = (h * B**(M+E-e) - B**M) % P
        if m >= B**M: continue  # Happens with probability (1-B**M/P)
        f = (B**M + m) * B**(e-M-E)
        if f == int(f): continue  # We'll see this later as an integer
        assert hash(f) == h
        yield sign * f
    # Special values if any
    if h == sys.hash_info.inf:
        yield sign * float('inf')
    if h == sys.hash_info.nan:
        yield float('nan')
    # Now the integers
    k = 0
    while True:
        yield sign * (h + k * P)
        k += 1

Example usage:

num = 0
for n in invert(314159265):
    print(hash(n), n)
    num += 1
    if num > 25: break

Output:

314159265 2.1332628416727795e-304
314159265 4.918969210286518e-286
314159265 1.1342370766076572e-267
314159265 2.6153726338867434e-249
314159265 6.030638704336553e-231
314159265 1.390570609748797e-212
314159265 3.2064375193072873e-194
314159265 7.393541538375207e-176
314159265 1.7048346069593532e-157
314159265 3.9310809603228e-139
314159265 9.064455551013383e-121
314159265 2.0901211464632472e-102
314159265 4.81949123398199e-84
314159265 1.111299016984405e-65
314159265 2.5624810694595406e-47
314159265 5.908679060255712e-29
314159265 1.3624486304777972e-10
314159265 314159265
314159265 2305843009527853216
314159265 4611686018741547167
314159265 6917529027955241118
314159265 9223372037168935069
314159265 11529215046382629020
314159265 13835058055596322971
314159265 16140901064810016922
314159265 18446744074023710873

etc.

Sign up to request clarification or add additional context in comments.

2 Comments

How long did the example usage take to run?
@ruohola It's very fast: if I change it to return a list to avoid the I/O (print) then it runs in about 6 ms, and with the print it's about 60 milliseconds. It's basically a loop over 1075 values of the exponent e, for each a single constant-time computation of the corresponding mantissa bits m. The infinite loop after that is a straightforward one with a single computation per iteration.
9

You can't theoretically do that, at least not in an efficient manner (read: "in reasonable time"), even if the hash is not cryptographically secure.

Now if your search space is small enough (say, for example, if the only possible input is a list of 1000 words), you might pre-compute a sorted table of all possible hashes (as a key) and their corresponding inputs and perform a O(log(n)) a lookup on that.

This would of course give you a list of possible results, as hashes are not unique. Now, again, if your search space is small enough, you may only have unique results for each and every input. But we can't say anything sure about it unless we know more about the source of your data.

4 Comments

Of course, in the specific case of smallish integers with Python's hash, you don't need such a lookup table: hash(12) == 12.
but hash(10**2000) == 2342378340969425830
@Dougal: Indeed. Replaced with a more meaningful example. Thanks.
@Elazar Yeah; it seems to work in approximately [-2**60, 2**60] (except for -1, which is apparently used to signal internal errors and replaced by -2). That page also shows the old Python string hash algorithm, pre-randomization.
5

You can’t, and it’s not unique. That’s what makes it a hash. From help(hash):

Return a hash value for the object. Two objects with the same value have the same hash value. The reverse is not necessarily true, but likely.

So this isn’t really possible in general. You can check a certain list for a matching hash, but you can never be sure it was the original unless you know that the original is in some set and doesn’t have a collision with another item in that set.

Comments

2

An inverse hash function would not be (in general) unique even if you could invert it. For example, there are an infinite number of strings from which hash keys are generated into a finite integer range limited by the word size on your machine.

Comments

0

Hashes are meant to be computationally expensive to reverse. Generally the only way to "reverse" them is to bruteforce the input that was used to generate the output.

Comments

0

Another point that people are missing isn't just that its hard to find a string that matches a hash, but also that there isn't enough information there to determine what the string was.

A hash is (usually) a cryptographic way of converting a given input into an integer that is unreversible. However, it is possible for hashes to clash or collide, which is possible in MD5. As such, under such hashing functions, the number of different strings which could hash to the same number is infinite - so even if it were possible to reverse (its not), you still wouldn't know which string was the original!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.