14

I'm struggling a bit to generate an integer ID for a given string in Python.

I thought the built-in hash function would be perfect, but it appears that the IDs are sometimes too long. That's a problem since I'm limited to 64 bits as the maximum length.

My code so far: hash(s) % 10000000000. The input strings I can expect will be 12-512 characters long.

Requirements are:

  • integers only
  • generated from provided string
  • ideally up to 10-12 digits long (I'll have only ~5 million items)
  • low probability of collision..?

I would be glad if someone can provide any tips / solutions.

3

3 Answers

19

I would do something like this:

>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("some string".encode('utf-8'))
>>> str(int(m.hexdigest(), 16))[0:12]
'120665287271'

The idea:

  1. Calculate the hash of the string with MD5 (or SHA-1 or ...) in hexadecimal form (see the hashlib module).
  2. Convert the hex string to an integer, then convert it back to a base-10 string (so the result contains only digits).
  3. Use the first 12 characters of that string.

If characters a-f are also okay, I would do m.hexdigest()[0:12].
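
On the "low probability of collision" requirement, a rough birthday-bound estimate helps when picking the ID length. The sketch below assumes the IDs are spread uniformly over the ID space, which is only approximately true for a truncated hash; the function names are made up for illustration.

```python
import math

def expected_collisions(n_items: int, id_space: int) -> float:
    """Approximate expected number of colliding pairs (birthday bound)."""
    return n_items * (n_items - 1) / (2 * id_space)

def collision_probability(n_items: int, id_space: int) -> float:
    """Approximate probability of at least one collision."""
    return -math.expm1(-expected_collisions(n_items, id_space))

n = 5_000_000  # item count from the question
for label, space in [("12 decimal digits", 10**12), ("64 bits", 2**64)]:
    print(label, expected_collisions(n, space), collision_probability(n, space))
```

Under that assumption, 5 million items in a 12-digit space give roughly 12.5 expected colliding pairs, so at least one collision is almost certain. Using the full 64 bits brings the probability of any collision to under one in a million.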

9 Comments

Thanks, it looks great! It does not return an integer, but it's just a matter of casting it back to int. It would be nice if we could do away with the int/str/int coercion dance. Any idea? :)
m.hexdigest() provides a string with 32 characters. So the maximum value is 'f'*32, which has 39 decimal digits (=len(str(int('f'*32,16)))). So you can divide by 1E17 in the end. With this solution collisions might be more probable... But I have not thought it through...
m.hexdigest() provides m.digest_size * 2 characters (this might change, depending on the hash function you want to use)
Note: you can also use the byte string from digest(), slice enough bytes from it and convert them to an integer (or rather: interpret the byte string as an integer)
I had to write "some string".encode('utf-8') instead
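
The digest() suggestion from the comments above could look roughly like this. It is only a sketch: the function name and the num_bytes parameter are made up for illustration, and 8 bytes is chosen to fit the 64-bit limit from the question.

```python
import hashlib

def string_to_int_id(s: str, num_bytes: int = 8) -> int:
    """Interpret the first num_bytes bytes of the MD5 digest as an unsigned integer."""
    digest = hashlib.md5(s.encode("utf-8")).digest()  # 16 raw bytes
    return int.from_bytes(digest[:num_bytes], "big")

print(string_to_int_id("some string"))
```

With num_bytes=8 the result lies in the range 0 to 2**64 - 1 and no int/str/int round trip is needed. If the ID has to fit a signed 64-bit column, one more bit has to go, for example by shifting the result right by one; if it must be at most 12 decimal digits, take it modulo 10**12.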
3

Encoding to UTF-8 was needed for mine to work:

def unique_name_from_str(string: str, last_idx: int = 12) -> str:
    """
    Generates an id name from a string: the first `last_idx` decimal digits
    of its MD5 hash (deterministic, but collisions are possible).
    refs:
    - md5: https://stackoverflow.com/questions/22974499/generate-id-from-string-in-python
    - sha3: https://stackoverflow.com/questions/47601592/safest-way-to-generate-a-unique-hash
    (- guid/uuid: https://stackoverflow.com/questions/534839/how-to-create-a-guid-uuid-in-python?noredirect=1&lq=1)
    """
    import hashlib
    m = hashlib.md5()
    # md5 works on bytes, so the str has to be encoded first
    m.update(string.encode('utf-8'))
    unique_name: str = str(int(m.hexdigest(), 16))[0:last_idx]
    return unique_name

See my ultimate-utils Python library.

1

If you're not allowed to add an extra dependency, you can continue using the hash function in the following way:

>>> my_string = "whatever"
>>> str(hash(my_string))[1:13]
'460440266319'

NB:

  • I am ignoring the first character, as it may be the negative sign.
  • hash may return different values for the same string, because the PYTHONHASHSEED value changes every time you run your program. You may want to set it to some fixed value.
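
A possible variant that stays numeric and avoids slicing off the first character: in Python, % with a positive modulus never returns a negative number, so the sign takes care of itself. This is a sketch only; the function name is made up, and the point about PYTHONHASHSEED above still applies.

```python
def builtin_hash_id(s: str, digits: int = 12) -> int:
    """Map the built-in hash to a non-negative integer with at most `digits` digits."""
    # Python's % returns a result with the sign of the (positive) modulus
    return hash(s) % 10**digits

my_string = "whatever"
print(builtin_hash_id(my_string))  # value differs between runs unless PYTHONHASHSEED is fixed
```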
