This is my dataframe.

  Student  Studendid          Student  Studendid          Student  Studendid   
0   Stud1    1              0   Stud1    ah274as        0   Stud1    1
1   Stud2    2              1   Stud2    ah474as        1   Stud2    2  
2   Stud3    3              2   Stud3    ah454as        2   Stud3    3  
3   Stud4    4      hash    3   Stud4    48sdfds  hash  3   Stud4    4  
4   Stud5    5       ->     4   Stud5    dash241    ->  4   Stud5    5 
5   Stud6    6              5   Stud6    asda212        5   Stud6    6
6   Stud7    7              6   Stud7    askdkj2        6   Stud7    7  
7   Stud8    8              7   Stud8    kadhh23        7   Stud8    8  
8   Stud9    9              8   Stud9    asdhb27        8   Stud9    9  

I would like to hash the student ID for each student. I have already tried the hash() function, but I have not found a way to reverse it. I want to hash the values and later recover the originals. Which method lets me hash the student ID and then map it back?

# Series has no .hash() method; apply the built-in hash() to each value instead
df["Studendid"] = df["Student"].apply(hash)

1 Answer

As @Ch3steR commented:

This is correct only if every value has a unique hash value, and no hash function guarantees that. Every hash function is collision-prone.

# Example for collision
hash(0.1) == hash(230584300921369408)
True

Note: Since Python 3.3, the hash values of str and bytes objects are salted with a random value by default. The salt changes every time the interpreter starts, so hash() returns a different result for the same string in each session. This protects against denial-of-service attacks that rely on deliberately crafted dictionary hash collisions.

# Example taken from Martijn's answer: https://stackoverflow.com/a/27522708/12416453
>>> hash("235")
-310569535015251310

Now, open a new session.

>>> hash("235")
-1900164331622581997
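
If the value has to stay the same across sessions, one option is a digest from hashlib instead of the built-in hash(). This is a minimal sketch, assuming SHA-256 is acceptable for the purpose; stable_hash is a hypothetical helper name, not something from the question.

```python
import hashlib

def stable_hash(value: str) -> str:
    """Return a hex digest that does not depend on the interpreter session."""
    # hashlib digests are not salted per process, unlike hash() on str objects
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

print(stable_hash("235"))
```

A digest like this is still one-way, so mapping it back to the original value needs a lookup table.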

If you have only a few rows of data, you can use a helper dictionary. d2 maps each hash to its original value, and d1 swaps the keys and values so that it maps each value to its hash. Pass d1 to Series.map to hash and d2 to map back:

d2 = {hash(x):x  for x in df['Student']}
d1 = {v:k for k, v in d2.items()}

df['Studendid']= df['Student'].map(d1)
df['orig']= df['Studendid'].map(d2)
print(df)
  Student            Studendid   orig
0   Stud1  6001180169368329239  Stud1
1   Stud2 -1507322317280771023  Stud2
2   Stud3 -2262724814055039076  Stud3
3   Stud4   364063172999472918  Stud4
4   Stud5  8548751638627509914  Stud5
5   Stud6  5647607776109616031  Stud6
6   Stud7   729989721669472240  Stud7
7   Stud8  4828368150311261883  Stud8
8   Stud9  8466663427818502594  Stud9
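
Because any hash can collide, the dictionary approach silently overwrites an entry when two values share a hash. The sketch below shows one way to build both lookup dictionaries while checking for collisions. It swaps the built-in hash() for a shortened SHA-256 digest so the IDs are stable across sessions; build_lookup and digest_chars are hypothetical names chosen for illustration.

```python
import hashlib

def build_lookup(values, digest_chars=12):
    """Build forward (value -> id) and reverse (id -> value) dictionaries.

    Raises ValueError if two different values get the same shortened digest.
    """
    forward, reverse = {}, {}
    for value in values:
        if value in forward:
            continue  # repeated value, already mapped
        digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        key = digest[:digest_chars]  # shortening raises the collision risk
        if key in reverse:
            raise ValueError(
                f"collision: {value!r} and {reverse[key]!r} share {key}"
            )
        forward[value] = key
        reverse[key] = value
    return forward, reverse

students = [f"Stud{i}" for i in range(1, 10)]
to_id, to_name = build_lookup(students)

# Round trip: hash each name, then map the hash back to the name
restored = [to_name[to_id[s]] for s in students]
print(restored == students)
```

With pandas, the two dictionaries can be passed to Series.map in the same way as d1 and d2 above.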

4 Comments

This is correct only if every value has a unique hash value, and no hash function guarantees that. Every hash function is collision-prone.
@jezrael thanks for the quick answer! :) I wish you a good day!
@jezrael I think about 1 to 2 million.
Added a few relevant details to the answer. Feel free to revert the changes if not ok.