Transform dataframe column with a hash value

Question

This is my dataframe.

  Student  Studendid          Student  Studendid          Student  Studendid   
0   Stud1    1              0   Stud1    ah274as        0   Stud1    1
1   Stud2    2              1   Stud2    ah474as        1   Stud2    2  
2   Stud3    3              2   Stud3    ah454as        2   Stud3    3  
3   Stud4    4      hash    3   Stud4    48sdfds  hash  3   Stud4    4  
4   Stud5    5       ->     4   Stud5    dash241    ->  4   Stud5    5 
5   Stud6    6              5   Stud6    asda212        5   Stud6    6
6   Stud7    7              6   Stud7    askdkj2        6   Stud7    7  
7   Stud8    8              7   Stud8    kadhh23        7   Stud8    8  
8   Stud9    9              8   Stud9    asdhb27        8   Stud9    9

I would like to hash the student ID of each student. I've already tried the hash() function, but I haven't found a way to reverse it. I would like to hash the values and then recover the originals. Which method can I use to hash the student ID and then reverse the hash?

df["Studendid"] = df["Student"].apply(hash)

Hashing is a one-way function, so you can't "hash back". Read more here on why hashing is a one-way function.
– Ch3steR, Commented Oct 30, 2020 at 7:51

Ch3steR · Accepted Answer · 2020-10-30 08:14:41Z

4

Like @Ch3steR commented:

This is correct assuming every value has a unique "hash value", but no such hash function exists as of now. Every hash function is collision-prone.

# Example for collision
hash(0.1) == hash(230584300921369408)
True

Note: Since Python 3.3, the hashes of str and bytes objects are salted with a random value by default. The salt changes every time the interpreter starts, so the same string hashes to a different value in each session. This protects against denial-of-service attacks that exploit hash collisions in dictionaries.

# Example taken from Martijn's answer: https://stackoverflow.com/a/27522708/12416453
>>> hash("235")
-310569535015251310

Now, open a new session.

>>> hash("235")
-1900164331622581997
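If the IDs need to stay the same across sessions, one option is a digest from hashlib instead of the built-in hash(). The following is only a minimal sketch: the helper name stable_hash and the truncation length of 12 hex characters are assumptions for illustration, not part of the original answer. Truncating the digest makes collisions more likely, and the result is still one-way.

```python
import hashlib


def stable_hash(value: str, length: int = 12) -> str:
    """Return the first `length` hex characters of the SHA-256 digest.

    SHA-256 depends only on the input bytes, so the result is the same in
    every interpreter session, unlike the salted built-in hash().
    `stable_hash` and `length` are hypothetical names for this sketch.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


# Student names as they appear in the question's dataframe.
students = [f"Stud{i}" for i in range(1, 10)]

# One stable ID per student name.
student_ids = {name: stable_hash(name) for name in students}
```

With a dataframe, the same function could be passed to Series.map, for example df['Student'].map(stable_hash).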

If there are only a few rows of data, you can use helper dictionaries: build d2 to map each hash to its original value, swap its keys and values to get d1 (value to hash), and pass each dictionary to Series.map:

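# d2 maps hash -> original value, d1 maps original value -> hash
d2 = {hash(x): x for x in df['Student']}
d1 = {v: k for k, v in d2.items()}

# Hash the column with d1, then recover the originals with d2
df['Studendid'] = df['Student'].map(d1)
df['orig'] = df['Studendid'].map(d2)
print(df)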
  Student            Studendid   orig
0   Stud1  6001180169368329239  Stud1
1   Stud2 -1507322317280771023  Stud2
2   Stud3 -2262724814055039076  Stud3
3   Stud4   364063172999472918  Stud4
4   Stud5  8548751638627509914  Stud5
5   Stud6  5647607776109616031  Stud6
6   Stud7   729989721669472240  Stud7
7   Stud8  4828368150311261883  Stud8
8   Stud9  8466663427818502594  Stud9
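Because any hash function can collide, the helper-dictionary approach silently loses a value if two different students hash to the same ID. The sketch below builds both dictionaries in one pass and raises an error on a collision. The function name build_lookup, the use of SHA-256, and the 16-character truncation are assumptions made for this example, not part of the original answer.

```python
import hashlib

import pandas as pd


def build_lookup(values):
    """Build value -> digest and digest -> value dictionaries.

    Raises ValueError if two different values produce the same digest.
    `build_lookup` is a hypothetical helper for this sketch.
    """
    forward = {}  # original value -> digest
    reverse = {}  # digest -> original value
    for value in values:
        digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]
        if digest in reverse and reverse[digest] != value:
            raise ValueError(f"collision: {value!r} and {reverse[digest]!r}")
        forward[value] = digest
        reverse[digest] = value
    return forward, reverse


df = pd.DataFrame({"Student": [f"Stud{i}" for i in range(1, 10)]})

# Only unique values need a dictionary entry, which keeps the lookup small.
forward, reverse = build_lookup(df["Student"].unique())

df["Studendid"] = df["Student"].map(forward)  # hash
df["orig"] = df["Studendid"].map(reverse)     # map back via the lookup
```

The reverse dictionary is what makes "hashing back" possible here: it is a stored lookup table, not an inversion of the hash, so it has to be kept (and protected) alongside the hashed data.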

edited Oct 30, 2020 at 8:14

Ch3steR

answered Oct 30, 2020 at 7:52

jezrael

4 Comments

Ch3steR Over a year ago

This is correct assuming every value has a unique "hash value", but no such hash function exists as of now. Every hash function is collision-prone.

user14540992 Over a year ago

@jezrael thanks for the quick answer! :) I wish you a good day!

user14540992 Over a year ago

@jezrael I think about 1 to 2 million

Ch3steR Over a year ago

Added a few relevant details to the answer. Feel free to revert the changes if not ok.
