Hashing Pandas dataframe breaks

Question

First import:

import pandas as pd
import numpy as np
import hashlib

Next, consider the following:

np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())

Multiple executions of this snippet yield the same hash twice all the time: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.

However, if we consider the following:

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())

Note that there are strings in the data now. The hash of the arr is fixed (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de) for different evaluations, but the one of the pandas.DataFrame changes each time.

Any pythonic way around it? No Pythonic?

Edit: Related links:

Hashable DataFrames

dspencer · Accepted Answer · 2020-03-19 06:45:45Z

2

A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, starting in version 0.20.1.

answered Mar 19, 2020 at 6:45

dspencer

4,5014 gold badges25 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Ankush Khanna · Accepted Answer · 2017-05-26 08:49:30Z

0

According to me when you are using string as values for your cells. Data frame type is object

df.dtypes

shows that. That is why you get different hash each time.

answered May 26, 2017 at 8:49

Ankush Khanna

12 bronze badges

Comments

Dror · Accepted Answer · 2017-05-26 09:30:09Z

0

Naive workaround is to get a string representation of the whole dataframe and hash it. In particular either of the following can work:

print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())

Naturally, this is going to be very length for big dataframes.

Still, the it remains that pd.DataFrame(arr).values != arr, and this is counter-intuitive.

See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691

edited May 26, 2017 at 9:30

answered May 26, 2017 at 9:06

Dror

13.2k24 gold badges100 silver badges171 bronze badges

1 Comment

Dror Over a year ago

Blog post summary

Bao Wei · Accepted Answer · 2020-03-19 06:26:18Z

0

I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.

answered Mar 19, 2020 at 6:26

Bao Wei

1

Collectives™ on Stack Overflow

Hashing Pandas dataframe breaks

4 Answers 4

Comments

Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related