Pandas `hash_pandas_object` not producing duplicate hash values for duplicate entries

Question

I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.

df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.

df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?

SultanOrazbayev · Accepted Answer · 2023-07-09 03:50:13Z

2

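The difference is that the two approaches do not hash the same thing: hash_pandas_object hashes the underlying column values of the dataframe, so it is sensitive to each column's dtype, while in the second case you are applying Python's hash to the tuple of values in each individual row, where equal numbers such as 1 and 1.0 hash the same.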
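To make the dtype point concrete, here is a small check on made-up data (the frames and column names are purely illustrative, not from the question). As far as I can tell, hash_pandas_object on a DataFrame returns one hash per row, but the same values stored under a different dtype give different hashes, whereas hash(tuple(row)) does not care whether a number is stored as int or float.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3], "name": ["b", "c"]})  # same dtypes as df1
df2_float = df2.astype({"id": "float64"})               # same values, different dtype

# One uint64 hash per row (index excluded from the hash).
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)
h2_float = pd.util.hash_pandas_object(df2_float, index=False)

# With matching dtypes, the rows shared with df2 are found.
matches_same_dtype = h1.isin(h2)

# With int64 vs float64, none of the row hashes line up.
matches_float = h1.isin(h2_float)

# Python's hash treats 2 and 2.0 as equal, so the tuple approach still matches.
t1 = df1.apply(lambda row: hash(tuple(row)), axis=1)
t2_float = df2_float.apply(lambda row: hash(tuple(row)), axis=1)
matches_tuple = t1.isin(t2_float)

print(matches_same_dtype.tolist())
print(matches_float.tolist())
print(matches_tuple.tolist())
```

If this is what is happening in your data, comparing df1.dtypes with df2.dtypes should show which columns differ.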

If your objective is to remove the rows of df1 that also appear in df2, you can achieve it faster with a left merge using the indicator option, and then keep only the rows that are marked as present in the left (original) dataframe alone.

# list_columns is the list of column names to compare on
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
# indicator=True adds a column named '_merge'; keep only the unmatched rows
df_merged = df_merged[df_merged['_merge'] == 'left_only']
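For anyone who wants to try this, here is a self-contained sketch of the merge approach on toy data. The frames and the contents of list_columns are assumptions for illustration. Note that indicator=True stores its result in a column named _merge.

```python
import pandas as pd

# Toy data, assumed for illustration; df2 is a subset of df1.
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
df2 = pd.DataFrame({"a": [2, 4], "b": ["x", "z"]})

# Columns that define "the same row" (hypothetical choice: all of them).
list_columns = ["a", "b"]

# Left merge: every row of df1 is kept, and '_merge' records whether
# a matching row was found in df2 ('both') or not ('left_only').
df_merged = df1.merge(df2, how="left", on=list_columns, indicator=True)

# Keep the rows that exist only in df1, then drop the helper column.
result = df_merged[df_merged["_merge"] == "left_only"].drop(columns="_merge")

print(result)
```

This avoids a Python-level function call per row, which is the part that makes the apply/lambda version slow on millions of rows.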

edited Jul 9, 2023 at 3:50

answered Apr 13, 2021 at 15:32


3 Comments

CopyOfA Over a year ago

Another SO question (stackoverflow.com/questions/25757042/…) mentioned hash_pandas_object in response to hashing each row, so I assumed it would do this. If not, what does it do exactly? I should've mentioned that I tried the merge method as well, but I'm getting "ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat". I can't figure out that error for my data.

SultanOrazbayev Over a year ago

The comments on that answer make it clear that the most popular answer hashes the complete object rather than every row. The error you have indicates that there's a dtype mismatch between the columns being merged, so that is best fixed by specifying the relevant dtypes, e.g. df.astype({'some_string_col': 'string', 'some_num_col': 'float'}), and so on for the other columns.

CopyOfA Over a year ago

Just to complete this answer, my dataframes were taken from very similar datasets, but for some reason, the datatypes were not matching, so I first changed the datatypes for df2 to those of df1: df2 = df2.astype(df1.dtypes). Then I could merge (merged_df = df1.merge(df2, how='left', indicator=True)), and select the rows only in df1 as @SultanOrazbayev suggested.
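The steps described in this comment can be sketched roughly as follows. The data is made up: the value column is stored as float64 in df1 and as strings (object) in df2, which is only a guess at the kind of mismatch that produces the merge error above.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df1 = pd.DataFrame({"name": ["a", "b", "c", "d"],
                    "value": [1.0, 2.0, 3.0, 4.0]})    # value is float64
df2 = pd.DataFrame({"name": ["b", "d"],
                    "value": ["2.0", "4.0"]})          # value is object (strings)

# Cast every column of df2 to the dtype of the same-named column in df1.
df2_aligned = df2.astype(df1.dtypes)

# With no 'on' argument, merge uses all columns the two frames share.
merged_df = df1.merge(df2_aligned, how="left", indicator=True)

# Rows present only in df1, restricted to df1's original columns.
only_in_df1 = merged_df.loc[merged_df["_merge"] == "left_only", df1.columns]

print(only_in_df1)
```

The astype call will raise if a value in df2 cannot be converted to the target dtype, so it is worth checking df1.dtypes against df2.dtypes first to see what is being cast.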
