I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.

df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.

df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]

The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?

1 Answer

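Both approaches hash row by row: pd.util.hash_pandas_object returns a Series with one uint64 hash per row, just like the apply version. The difference is in what gets hashed. hash_pandas_object hashes the underlying column values, so the result depends on each column's dtype, whereas Python's hash() treats numerically equal values such as 1 and 1.0 as equal. If the same column has a different dtype in df1 and df2, the row hashes will most likely not match even when the values look identical.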
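A small experiment can make the behaviour of pd.util.hash_pandas_object concrete. The sketch below uses two made-up frames, `a` and `b` (hypothetical data, not the asker's), where column `x` is int64 in one and float64 in the other. It checks how many row hashes match before and after aligning the dtypes, and compares that with the Python `hash(tuple(...))` approach.

```python
import pandas as pd

# Hypothetical data: column "x" is int64 in `a` but float64 in `b`.
a = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})
b = pd.DataFrame({"x": [2.0], "y": [1.5]})

# hash_pandas_object returns one hash per row.
hash_a = pd.util.hash_pandas_object(a, index=False)
hash_b = pd.util.hash_pandas_object(b, index=False)
rows_hashed = len(hash_a)

# With mismatched dtypes, the row (2, 1.5) is not recognised as shared.
matches_mismatched = int(hash_a.isin(hash_b).sum())

# After casting `b` to the dtypes of `a`, the same row hashes identically.
b_aligned = b.astype(a.dtypes)
hash_b_aligned = pd.util.hash_pandas_object(b_aligned, index=False)
matches_aligned = int(hash_a.isin(hash_b_aligned).sum())

# Python's hash() considers 2 and 2.0 equal, so the tuple hashes match
# even without aligning dtypes.
py_a = a.apply(lambda row: hash(tuple(row)), axis=1)
py_b = b.apply(lambda row: hash(tuple(row)), axis=1)
matches_python = int(py_a.isin(py_b).sum())

print(rows_hashed, matches_mismatched, matches_aligned, matches_python)
```

If the dtypes of the two frames differ in a similar way, this may explain why the lambda version found matches while the hash_pandas_object version found none.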

If your objective is to remove from df1 the rows that also appear in df2, you can do it faster with a left merge and the indicator option, then keep only the rows that are found in df1 alone.

df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
df_merged = df_merged[df_merged['_merge'] == 'left_only']  # indicator=True adds a '_merge' column; this keeps only the unmatched rows
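For reference, here is a minimal, self-contained sketch of the merge-based set difference. The frames and the `key_columns` list are made up for illustration; in practice `key_columns` would be the columns that identify a row in both dataframes.

```python
import pandas as pd

# Hypothetical example data: df2 is a subset of df1.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"id": [2, 4], "value": ["b", "d"]})

key_columns = ["id", "value"]  # assumed: columns that identify a row

# Left merge with indicator=True adds a "_merge" column whose values are
# "left_only", "right_only" or "both". Dropping duplicates from df2 first
# keeps the merge from multiplying rows of df1.
merged = df1.merge(
    df2.drop_duplicates(), how="left", on=key_columns, indicator=True
)

# Keep rows that appear only in df1, then discard the helper column.
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(only_in_df1)
```

Note that merge returns a new frame with a fresh integer index, so the original index of df1 is not preserved unless it is carried along as a column.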

3 Comments

Another SO question (stackoverflow.com/questions/25757042/…) mentioned hash_pandas_object in response to hashing each row, so I assumed it would do this. If not, what does it do exactly? I should have mentioned that I tried the merge method as well, but I'm getting a ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat. I can't figure out that error for my data.
The comments in that answer make it clear that the most popular answer hashes the complete object rather than every row. The error you have indicates that there's a mismatch in dtype, so it is best fixed by specifying the relevant dtypes, e.g. df.astype({'some_string_col': 'string', 'some_num_col': 'float'}), and so on for the other columns.
Just to complete this answer, my dataframes were taken from very similar datasets, but for some reason, the datatypes were not matching, so I first changed the datatypes for df2 to those of df1: df2 = df2.astype(df1.dtypes). Then I could merge (merged_df = df1.merge(df2, how='left', indicator=True)), and select the rows only in df1 as @SultanOrazbayev suggested.
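To locate which columns cause the "merge on object and float64" error, one option is to compare the dtypes of the two frames side by side before merging. This is only a sketch with invented data: the column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data: "id" is numeric in df1 but stored as text in df2.
df1 = pd.DataFrame({"id": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": ["2.0"], "label": ["b"]})

# Put the dtype names next to each other and keep the columns that differ.
dtype_table = pd.DataFrame(
    {"df1": df1.dtypes.astype(str), "df2": df2.dtypes.astype(str)}
)
mismatched = dtype_table[dtype_table["df1"] != dtype_table["df2"]]
print(mismatched)

# Cast df2 to the dtypes of df1, as described in the comment above.
df2_aligned = df2.astype(df1.dtypes)

# With matching dtypes the merge on all shared columns goes through.
merged = df1.merge(df2_aligned, how="left", indicator=True)
rows_only_in_df1 = int((merged["_merge"] == "left_only").sum())
print(rows_only_in_df1)
```

The cast assumes every value in df2 can be converted to the corresponding dtype in df1; if some values cannot (for example non-numeric text in a numeric column), astype will raise and those values need cleaning first.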
