Let's say I have two DataFrames -- df1 and df2 -- both with the columns foo and bar. The column foo is a CRC32 hash value like 123456; the column bar is a boolean field that defaults to False.
In PySpark, what is an efficient way to compare the values of foo across the two DataFrames, setting the column bar to True wherever they do not match?
e.g., given the following two DataFrames:
# df1
foo | bar
-------|------
123456 | False
444555 | False
666777 | False
888999 | False
# df2
foo | bar
-------|------
938894 | False
129803 | False
666777 | False
888999 | False
I would like a new DataFrame that looks like the following, with bar set to True in the two rows where the hashes have changed:
# df3
foo | bar
-------|------
938894 | True <---
129803 | True <---
666777 | False
888999 | False
Any guidance would be much appreciated.
UPDATE 7/1/2018
After using the accepted answer successfully for quite some time, I encountered a situation where it is not a great fit. If multiple rows from one of the joined DataFrames share the same value of foo with a row from the other DataFrame, the join produces a Cartesian-product growth of rows on that shared value.
In my case, I had CRC32 hash values computed from an empty string, which hashes to 0. I should also have mentioned that I do have a unique string to match the rows on, shown as id here (I may have oversimplified the situation), and perhaps this is the thing to join on:
It would create situations like this:
# df1
id |foo | bar
-----|-------|------
abc |123456 | False
def |444555 | False
ghi |0 | False
jkl |0 | False
# df2
id |foo | bar
-----|-------|------
abc |123456 | False
def |999999 | False
ghi |666777 | False
jkl |0 | False
And with the selected answer, I would get a DataFrame with more rows than desired:
# df3
id |foo | bar
-----|-------|------
abc |123456 | False
def |999999 | True <---
ghi |0 | False
jkl |0 | False
jkl |0 | False # extra row added through the join
I'm going to keep the answer as selected, because it's a great answer to the question as originally posed. But any suggestions for how to handle DataFrames where values in the foo column may collide would be appreciated.
ANOTHER UPDATE 7/1/2018, ALTERNATE ANSWER
I was overcomplicating the issue by leaving the id column out of the join. Using it, it's relatively straightforward to join on id and write the transformed bar column based on a direct comparison of the foo columns:
import pyspark.sql.functions as f

df2.alias("df2").join(df1.alias("df1"), df1.id == df2.id, 'left')\
    .select(f.col('df2.foo'), f.when(df1.foo != df2.foo, f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
    .show(truncate=False)