
Let's say I have two DataFrames -- df1 and df2 -- both with the columns foo and bar. The column foo is a CRC32 hash value like 123456, the column bar is a boolean field that defaults to False.

In PySpark, what is an efficient way to compare the values of foo across the two DataFrames, setting bar to True wherever they do not match?

e.g., given the following two DataFrames:

# df1
foo    | bar
-------|------
123456 | False
444555 | False
666777 | False
888999 | False

# df2
foo    | bar
-------|------
938894 | False
129803 | False
666777 | False
888999 | False

I would like a new DataFrame that looks like the following, with bar set to True in the two rows where the hashes have changed:

# df3
foo    | bar
-------|------
938894 | True <---
129803 | True <---
666777 | False
888999 | False

Any guidance would be much appreciated.

UPDATE 7/1/2018

After using the accepted answer successfully for quite some time, I encountered a situation that makes it a poor fit. If multiple rows in one of the joined DataFrames share a value of foo with a row in the other DataFrame, the join produces a Cartesian product of the rows with that shared value.

In my case, I had CRC32 hash values computed from empty strings, which hash to 0. I also should have mentioned that I do have a unique string to match rows on, called id here (I may have oversimplified the situation), and perhaps that is the thing to join on:

It would create situations like this:

# df1
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |444555 | False
ghi  |0      | False
jkl  |0      | False

# df2
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |999999 | False
ghi  |666777 | False
jkl  |0      | False

And with the selected answer, I would get a DataFrame with more rows than desired:

# df3
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |999999 | True <---
ghi  |0      | False
jkl  |0      | False
jkl  |0      | False # extra row added by the join
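The extra row is ordinary join semantics: each left row pairs with every matching right row, so duplicate keys multiply the output. A minimal plain-Python sketch (not Spark; the helper is hypothetical) of a left join on foo over the tables above:

```python
# Plain-Python sketch of left-join semantics on foo: every left row pairs
# with EVERY matching right row, so duplicate keys multiply the output.
def left_join_on_foo(left, right):
    out = []
    for lrow in left:
        matches = [rrow for rrow in right if rrow["foo"] == lrow["foo"]]
        if matches:
            out.extend((lrow, rrow) for rrow in matches)
        else:
            out.append((lrow, None))  # unmatched left rows are kept with null
    return out

df1_rows = [{"id": "abc", "foo": 123456}, {"id": "def", "foo": 444555},
            {"id": "ghi", "foo": 0}, {"id": "jkl", "foo": 0}]
df2_rows = [{"id": "abc", "foo": 123456}, {"id": "def", "foo": 999999},
            {"id": "ghi", "foo": 666777}, {"id": "jkl", "foo": 0}]

result = left_join_on_foo(df2_rows, df1_rows)
print(len(result))  # 5 rows from 4 inputs: foo=0 matched two df1 rows
```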

I'm going to keep the answer as selected, because it's a great answer to the question as originally posed. But any suggestions for how to handle DataFrames where values in the column foo may repeat would be appreciated.

ANOTHER UPDATE 7/1/2018, ALTERNATE ANSWER

I was overcomplicating the issue without the id column to join on. Using it, it's relatively straightforward to join and write the transformed column based on a direct comparison of the foo (fingerprint) column:

import pyspark.sql.functions as f

df2.alias("df2").join(df1.alias("df1"), f.col('df1.id') == f.col('df2.id'), 'left')\
    .select(f.col('df2.id'), f.col('df2.foo'),
            f.when(f.col('df1.foo') != f.col('df2.foo'), f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
    .show(truncate=False)
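For illustration, the same comparison expressed in plain Python (a sketch, not Spark), assuming id is unique in each DataFrame; the data mirrors the example tables above:

```python
# Plain-Python sketch of the id-join comparison: one lookup per id, so no
# row multiplication even when foo values repeat.
df1_by_id = {"abc": 123456, "def": 444555, "ghi": 0, "jkl": 0}  # id -> foo
df2_rows = [{"id": "abc", "foo": 123456, "bar": False},
            {"id": "def", "foo": 999999, "bar": False},
            {"id": "ghi", "foo": 666777, "bar": False},
            {"id": "jkl", "foo": 0,      "bar": False}]

# bar flips to True only where the hash for the same id has changed
df3 = [{"id": r["id"], "foo": r["foo"],
        "bar": True if df1_by_id.get(r["id"]) != r["foo"] else r["bar"]}
       for r in df2_rows]
```

Because the lookup is keyed on the unique id, the duplicate foo=0 rows no longer multiply.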

2 Answers


An aliased left join of df2 with df1, combined with the when function to check for the not-matched case, should give you your desired output:

import pyspark.sql.functions as f

df2.alias("df2").join(df1.alias("df1"), f.col('df1.foo') == f.col('df2.foo'), 'left')\
    .select(f.col('df2.foo'), f.when(f.isnull(f.col('df1.foo')), f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
    .show(truncate=False)

which should give you

+------+-----+
|foo   |bar  |
+------+-----+
|129803|true |
|938894|true |
|888999|false|
|666777|false|
+------+-----+
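The isnull check after the left join amounts to a set-membership test: a df2 row whose foo has no partner in df1 gets bar = True. A plain-Python sketch (not Spark) of the same idea:

```python
# Plain-Python sketch: unmatched foo values (the null side of the left
# join) flip bar to True; matched ones keep their existing bar value.
df1_foos = {123456, 444555, 666777, 888999}
df2_rows = [{"foo": 938894, "bar": False}, {"foo": 129803, "bar": False},
            {"foo": 666777, "bar": False}, {"foo": 888999, "bar": False}]

df3 = [{"foo": r["foo"],
        "bar": True if r["foo"] not in df1_foos else r["bar"]}
       for r in df2_rows]
```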

1 Comment

That's the business! Works great, thanks so much. Selecting the column with a when/otherwise statement, and then aliasing as bar, that was the key to understanding how I could use that in other contexts. Thanks!

I would suggest using a left join, writing the code so that when the joined data is null you output True, and otherwise keep the existing bar value.

1 Comment

Thanks, I had figured the same but was having trouble envisioning the syntax.
