
Let's say I have two DataFrames -- df1 and df2 -- both with the columns foo and bar. The column foo is a CRC32 hash value like 123456, the column bar is a boolean field that defaults to False.

In PySpark, what is an efficient way to compare the values of foo across the two DataFrames, setting bar to True wherever they do not match?

e.g., given the following two DataFrames:

# df1
foo    | bar
-------|------
123456 | False
444555 | False
666777 | False
888999 | False

# df2
foo    | bar
-------|------
938894 | False
129803 | False
666777 | False
888999 | False

I would like a new DataFrame that looks like the following, with bar set to True in the two rows where the hashes have changed:

# df3
foo    | bar
-------|------
938894 | True <---
129803 | True <---
666777 | False
888999 | False

Any guidance would be much appreciated.

UPDATE 7/1/2018

After using the accepted answer successfully for quite some time, I encountered a situation that makes it a poor fit. If multiple rows in one of the joined DataFrames share a value of foo with a row in the other DataFrame, the join produces a Cartesian product of the rows with that shared value.

In my case, I had CRC32 hash values computed from empty strings, which hash to 0. I also should have mentioned that I do have a unique string to match rows on, called id here (I may have oversimplified the situation), and perhaps that is the thing to join on:

It would create situations like this:

# df1
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |444555 | False
ghi  |0      | False
jkl  |0      | False

# df2
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |999999 | False
ghi  |666777 | False
jkl  |0      | False

And with the selected answer, I would get a DataFrame with more rows than desired:

# df3
id   |foo    | bar
-----|-------|------
abc  |123456 | False
def  |999999 | True <---
ghi  |0      | False
jkl  |0      | False
jkl  |0      | False # extra row added by the join
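The extra row is ordinary join semantics: each left row pairs with every matching right row, so duplicate keys multiply the output. A minimal plain-Python sketch (not Spark; the helper is hypothetical) of a left join on foo over the tables above:

```python
# Plain-Python sketch of left-join semantics on foo: every left row pairs
# with EVERY matching right row, so duplicate keys multiply the output.
def left_join_on_foo(left, right):
    out = []
    for lrow in left:
        matches = [rrow for rrow in right if rrow["foo"] == lrow["foo"]]
        if matches:
            out.extend((lrow, rrow) for rrow in matches)
        else:
            out.append((lrow, None))  # unmatched left rows are kept with null
    return out

df1_rows = [{"id": "abc", "foo": 123456}, {"id": "def", "foo": 444555},
            {"id": "ghi", "foo": 0}, {"id": "jkl", "foo": 0}]
df2_rows = [{"id": "abc", "foo": 123456}, {"id": "def", "foo": 999999},
            {"id": "ghi", "foo": 666777}, {"id": "jkl", "foo": 0}]

result = left_join_on_foo(df2_rows, df1_rows)
print(len(result))  # 5 rows from 4 inputs: foo=0 matched two df1 rows
```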

I'm going to keep the answer as selected, because it's a great answer to the question as originally posed. But any suggestions for how to handle DataFrames where values in the column foo may repeat would be appreciated.

ANOTHER UPDATE 7/1/2018, ALTERNATE ANSWER

I was overcomplicating the issue without the id column to join on. Using it, it's relatively straightforward to join and write the transformed column based on a direct comparison of the foo (fingerprint) column:

import pyspark.sql.functions as f

df2.alias("df2").join(df1.alias("df1"), f.col('df1.id') == f.col('df2.id'), 'left')\
    .select(f.col('df2.id'), f.col('df2.foo'),
            f.when(f.col('df1.foo') != f.col('df2.foo'), f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
    .show(truncate=False)
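For illustration, the same comparison expressed in plain Python (a sketch, not Spark), assuming id is unique in each DataFrame; the data mirrors the example tables above:

```python
# Plain-Python sketch of the id-join comparison: one lookup per id, so no
# row multiplication even when foo values repeat.
df1_by_id = {"abc": 123456, "def": 444555, "ghi": 0, "jkl": 0}  # id -> foo
df2_rows = [{"id": "abc", "foo": 123456, "bar": False},
            {"id": "def", "foo": 999999, "bar": False},
            {"id": "ghi", "foo": 666777, "bar": False},
            {"id": "jkl", "foo": 0,      "bar": False}]

# bar flips to True only where the hash for the same id has changed
df3 = [{"id": r["id"], "foo": r["foo"],
        "bar": True if df1_by_id.get(r["id"]) != r["foo"] else r["bar"]}
       for r in df2_rows]
```

Because the lookup is keyed on the unique id, the duplicate foo=0 rows no longer multiply.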

2 Answers


An aliased left join of df2 with df1, combined with the when function to check for the not-matched case, should give you your desired output:

import pyspark.sql.functions as f

df2.alias("df2").join(df1.alias("df1"), f.col('df1.foo') == f.col('df2.foo'), 'left')\
    .select(f.col('df2.foo'), f.when(f.isnull(f.col('df1.foo')), f.lit(True)).otherwise(f.col('df2.bar')).alias('bar'))\
    .show(truncate=False)

which should give you

+------+-----+
|foo   |bar  |
+------+-----+
|129803|true |
|938894|true |
|888999|false|
|666777|false|
+------+-----+
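The isnull check after the left join amounts to a set-membership test: a df2 row whose foo has no partner in df1 gets bar = True. A plain-Python sketch (not Spark) of the same idea:

```python
# Plain-Python sketch: unmatched foo values (the null side of the left
# join) flip bar to True; matched ones keep their existing bar value.
df1_foos = {123456, 444555, 666777, 888999}
df2_rows = [{"foo": 938894, "bar": False}, {"foo": 129803, "bar": False},
            {"foo": 666777, "bar": False}, {"foo": 888999, "bar": False}]

df3 = [{"foo": r["foo"],
        "bar": True if r["foo"] not in df1_foos else r["bar"]}
       for r in df2_rows]
```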

1 Comment

That's the business! Works great, thanks so much. Selecting the column with a when/otherwise statement, and then aliasing as bar, that was the key to understanding how I could use that in other contexts. Thanks!

I would suggest using a left join, writing the code so that when the joined data is null you output True, and otherwise keep the existing bar value.

1 Comment

Thanks, I had figured the same but was having trouble envisioning the syntax.
