0

Im joining 2 dataframes like so: val joinCols = Array("first_name", "last_name") val df_subset_joined = df1_subset.as("a").join(df2_subset.as("b"), joinCols, "full_outer") df_subset_joined.show()

This is the result of the above code:

Dataframe of differences between 2 dataframes
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
|     will |    smith|           67|           67|
|   george |  clooney|           67|           67|
|   george |  clooney|           67|           88|
|    blake |   lively|           66|         null|
|    celena|    gomez|         null|            2|
|       eva|    green|           44|           56|
|      null|     null|             |         null|
|     jason|    momoa|           34|           34|
|        ed|  sheeran|           88|         null|
|    lionel|    messi|           88|           88|
|      kyle|   jenner|         null|           56|
|      tom |   cruise|           66|           34|
|      tom |   cruise|           66|           99|
|      brad|     pitt|           99|           78|
|      ryan| reynolds|           45|         null|
+----------+---------+-------------+-------------+

As you can see there are columns with null values.

I run the following code next:

val filter_str = s"a.$col"+" != "+s"b.$col"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()

I get the foll dataframe:

Below is the dataframe of differences between DF1 and DF1 based on the comparison between:
a.loyalty_score != b.loyalty_score
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
|      tom |   cruise|           66|           99|
|      tom |   cruise|           66|           34|
|       eva|    green|           44|           56|
|      brad|     pitt|           99|           78|
|   george |  clooney|           67|           88|
+----------+---------+-------------+-------------+

Why dont I see the rows where there are null values in 1 column and a actual value in another. Shouldnt this satisfy value != null

How can I make my filter statement make the null values appear in the final dataframe

1 Answer 1

1

The reason you don't get any rows where there is null in one column and non-null in the other is that the comparison returns FALSE.

To avoid this, use the null-safe comparison operator <=>, in conjunction with not.

val filter_str = "not(" + s"a.$col"+" <=> "+s"b.$col)" 
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()

From the documentation,

expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.

Arguments:

expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable. Examples:

SELECT 2 <=> 2; true

SELECT 1 <=> '1'; true

SELECT true <=> NULL; false

SELECT NULL <=> NULL; true

Sign up to request clarification or add additional context in comments.

1 Comment

This concept was correct, thanks, but the syntax that worked for me was as follows: val filter_str = "not(" + s"a.$col"+" <=> "+s"b.$col)"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.