
I have two Hive tables, A and B, and their respective data frames df_a and df_b:

A
+----+------+-----------+
| id | name | mobile1   |
+----+------+-----------+
| 1  | Matt | 123456798 |
+----+------+-----------+
| 2  | John | 123456798 |
+----+------+-----------+
| 3  | Lena |           |
+----+------+-----------+

B
+----+------+-----------+
| id | name | mobile2   |
+----+------+-----------+
| 3  | Lena | 123456798 |
+----+------+-----------+

I want to perform an operation similar to:

select A.name, nvl(nvl(A.mobile1, B.mobile2), 0) from A left outer join B on A.id = B.id

So far I've come up with

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(?)

I can't figure out how to conditionally select either mobile1, mobile2, or 0, as I did in the Hive query.

Could someone please help me with this? I'm using Spark 1.5.

  • What is the expected output? Commented Mar 13, 2017 at 8:01
  • @mtoto It's not the exact query I've written, but I'm trying to check whether table A (df_a) is missing the mobile number; if so, it should be taken from table B (df_b). If it's still not found, then 0 should be used as the mobile number. Commented Mar 13, 2017 at 10:42

2 Answers


Use coalesce:

import org.apache.spark.sql.functions._

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

It will use mobile1 if it's present; if not, mobile2; and if mobile2 is also null, then 0.
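
For completeness, a sketch of the full equivalent of the original Hive query, built on the same join and coalesce call as above (the result alias mobile is my own, not from the answer):

import org.apache.spark.sql.functions.{coalesce, lit}

// Left outer join A to B on id, then take the first non-null value:
// A.mobile1, falling back to B.mobile2, falling back to the literal 0.
val result = df_a
  .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a("name"),
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)).as("mobile")
  )

Note that coalesce falls through only on null; if the blank mobile1 cell in table A actually holds an empty string rather than null, an extra when/otherwise check would be needed.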


9 Comments

I get this error: not found: value coalesce. The Spark version I'm using is 1.5.0 and Scala is 2.10.4. Could this be a problem?
Thanks. The error message went away. For some reason, the Ctrl+Shift+O combo didn't work for auto import in my IDE.
I had another question too- In the select function, I wanted to select all columns of table A in a particular order (the real table has a LOT of columns). I have the column names and order in a string, e.g., "DFA.name,DFA.id,DFA.address". I wanted to select it all like df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer").select( "DFA.name,DFA.id,DFA.address", coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)) ). Could something like this be accomplished?
@Amber Use df_a.select(df_a.columns.filter(/* filter names here */).map(df_a(_)): _*)
Using : _* wouldn't ensure I get the columns in the order that I need :( The order I need is different from the actual order in the table, e.g., the table has the order id, name, address but I need name, id, address. I saw that I can't use a string along with columns in the select function. I could use select(df_a("name"), df_a("id"), df_a("address")), but I was hoping there would be a simpler way, because this way I'd have to mention over a hundred column names.
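
A sketch of one way to do what this last comment asks, assuming the desired order is kept in a comma-separated string as described (the variable names are mine): split the string, build Column references in that exact order, and pass them to select via varargs, which preserves the order of the sequence.

import org.apache.spark.sql.functions.{coalesce, lit}

// Hypothetical column-order string, per the comment above.
val columnOrder = "name,id,address"

// Build Column references in exactly the order given by the string.
val orderedCols = columnOrder.split(",").map(c => df_a(c.trim))

// Varargs select preserves the sequence order; append the coalesce
// expression at the end.
val result = df_a
  .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select((orderedCols :+ coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))): _*)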

You can use Spark SQL's nanvl function. After applying it, the query would look similar to:

import org.apache.spark.sql.functions.{lit, nanvl}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(df_a("name"), nanvl(nanvl(df_a("mobile1"), df_b("mobile2")), lit(0)))

1 Comment

The link you provided says "Returns col1 if it is not NaN, or col2 if col1 is NaN". NaN means Not a Number, right? I wanted the condition to work when no value is present in the table column.
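
To illustrate the point this comment raises: nanvl only replaces NaN values, which can occur in double/float columns, while a SQL null is not NaN and passes through nanvl unchanged. A minimal sketch of the difference, assuming a DataFrame df with a nullable double column x (both names are hypothetical):

import org.apache.spark.sql.functions.{coalesce, lit, nanvl}

// nanvl(a, b) returns a unless a is NaN; a null stays null.
val viaNanvl = df.select(nanvl(df("x"), lit(0.0)))

// coalesce(a, b) returns the first non-null argument, so a null
// becomes 0.0; this matches the "no value present" case.
val viaCoalesce = df.select(coalesce(df("x"), lit(0.0)))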
