
I have two Hive tables, A and B, and their respective data frames df_a and df_b:

A
+----+------+-----------+
| id | name | mobile1   |
+----+------+-----------+
| 1  | Matt | 123456798 |
+----+------+-----------+
| 2  | John | 123456798 |
+----+------+-----------+
| 3  | Lena |           |
+----+------+-----------+

B
+----+------+-----------+
| id | name | mobile2   |
+----+------+-----------+
| 3  | Lena | 123456798 |
+----+------+-----------+

I want to perform an operation similar to:

select A.name, nvl(nvl(A.mobile1, B.mobile2), 0) from A left outer join B on A.id = B.id

So far I've come up with

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(?)

I can't figure out how to conditionally select either mobile1, mobile2, or 0, as I did in the Hive query.

Could someone please help me with this? I'm using Spark 1.5.

  • What is the expected output? Commented Mar 13, 2017 at 8:01
  • @mtoto It's not the exact query I've written, but I'm trying to check whether table A (df_a) is missing the mobile number; if so, it should be taken from table B (df_b). If it's still not found, then 0 should be used as the mobile number. Commented Mar 13, 2017 at 10:42

2 Answers


Use coalesce:

import org.apache.spark.sql.functions._

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)

It will use mobile1 if it's present; if not, mobile2; and if mobile2 is also null, then 0.
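
For completeness, a sketch of the full equivalent of the original Hive query, built on the same join and coalesce call as above (the result alias mobile is my own, not from the answer):

import org.apache.spark.sql.functions.{coalesce, lit}

// Left outer join A to B on id, then take the first non-null value:
// A.mobile1, falling back to B.mobile2, falling back to the literal 0.
val result = df_a
  .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a("name"),
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)).as("mobile")
  )

Note that coalesce falls through only on null; if the blank mobile1 cell in table A actually holds an empty string rather than null, an extra when/otherwise check would be needed.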


9 Comments

I get this error: not found: value coalesce. The Spark version I'm using is 1.5.0 and Scala is 2.10.4. Could this be a problem?
Thanks. The error message went away. For some reason, the Ctrl+Shift+O combo didn't work for auto import in my IDE.
I had another question too- In the select function, I wanted to select all columns of table A in a particular order (the real table has a LOT of columns). I have the column names and order in a string, e.g., "DFA.name,DFA.id,DFA.address". I wanted to select it all like df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer").select( "DFA.name,DFA.id,DFA.address", coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)) ). Could something like this be accomplished?
@Amber Use df_a.select(df_a.columns.filter(/* filter names here */).map(df_a(_)): _*)
Using : _* wouldn't ensure I get the columns in the order that I need :( The order I need is different from the actual order in the table, e.g., the table has the order id, name, address but I need name, id, address. I saw that I can't use a string along with columns in the select function. I could use select(df_a("name"), df_a("id"), df_a("address")), but I was hoping there would be a simpler way, because this way I'd have to mention over a hundred column names.
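
A sketch of one way to do what this last comment asks, assuming the desired order is kept in a comma-separated string as described (the variable names are mine): split the string, build Column references in that exact order, and pass them to select via varargs, which preserves the order of the sequence.

import org.apache.spark.sql.functions.{coalesce, lit}

// Hypothetical column-order string, per the comment above.
val columnOrder = "name,id,address"

// Build Column references in exactly the order given by the string.
val orderedCols = columnOrder.split(",").map(c => df_a(c.trim))

// Varargs select preserves the sequence order; append the coalesce
// expression at the end.
val result = df_a
  .join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select((orderedCols :+ coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))): _*)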

You can use Spark SQL's nanvl function. After applying it, the query would look similar to:

import org.apache.spark.sql.functions.{lit, nanvl}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(df_a("name"), nanvl(nanvl(df_a("mobile1"), df_b("mobile2")), lit(0)))

1 Comment

The link you provided says "Returns col1 if it is not NaN, or col2 if col1 is NaN". NaN means Not a Number, right? I wanted the condition to work when no value is present in the table column.
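
To illustrate the point this comment raises: nanvl only replaces NaN values, which can occur in double/float columns, while a SQL null is not NaN and passes through nanvl unchanged. A minimal sketch of the difference, assuming a DataFrame df with a nullable double column x (both names are hypothetical):

import org.apache.spark.sql.functions.{coalesce, lit, nanvl}

// nanvl(a, b) returns a unless a is NaN; a null stays null.
val viaNanvl = df.select(nanvl(df("x"), lit(0.0)))

// coalesce(a, b) returns the first non-null argument, so a null
// becomes 0.0; this matches the "no value present" case.
val viaCoalesce = df.select(coalesce(df("x"), lit(0.0)))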
