I have two sample DataFrames, df_a and df_b:

df_a
+----+------+-----------+-----------+
| id | name | mobile1   | address   |
+----+------+-----------+-----------+
| 1  | Matt | 123456798 |           |
+----+------+-----------+-----------+
| 2  | John | 123456798 |           |
+----+------+-----------+-----------+
| 3  | Lena |           |           |
+----+------+-----------+-----------+

df_b

+----+------+-----------+---------+---------+
| id | name | mobile2   | city    | country |
+----+------+-----------+---------+---------+
| 3  | Lena | 123456798 | Detroit | USA     |
+----+------+-----------+---------+---------+

and I'm trying to select certain columns from both, as follows:

import org.apache.spark.sql.functions.{coalesce, lit}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  df_a("name"), df_a("id"), df_a("address"),
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
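
Here <=> is Spark's null-safe equality operator (null ids on both sides still match each other), and coalesce(..., lit(0)) falls back to 0 when both mobile columns are null.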

I want to do a similar operation on two real DataFrames, where df_a has a huge number of columns. I want to select all the columns of df_a, in a particular order, plus two columns from df_b. So I tried the following:

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )

and

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )

But apparently I'm passing the wrong type of arguments to select and selectExpr.
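
For reference, select is overloaded as select(cols: Column*) and select(col: String, cols: String*), while selectExpr is selectExpr(exprs: String*), so a single comma-joined String (or a String mixed with a Column) matches none of them. A minimal sketch of the Column-varargs form, using the sample schema and aliases above (untested on 1.5.0):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Each projected column is its own Column argument; the names are
// qualified with the DFA/DFB aliases since id and name exist in both frames.
df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    col("DFA.name"), col("DFA.id"), col("DFA.address"),
    coalesce(col("DFA.mobile1"), col("DFB.mobile2"), lit(0))
  )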

Could someone please help me with this? I'm using Spark 1.5.0.

Update

I tried the following:

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(DFA.mobile1, DFB.mobile2, 0)"
  )

and got this error:

org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

Then I tried:

val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )

and got:

org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

With:

val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )

I got:

java.lang.RuntimeException: [1.10] failure: identifier expected

name,id,address,coalesce(mobile1, mobile2, 0)
    ^
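
The parse failure suggests selectExpr treats each argument as a single SQL expression, so a comma-joined string fails at the first comma. A sketch that splits the string into one expression per argument instead (untested; the names are qualified to avoid ambiguous id/name references):

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"

// Split the comma-joined string into individual expressions, append the
// coalesce fallback, and expand the resulting array as varargs.
val exprs = df_a_cols.split(",") :+ "coalesce(DFA.mobile1, DFB.mobile2, 0)"

df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(exprs: _*)
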
Comments

  • select($"*", coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))) (Mar 13, 2017 at 15:37)
  • @zero323 Thanks for correcting me - I didn't have a chance to check it; I don't know why I thought it was possible :O (Mar 13, 2017 at 15:42)
  • @T.Gawęda Yeah, I'm pretty sure I've tried this once or twice myself. If you want to answer this, feel free to post the $"*" version. (Mar 13, 2017 at 18:04)
  • @zero323 $"*" won't ensure the order of columns I need. I'll see if I can find a way to do that. Is there a way to use val df_a_cols: String = "DFA.name,DFA.id,DFA.address" with $? Like $df_a_cols? (Mar 14, 2017 at 8:54)
  • Hi @Amber, I know this is an old thread, but did you find a solution? (Jun 1, 2020 at 9:03)
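
On the last question in the comments: $ builds a single Column from one name, so it can't expand a comma-separated string, but the string can be split and mapped to Columns, which also preserves the column order. A sketch (untested on 1.5.0):

import org.apache.spark.sql.functions.{coalesce, col, lit}

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"

// Turn each name into a Column, keeping the order of the string, then
// append the coalesce fallback and expand the Seq as varargs.
val cols = df_a_cols.split(",").map(name => col(name.trim)).toSeq :+
  coalesce(col("DFA.mobile1"), col("DFB.mobile2"), lit(0))

df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(cols: _*)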
