I have two sample DataFrames, df_a and df_b:

df_a
+----+------+-----------+-----------+
| id | name | mobile1   | address   |
+----+------+-----------+-----------+
| 1  | Matt | 123456798 |           |
+----+------+-----------+-----------+
| 2  | John | 123456798 |           |
+----+------+-----------+-----------+
| 3  | Lena |           |           |
+----+------+-----------+-----------+

df_b

+----+------+-----------+---------+---------+
| id | name | mobile2   | city    | country |
+----+------+-----------+---------+---------+
| 3  | Lena | 123456798 | Detroit | USA     |
+----+------+-----------+---------+---------+

and I'm trying to select certain columns from both, as follows:

import org.apache.spark.sql.functions.{coalesce, lit}

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
  df_a("name"), df_a("id"), df_a("address"),
  coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
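
Here <=> is Spark's null-safe equality operator (null ids on both sides still match each other), and coalesce(..., lit(0)) falls back to 0 when both mobile columns are null.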

I want to do a similar operation on two real DataFrames, where df_a has a huge number of columns. I want to select all the columns of df_a, in a particular order, plus two columns from df_b. So I tried the following:

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )

and

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols,
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
  )

But apparently I'm passing the wrong type of arguments to select and selectExpr.
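
For reference, select is overloaded as select(cols: Column*) and select(col: String, cols: String*), while selectExpr is selectExpr(exprs: String*), so a single comma-joined String (or a String mixed with a Column) matches none of them. A minimal sketch of the Column-varargs form, using the sample schema and aliases above (untested on 1.5.0):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Each projected column is its own Column argument; the names are
// qualified with the DFA/DFB aliases since id and name exist in both frames.
df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    col("DFA.name"), col("DFA.id"), col("DFA.address"),
    coalesce(col("DFA.mobile1"), col("DFB.mobile2"), lit(0))
  )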

Could someone please help me with this? I'm using Spark 1.5.0.

Update

I tried the following:

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(DFA.mobile1, DFB.mobile2, 0)"
  )

and got this error:

org.apache.spark.sql.AnalysisException: cannot resolve 'DFA.name,DFA.id,DFA.address,coalesce(DFA.mobile1, DFB.mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

Then I tried:

val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )

and got:

org.apache.spark.sql.AnalysisException: cannot resolve ' name,id,address,coalesce(mobile1, mobile2, 0) ' given input columns id, name, mobile1, address, mobile2, city, country;

With:

val df_a_cols: String = "name,id,address"
df_a.as("DFA").join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(
    df_a_cols + ",coalesce(mobile1, mobile2, 0)"
  )

I got:

java.lang.RuntimeException: [1.10] failure: identifier expected

name,id,address,coalesce(mobile1, mobile2, 0)
    ^
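
The parse failure suggests selectExpr treats each argument as a single SQL expression, so a comma-joined string fails at the first comma. A sketch that splits the string into one expression per argument instead (untested; the names are qualified to avoid ambiguous id/name references):

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"

// Split the comma-joined string into individual expressions, append the
// coalesce fallback, and expand the resulting array as varargs.
val exprs = df_a_cols.split(",") :+ "coalesce(DFA.mobile1, DFB.mobile2, 0)"

df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .selectExpr(exprs: _*)
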
Comments

  • select($"*", coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))) (Mar 13, 2017 at 15:37)
  • @zero323 Thanks for correcting me - I didn't have a chance to check it; I don't know why I thought it was possible :O (Mar 13, 2017 at 15:42)
  • @T.Gawęda Yeah, I'm pretty sure I've tried this once or twice myself. If you want to answer this, feel free to post the $"*" version. (Mar 13, 2017 at 18:04)
  • @zero323 $"*" won't ensure the order of columns I need. I'll see if I can find a way to do that. Is there a way to use val df_a_cols: String = "DFA.name,DFA.id,DFA.address" with $? Like $df_a_cols? (Mar 14, 2017 at 8:54)
  • Hi @Amber, I know this is an old thread, but did you find a solution? (Jun 1, 2020 at 9:03)
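
On the last question in the comments: $ builds a single Column from one name, so it can't expand a comma-separated string, but the string can be split and mapped to Columns, which also preserves the column order. A sketch (untested on 1.5.0):

import org.apache.spark.sql.functions.{coalesce, col, lit}

val df_a_cols: String = "DFA.name,DFA.id,DFA.address"

// Turn each name into a Column, keeping the order of the string, then
// append the coalesce fallback and expand the Seq as varargs.
val cols = df_a_cols.split(",").map(name => col(name.trim)).toSeq :+
  coalesce(col("DFA.mobile1"), col("DFB.mobile2"), lit(0))

df_a.as("DFA")
  .join(df_b.as("DFB"), df_a("id") <=> df_b("id"), "left_outer")
  .select(cols: _*)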
