Spark SQL QUERY join on Same column name

Question

I am writing a join query for two DataFrames. I have to join on a column that has the same name in both DataFrames. How can I write this as a SQL query?

var df1 = Seq((1,"har"),(2,"ron"),(3,"fred")).toDF("ID", "NAME")
var df2 = Seq(("har", "HARRY"),("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

I know we can do df3 = df1.join(df2, Seq("NAME")), where NAME is the common column. In that case df3 has only the columns ID, NAME and ACTUALNAME.

If we do it from SQL, the query is select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME. The output DataFrame then has the columns ID, NAME, NAME, ACTUALNAME. How can I remove the extra NAME column that came from df2?

This does not work either: spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(df2("NAME"))

Is there a cleaner way to do this? Renaming the df2 columns is a last resort that I would rather avoid. In my scenario, writing SQL queries is easier than using the DataFrame API, so I am looking only for Spark SQL specific answers.

Mahesh Gupta · Accepted Answer · 2019-10-31 09:26:40Z

2

Try this: you can use col() to refer to the column by its qualified name.

scala> spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(col("table2.NAME")).show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+


Hristo Iliev · Accepted Answer · 2019-10-31 11:51:28Z

This is mostly an academic exercise, but you can also do it without the need to drop columns by switching on the ability of Spark SQL to interpret regular expressions in quoted identifiers, an ability inherited from Hive SQL. You need to set spark.sql.parser.quotedRegexColumnNames to true when building the Spark context for this to work.

$ spark-shell --master "local[*]" --conf spark.sql.parser.quotedRegexColumnNames=true
...
scala> spark.sql("select table1.*, table2.`^(?!NAME$).*$` from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").show()
+---+----+----------+
| ID|NAME|ACTUALNAME|
+---+----+----------+
|  1| har|     HARRY|
|  2| ron|    RONALD|
|  3|fred|      null|
+---+----+----------+

Here

table2.`^(?!NAME$).*$`

resolves to all columns of table2 except NAME. Any valid Java regular expression should work.
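
To see what the pattern selects, it can be tried against the column names with plain Java regular expressions, outside Spark. This is a minimal sketch: the object and method names are made up for illustration, and it only exercises the pattern itself, not how Spark resolves quoted identifiers.

```scala
// Hypothetical helper for illustration; not part of Spark's API.
object RegexColumnSketch {
  // The same pattern that sits inside the backticks in the query above.
  val pattern = "^(?!NAME$).*$"

  // Keep only the column names that the pattern matches in full.
  def matching(columns: Seq[String]): Seq[String] =
    columns.filter(_.matches(pattern))

  def main(args: Array[String]): Unit = {
    // Column names of table2 from the question.
    println(matching(Seq("NAME", "ACTUALNAME"))) // prints List(ACTUALNAME)
  }
}
```

The negative lookahead rejects only the exact name NAME, so a column such as NAMES would still be selected.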

pissall · Accepted Answer · 2019-10-31 09:23:31Z

0

If you do not apply an alias to the dataframe, you’ll receive an error after you create your joined dataframe. With two columns named the same thing, referencing one of the duplicate named columns returns an error that essentially says it doesn’t know which one you selected (Ambiguous). In SQL Server and other languages, the SQL engine wouldn’t let that query go through or it would automatically append a prefix or suffix to that field name.


1 Comment

Hristo Iliev Over a year ago

Spark SQL internally prefixes each output column with the table name; the prefix is just not visible. Selecting with a qualified column name works, e.g., select(col("table1.NAME")).

Vikrant Singh Rana · Accepted Answer · 2019-10-31 11:22:33Z

0

We can select the required fields explicitly in the SQL query, like this:

spark.sql("select A.ID, A.NAME, B.ACTUALNAME from table1 A LEFT OUTER JOIN table2 B ON A.NAME = B.NAME").show()

edited Oct 31, 2019 at 11:22 by Vikrant Singh Rana

answered Oct 31, 2019 at 9:34 by Giridhar

1 Comment

Nikhil Redij Over a year ago

Selecting the columns explicitly will work, but as I mentioned in the question, it is a last resort. The original DataFrame has 750+ columns, so I cannot write such a big query by hand.
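
For a wide table like the one described in this comment, one option that stays in SQL is the standard JOIN ... USING clause, which Spark SQL accepts; with select * the shared column then appears only once. This is a minimal sketch, assuming a local SparkSession and the sample data from the question. The object name UsingJoinSketch and the method joined are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this illustration.
object UsingJoinSketch {
  def joined(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Same sample data and view names as in the question.
    Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
      .createOrReplaceTempView("table1")
    Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
      .createOrReplaceTempView("table2")

    // USING (NAME) joins on the shared column; select * emits NAME once.
    spark.sql("select * from table1 LEFT OUTER JOIN table2 USING (NAME)")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this small example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("using-join-sketch")
      .getOrCreate()
    joined(spark).show()
    spark.stop()
  }
}
```

No column list is needed, so the query stays the same size however many columns the tables have.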
