I am writing a join query for two DataFrames. I need to join on a column that has the same name in both DataFrames. How can I write this as a SQL query?
val df1 = Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
val df2 = Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
I know we can do df3 = df1.join(df2, Seq("NAME")), where NAME is the common column. In that case df3 has only the ID, NAME, and ACTUALNAME columns.
If we do it from SQL, the query is select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME. The output DataFrame then has the columns ID, NAME, NAME, ACTUALNAME. How can I remove the extra NAME column that came from df2?
This does not work either: spark.sql("select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME").drop(df2("NAME"))
Is there a cleaner way to do this? Renaming the df2 columns is a last resort that I would rather avoid. In my scenario, writing SQL queries is easier than using the DataFrame API, so I am looking only for Spark SQL-specific answers.
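For reference, the two behaviours described above can be reproduced with the minimal sketch below. It assumes a local SparkSession. The object name JoinColumnsRepro, the helper columnLists, and the local[*] master are only illustrative choices and are not part of my real code. I pass "left_outer" to the DataFrame join so that it matches the SQL query.

```scala
import org.apache.spark.sql.SparkSession

object JoinColumnsRepro {
  // Hypothetical helper: returns the column names produced by each join style.
  def columnLists(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    val df1 = Seq((1, "har"), (2, "ron"), (3, "fred")).toDF("ID", "NAME")
    val df2 = Seq(("har", "HARRY"), ("ron", "RONALD")).toDF("NAME", "ACTUALNAME")
    df1.createOrReplaceTempView("table1")
    df2.createOrReplaceTempView("table2")

    // DataFrame API: joining on Seq("NAME") keeps a single NAME column.
    val viaApi = df1.join(df2, Seq("NAME"), "left_outer")

    // SQL with an ON condition: the NAME column from each table survives.
    val viaSql = spark.sql(
      "select * from table1 LEFT OUTER JOIN table2 ON table1.NAME = table2.NAME")

    (viaApi.columns.toSeq, viaSql.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("JoinColumnsRepro")
      .getOrCreate()

    val (apiCols, sqlCols) = columnLists(spark)
    println(apiCols.mkString(", "))
    println(sqlCols.mkString(", ")) // ID, NAME, NAME, ACTUALNAME

    spark.stop()
  }
}
```

The second printed line is the one with the duplicate NAME column that I want to avoid while staying in SQL.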