74

I have two dataframes with the following columns:

df1.columns
//  Array(ts, id, X1, X2)

and

df2.columns
//  Array(ts, id, Y1, Y2)

After I do

val df_combined = df1.join(df2, Seq(ts,id))

I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?

1
  • If you defined the join columns as a Seq of strings (the column names), then the columns should not be duplicated. See my answer below. Commented Apr 20, 2017 at 2:16

10 Answers

62

The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.

Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.

Here is the left dataframe:

val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))

val left = llist.toDF("firstname","lastname","date","duration")

left.show()

/*
+---------+--------+----------+--------+
|firstname|lastname|      date|duration|
+---------+--------+----------+--------+
|      bob|       b|2015-01-13|       4|
|    alice|       a|2015-04-23|      10|
+---------+--------+----------+--------+
*/

Here is the right dataframe:

val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")

right.show()

/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
|    alice|       a|   100|
|      bob|       b|    23|
+---------+--------+------+
*/

Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").

The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:

left.join(right, left("firstname")===right("firstname") &&
                 left("lastname")===right("lastname")).show

/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname|      date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
|      bob|       b|2015-01-13|       4|      bob|       b|    23|
|    alice|       a|2015-04-23|      10|    alice|       a|   100|
+---------+--------+----------+--------+---------+--------+------+
*/

The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:

left.join(right, Seq("firstname", "lastname")).show

/*
+---------+--------+----------+--------+------+
|firstname|lastname|      date|duration|upload|
+---------+--------+----------+--------+------+
|      bob|       b|2015-01-13|       4|    23|
|    alice|       a|2015-04-23|      10|   100|
+---------+--------+----------+--------+------+
*/
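
Several comments ask what to do when the key columns are named differently in the two data frames. One possible approach, sketched here under assumptions, is to rename the keys on one side first and then join on the shared names. The rightOther data frame and its first/last column names are hypothetical and not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-renamed-keys").getOrCreate()
import spark.implicits._

val left = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23", 10))
  .toDF("firstname", "lastname", "date", "duration")

// Hypothetical variant of `right` whose key columns carry different names.
val rightOther = Seq(("alice", "a", 100), ("bob", "b", 23))
  .toDF("first", "last", "upload")

// Rename the keys so both sides share the same names, then join on those names.
val joined = left.join(
  rightOther
    .withColumnRenamed("first", "firstname")
    .withColumnRenamed("last", "lastname"),
  Seq("firstname", "lastname"))

// Each key column should appear only once in the result.
joined.printSchema()
```

The renaming only touches the data frame passed to the join, so rightOther itself is left unchanged.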

9 Comments

Actually, the output DF does have duplicates when using the following: val joined = sampledDF.join(idsDF, idColumns, "inner"), where idColumns is a Seq[String] containing the join columns.
I don't think this works if the names of the columns in the two datasets are different.
What should I do when, out of 4 join expressions, 2 use different column names in the two tables but 2 refer to the same column names in both? Rename?
This will not work if the columns have null values and it's a null comparison.
What if the common columns on which we join have different names in different dataframes?
35

This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:

SELECT * FROM a JOIN b ON joinExprs

If you want to ignore the duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access them through the parent DataFrames:

val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???

a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent 
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))

or use aliases:

// As of now, aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")

For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:

val usingColumns: Seq[String] = ???

a.join(b, usingColumns)

or a single string:

val usingColumn: String = ???

a.join(b, usingColumn)

which keeps only one copy of the columns used in the join condition.
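
To make the placeholders above concrete, here is a minimal sketch that compares the expression form, the drop-via-parent form, and the usingColumns form. The sample data and the bar column are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-forms").getOrCreate()
import spark.implicits._

// Hypothetical sample data: both frames share the key column "id".
val a = Seq((1, "x"), (2, "y")).toDF("id", "bar")
val b = Seq((1, "p"), (3, "q")).toDF("id", "foo")

// Join on an expression: both "id" columns are kept.
val byExpr = a.join(b, a("id") === b("id"))

// Drop the copy coming from `b` by referring to it through its parent DataFrame.
val byExprDropped = byExpr.drop(b("id"))

// Join on column names: only one "id" column is kept.
val byName = a.join(b, Seq("id"))

println(byExpr.columns.mkString(", "))
println(byExprDropped.columns.mkString(", "))
println(byName.columns.mkString(", "))
```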

6 Comments

Instead of select, can I drop the duplicate column?
Yes, but only via parents not with aliases.
How about an outer join? Any rows without a match will have a null in one of the table's key columns, but you don't know ahead of time which one to drop. Is there a way to handle that case elegantly?
@Darryl coalesce and drop both.
In the joined dataframe, I want the column name to be something other than the input table's column name. Is there any way to do this? For example: instead of having the column name "foo", which is taken from the "b" dataframe, I want the column name to be "column_new". Something like this SQL query: "select b.foo as column_new"
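
The last two questions in these comments (outer joins and renaming a selected column) can probably be handled in a single select. The following is a sketch under assumptions, not a definitive answer: the sample data is hypothetical, coalesce merges the two key columns after a full outer join, and as gives b("foo") the name column_new from the comment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

val spark = SparkSession.builder().master("local[*]").appName("outer-join-coalesce").getOrCreate()
import spark.implicits._

// Hypothetical sample data: id 2 exists only in `a`, id 3 only in `b`.
val a = Seq((1, "x"), (2, "y")).toDF("id", "bar")
val b = Seq((1, "p"), (3, "q")).toDF("id", "foo")

val merged = a.join(b, a("id") === b("id"), "fullouter")
  .select(
    coalesce(a("id"), b("id")).as("id"), // take whichever side has a non-null key
    a("bar"),
    b("foo").as("column_new"))           // rename, like "select b.foo as column_new"

merged.show()
```

Selecting explicitly, rather than dropping, avoids having to decide ahead of time which key column to discard.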
10

I was stuck on this for a while, and only recently came up with a solution that is quite easy.

Say a is

scala> val a  = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]

scala> a.show
+---+----+
|key|vala|
+---+----+
|  a|   1|
|  b|   2|
+---+----+
and 
scala> val b  = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]

scala> b.show
+---+----+
|key|valb|
+---+----+
|  a|   1|
+---+----+

and I can do this to select only the value in dataframe a:

scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
|  a|   1|
|  b|   2|
+---+----+

2 Comments

what does " a.columns.map(a(_)) : _* " do?
@Nick01 it selects the columns from 'a' data frame (see the answer here: stackoverflow.com/questions/39909863/…)
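
To spell out what a.columns.map(a(_)) : _* does, here is a small sketch that performs the same select in two explicit steps. The intermediate name colsOfA is introduced only for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("select-parent-columns").getOrCreate()
import spark.implicits._

val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
val b = Seq(("a", 1)).toDF("key", "valb")

// a.columns is Array("key", "vala"); a(name) turns each name into a Column bound to `a`.
val colsOfA: Array[Column] = a.columns.map(name => a(name))

// `: _*` expands the array into varargs, so this is the same as
// select(a("key"), a("vala")).
val onlyA = a.join(b, a("key") === b("key"), "left").select(colsOfA: _*)

onlyA.show()
```

Because every Column is resolved through a, the ambiguous key column from b is never selected.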
7

You can simply use this

df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")

Here TYPE-OF-JOIN can be

  • left
  • right
  • inner
  • fullouter

For example, I have two dataframes like this:

// df1
word   count1
w1     10   
w2     15  
w3     20

// df2
word   count2
w1     100   
w2     150  
w5     200

If you do a fullouter join, the result looks like this:

df1.join(df2, Seq("word"),"fullouter").show()

word   count1  count2
w1     10      100
w2     15      150
w3     20      null
w5     null    200

5 Comments

How do you add a condition here, say col("count1") > 10?
I think you can do something like df1.join(df2, Seq("word"),"fullouter").filter($"count1">10).show(). Let me know if it doesn't work.
How do you import Seq?
I don't think you need to import anything. If it doesn't work for you, then try import spark.implicits._
How can this work for null-safe joins?
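
On the null-safe question: joining on a Seq of column names uses ordinary equality, so rows whose key is null do not match each other. One possible workaround, sketched here with hypothetical data, is to join on the null-safe equality operator <=> and then drop the second copy of the key through its parent data frame.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-safe-join").getOrCreate()
import spark.implicits._

// Hypothetical data: each frame contains one row whose key is null.
val df1 = Seq[(Option[String], Int)]((Some("w1"), 10), (None, 15)).toDF("word", "count1")
val df2 = Seq[(Option[String], Int)]((Some("w1"), 100), (None, 150)).toDF("word", "count2")

// Name-based join: the null keys do not match, only the "w1" row survives.
val plain = df1.join(df2, Seq("word"))

// Null-safe join: <=> treats two nulls as equal; the duplicate key is dropped afterwards.
val nullSafe = df1.join(df2, df1("word") <=> df2("word")).drop(df2("word"))

plain.show()
nullSafe.show()
```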
7

Try this:

val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))


2

This is normal SQL behavior. What I do to handle it:

  • Drop or Rename source columns
  • Do the join
  • Drop renamed column if any

Here I am replacing "fullname" column:

Some code in Java:

this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
    .drop("fullname")
    .registerTempTable("data_original");

this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
    .registerTempTable("data_v2");

this
    .sqlContext
    .sql(etlQuery)
    .repartition(1)
    .write()
    .mode(SaveMode.Overwrite)
    .parquet(outputPath);

Where the query is:

SELECT
    d.*,
   concat_ws('_', product_name, product_module, name) AS fullname
FROM
    {table_source} d
LEFT OUTER JOIN
    {table_updates} u ON u.id = d.id

I believe dropping a column from the list like this is something you can only do with Spark, and it is very helpful!


2

If you are using Spark SQL and want to achieve the same thing, you can use the USING clause in the join query.

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")

df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

spark.sql("select * from table1  inner join  table2  using (c1, c2)").show(false)

/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1  |4  |3  |3  |
|5  |2  |4  |4  |
|7  |4  |5  |10 |
+---+---+---+---+
*/


1

Inner join is the default join type in Spark. Below is the simple syntax for it:

leftDF.join(rightDF, "common_col_name")

For other join types, use the following syntax:

leftDF.join(rightDF, Seq("common_col_1", "common_col_2"), "join type")

If the column names are not common, then:

leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "join type")

1 Comment

OP is asking about dropping duplicate columns after join I think you missed that point.
1

Best practice is to make the column names different in both DataFrames before joining them, and then drop accordingly.

df1.columns = [id, age, income]
df2.columns = [id, age_group]

df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')

will return an error for duplicate columns.

Try this instead:

df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')


0

After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.

Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.

Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.write.saveAsTable("...", format="parquet", mode="overwrite", path="...")

Where deDupeDfCols is defined as:

def deDupeDfCols(df, separator=''):
    newcols = []

    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break

    return df.toDF(*newcols)

With the '_' separator used in the call above, the resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id_2', 'Date', 'Description_2'].

Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.
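
For readers who want the same idea in Scala, here is a rough sketch of the renaming logic as a pure function over column names. The name dedupeNames is hypothetical. Unlike the Python loop, which stops trying at suffix 999, this version keeps searching until it finds a free name.

```scala
// Rename repeated column names by appending the first free numeric suffix,
// starting at 2, mirroring the Python function above.
def dedupeNames(names: Seq[String], separator: String = ""): Seq[String] =
  names.foldLeft(Vector.empty[String]) { (seen, name) =>
    if (!seen.contains(name)) seen :+ name
    else {
      // Lazily try name2, name3, ... and take the first unused candidate.
      val candidate = Iterator.from(2)
        .map(i => name + separator + i)
        .find(c => !seen.contains(c))
        .get
      seen :+ candidate
    }
  }

val renamed = dedupeNames(
  Seq("Id", "Name", "DateId", "Description", "Id", "Date", "Description"), "_")

println(renamed.mkString(", "))
```

Assuming df is the joined DataFrame, the new names could then be applied with df.toDF(dedupeNames(df.columns, "_"): _*).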

