I have two dataframes:

dataframe1

+----------+
|     DATE1|
+----------+
|2017-01-08|
|2017-10-10|
|2017-05-01|
+----------+

dataframe2

+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|  8/1/2017|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|10-10-2017|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|  1.5.2017|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+

Expected output

+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|2017-01-08|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|2017-10-10|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|2017-05-01|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+

I want to replace the DATE1 column in dataframe2 with the DATE1 column of dataframe1, and I need a generic solution.

Any help will be appreciated.

I have tried the withColumn method as follows:

dataframe2.withColumn(newColumnTransformInfo._1, dataframe1.col("DATE1").cast(DateType))

But I'm getting an error:

org.apache.spark.sql.AnalysisException: resolved attribute(s)
1 Answer

You cannot add a column from another dataframe that way.

What you can do is join the two dataframes and keep the column you want; both dataframes must share a common join column. If you do not have a common column but the data is in the same order, you can assign an increasing id to each dataframe and then join on it.

Here is a simple example for your case:

  //imports needed for this example
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{LongType, StructField, StructType}
  import spark.implicits._

  //Dummy data
  val df1 = Seq(
    "2017-01-08",
    "2017-10-10",
    "2017-05-01"
  ).toDF("DATE1")

  val df2 = Seq(
    ("Sayam", 22.0, "2017-01-08", "7 1 2017", 3223, "BHABHA"),
    ("ADARSH", 2.0, "2017-10-10", "10.03.2017", 222, "SUNSHINE"),
    ("SADIM", 1.0, "2017-05-01", "1/2/2017", 111, "DAV")
  ).toDF("NAME", "SID", "DATE1", "DATE2", "ROLL", "SCHOOL")

  //create a new dataframe1 with an added id column
  val rows1 = df1.rdd.zipWithIndex().map {
    case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
  }
  val dataframe1 = spark.createDataFrame(rows1,
    StructType(StructField("id", LongType, nullable = false) +: df1.schema.fields))

  //create a new dataframe2 with an added id column
  val rows2 = df2.rdd.zipWithIndex().map {
    case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
  }
  val dataframe2 = spark.createDataFrame(rows2,
    StructType(StructField("id", LongType, nullable = false) +: df2.schema.fields))

  //drop the old DATE1, join on id, and drop the helper column
  dataframe2.drop("DATE1")
    .join(dataframe1, "id")
    .drop("id").show()

Output:

+------+----+----------+----+--------+----------+
|  NAME| SID|     DATE2|ROLL|  SCHOOL|     DATE1|
+------+----+----------+----+--------+----------+
| Sayam|22.0|  7 1 2017|3223|  BHABHA|2017-01-08|
|ADARSH| 2.0|10.03.2017| 222|SUNSHINE|2017-10-10|
| SADIM| 1.0|  1/2/2017| 111|     DAV|2017-05-01|
+------+----+----------+----+--------+----------+
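If you prefer to stay entirely in the DataFrame API, the same row pairing can be sketched with row_number over a window instead of zipWithIndex. This is only a sketch under a strong assumption: Window.orderBy(lit(1)) imposes no meaningful ordering, so it is only safe when each dataframe's row order is already deterministic (e.g. small data read in a fixed order).

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

// Assign consecutive 1-based row numbers to each dataframe.
// Caveat: orderBy(lit(1)) gives no real ordering guarantee across
// partitions, so this sketch assumes the existing row order is stable.
val w = Window.orderBy(lit(1))
val d1 = df1.withColumn("id", row_number().over(w))
val d2 = df2.withColumn("id", row_number().over(w))

// Same pattern as before: drop the old DATE1, join on id, clean up.
d2.drop("DATE1")
  .join(d1, "id")
  .drop("id").show()
```

Note that monotonically_increasing_id is not a drop-in substitute here: its ids are increasing but not consecutive across partitions, so the two dataframes would not line up.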

Hope this helps!


1 Comment

Now it is working. I will let you know if there are any issues with other test cases.
