I have two dataframes:

dataframe1

+----------+
|     DATE1|
+----------+
|2017-01-08|
|2017-10-10|
|2017-05-01|
+----------+

dataframe2

+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|  8/1/2017|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|10-10-2017|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|  1.5.2017|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+

Expected output

+------+----+----------+----------+----+--------+
|  NAME| SID|     DATE1|     DATE2|ROLL|  SCHOOL|
+------+----+----------+----------+----+--------+
| Sayam|22.0|2017-01-08|  7 1 2017|3223|  BHABHA|
|ADARSH| 2.0|2017-10-10|10.03.2017| 222|SUNSHINE|
| SADIM| 1.0|2017-05-01|  1/2/2017| 111|     DAV|
+------+----+----------+----------+----+--------+

I want to replace the DATE1 column in dataframe2 with the DATE1 column of dataframe1, and I need a generic solution.

Any help will be appreciated.

I have tried the withColumn method as follows:

dataframe2.withColumn(newColumnTransformInfo._1, dataframe1.col("DATE1").cast(DateType))

But I'm getting an error:

org.apache.spark.sql.AnalysisException: resolved attribute(s)
1 Answer

You cannot add a column from another dataframe that way.

What you can do is join the two dataframes and keep the column you want; both dataframes must share a common join column. If you do not have a common column but the data is in the same order, you can assign an increasing id to each dataframe and then join on it.

Here is a simple example for your case:

  //imports needed for this example
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{LongType, StructField, StructType}
  import spark.implicits._

  //Dummy data
  val df1 = Seq(
    "2017-01-08",
    "2017-10-10",
    "2017-05-01"
  ).toDF("DATE1")

  val df2 = Seq(
    ("Sayam", 22.0, "2017-01-08", "7 1 2017", 3223, "BHABHA"),
    ("ADARSH", 2.0, "2017-10-10", "10.03.2017", 222, "SUNSHINE"),
    ("SADIM", 1.0, "2017-05-01", "1/2/2017", 111, "DAV")
  ).toDF("NAME", "SID", "DATE1", "DATE2", "ROLL", "SCHOOL")

  //create a new dataframe1 with an added id column
  val rows1 = df1.rdd.zipWithIndex().map {
    case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
  }
  val dataframe1 = spark.createDataFrame(rows1,
    StructType(StructField("id", LongType, nullable = false) +: df1.schema.fields))

  //create a new dataframe2 with an added id column
  val rows2 = df2.rdd.zipWithIndex().map {
    case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
  }
  val dataframe2 = spark.createDataFrame(rows2,
    StructType(StructField("id", LongType, nullable = false) +: df2.schema.fields))

  //drop the old DATE1, join on id, and drop the helper column
  dataframe2.drop("DATE1")
    .join(dataframe1, "id")
    .drop("id").show()

Output:

+------+----+----------+----+--------+----------+
|  NAME| SID|     DATE2|ROLL|  SCHOOL|     DATE1|
+------+----+----------+----+--------+----------+
| Sayam|22.0|  7 1 2017|3223|  BHABHA|2017-01-08|
|ADARSH| 2.0|10.03.2017| 222|SUNSHINE|2017-10-10|
| SADIM| 1.0|  1/2/2017| 111|     DAV|2017-05-01|
+------+----+----------+----+--------+----------+
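If you prefer to stay entirely in the DataFrame API, the same row pairing can be sketched with row_number over a window instead of zipWithIndex. This is only a sketch under a strong assumption: Window.orderBy(lit(1)) imposes no meaningful ordering, so it is only safe when each dataframe's row order is already deterministic (e.g. small data read in a fixed order).

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

// Assign consecutive 1-based row numbers to each dataframe.
// Caveat: orderBy(lit(1)) gives no real ordering guarantee across
// partitions, so this sketch assumes the existing row order is stable.
val w = Window.orderBy(lit(1))
val d1 = df1.withColumn("id", row_number().over(w))
val d2 = df2.withColumn("id", row_number().over(w))

// Same pattern as before: drop the old DATE1, join on id, clean up.
d2.drop("DATE1")
  .join(d1, "id")
  .drop("id").show()
```

Note that monotonically_increasing_id is not a drop-in substitute here: its ids are increasing but not consecutive across partitions, so the two dataframes would not line up.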

Hope this helps!


1 Comment

Now it is working. I will let you know if there are any issues with other test cases.
