
DF1 - flat dataframe with data

+---------+--------+-------+                                                    
|FirstName|LastName| Device|
+---------+--------+-------+
|   Robert|Williams|android|
|    Maria|Sharpova| iphone|
+---------+--------+-------+

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Device: string (nullable = true)

DF2 - empty dataframe with same column names

+------+----+
|header|body|
+------+----+
+------+----+

root
 |-- header: struct (nullable = true)
 |    |-- FirstName: string (nullable = true)
 |    |-- LastName: string (nullable = true)
 |-- body: struct (nullable = true)
 |    |-- Device: string (nullable = true)

DF2 schema Code:

val schema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)
  ))),
  StructField("body", StructType(Array(
    StructField("Device", StringType)
  )))
))
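For reference, the empty DF2 shown above can be produced from this schema with an empty RDD (a minimal sketch; the local SparkSession setup is assumed for illustration, and `schema` is the StructType defined just above):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumed local session, for illustration only
val spark = SparkSession.builder.master("local[*]").appName("df2-schema").getOrCreate()

// Empty dataframe carrying the nested target schema
val df2 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
df2.printSchema
```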

DF2 with data from DF1 would be the final output.

I need to do this for many columns in a complex schema, and it has to be configurable. It also has to work without case classes.


APPROACH #1 - use schema.fields.map to map DF1 -> DF2?

APPROACH #2 - create a new DF and define data and schema?

APPROACH #3 - use zip and map transformations to build a 'select col as col' query.. not sure whether this works for a nested (StructType) schema
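For what APPROACH #2 might look like, a rough row-level sketch (this assumes the `schema` val defined above, an existing `spark` session, DF1 bound to a `df1` val, and DF1's column order FirstName, LastName, Device):

```scala
import org.apache.spark.sql.Row

// Rebuild each flat row as nested Rows matching the target schema,
// then attach that schema explicitly
val nestedRows = df1.rdd.map { r =>
  Row(Row(r.getString(0), r.getString(1)), Row(r.getString(2)))
}
val df2WithData = spark.createDataFrame(nestedRows, schema)
```

This works but hard-codes the column positions, which is why a schema-driven select is usually preferable.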

How would I go about doing that?

  • There is no such thing as copying one dataframe into another in Spark. Dataframes, like RDDs, are immutable; please spend some time understanding the importance of immutability in Spark, since it is both a basic and a crucial concept. The only way to combine values from two dataframes is through a join, although in this case you have neither a join condition nor common columns. Your problem is better expressed as transforming DF1 into the schema of DF2. One way to implement that is using struct, as nicely shown below by @mvasyliv Commented Mar 28, 2021 at 11:11

1 Answer

import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  ("Robert", "Williams", "android"),
  ("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")

// pack the flat columns into the two nested structs, then keep only those
val resDF = sourceDF
  .withColumn("header", struct(col("FirstName"), col("LastName")))
  .withColumn("body", struct(col("Device")))
  .select(col("header"), col("body"))

resDF.printSchema
//  root
//  |-- header: struct (nullable = false)
//  |    |-- FirstName: string (nullable = true)
//  |    |-- LastName: string (nullable = true)
//  |-- body: struct (nullable = false)
//  |    |-- Device: string (nullable = true)

resDF.show(false)
//  +------------------+---------+
//  |header            |body     |
//  +------------------+---------+
//  |[Robert, Williams]|[android]|
//  |[Maria, Sharpova] |[iphone] |
//  +------------------+---------+
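The hard-coded struct calls above cover the example, but the question also asks for a configurable version driven by the target schema. A minimal sketch of APPROACH #1 (the `nest` helper and its name are my own; it handles one level of nesting only and assumes every leaf field name exists as a flat column in the source):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[*]").appName("nest").getOrCreate()
import spark.implicits._

val sourceDF = Seq(
  ("Robert", "Williams", "android"),
  ("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")

val targetSchema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)
  ))),
  StructField("body", StructType(Array(
    StructField("Device", StringType)
  )))
))

// Build one column expression per top-level field of the target schema:
// struct fields are assembled from the matching flat source columns,
// plain fields are passed through unchanged.
def nest(source: DataFrame, target: StructType): DataFrame = {
  val cols = target.fields.map {
    case StructField(name, st: StructType, _, _) =>
      struct(st.fieldNames.map(col): _*).as(name)
    case StructField(name, _, _, _) =>
      col(name)
  }
  source.select(cols: _*)
}

val nestedDF = nest(sourceDF, targetSchema)
nestedDF.printSchema
```

Because the select list is derived from the StructType, swapping in a different target schema (e.g. loaded from a config file as JSON via `DataType.fromJson`) reshapes the output without code changes, and no case class is needed.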