
DF1 - flat dataframe with data

+---------+--------+-------+                                                    
|FirstName|LastName| Device|
+---------+--------+-------+
|   Robert|Williams|android|
|    Maria|Sharpova| iphone|
+---------+--------+-------+

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Device: string (nullable = true)

DF2 - empty dataframe with same column names

+------+----+
|header|body|
+------+----+
+------+----+

root
 |-- header: struct (nullable = true)
 |    |-- FirstName: string (nullable = true)
 |    |-- LastName: string (nullable = true)
 |-- body: struct (nullable = true)
 |    |-- Device: string (nullable = true)

DF2 schema Code:

val schema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)
  ))),
  StructField("body", StructType(Array(
    StructField("Device", StringType)
  )))
))
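For reference, the empty DF2 shown above can be produced from this schema with an empty RDD (a minimal sketch; the local SparkSession setup is assumed for illustration, and `schema` is the StructType defined just above):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumed local session, for illustration only
val spark = SparkSession.builder.master("local[*]").appName("df2-schema").getOrCreate()

// Empty dataframe carrying the nested target schema
val df2 = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
df2.printSchema
```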

DF2 with data from DF1 would be the final output.

I need to do this for many columns in a complex schema, and it has to be configurable. It also has to work without case classes.


APPROACH #1 - use schema.fields.map to map DF1 -> DF2?

APPROACH #2 - create a new DF and define data and schema?

APPROACH #3 - use zip and map transformations to build a 'select col as col' query.. not sure whether this works for a nested (StructType) schema
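For what APPROACH #2 might look like, a rough row-level sketch (this assumes the `schema` val defined above, an existing `spark` session, DF1 bound to a `df1` val, and DF1's column order FirstName, LastName, Device):

```scala
import org.apache.spark.sql.Row

// Rebuild each flat row as nested Rows matching the target schema,
// then attach that schema explicitly
val nestedRows = df1.rdd.map { r =>
  Row(Row(r.getString(0), r.getString(1)), Row(r.getString(2)))
}
val df2WithData = spark.createDataFrame(nestedRows, schema)
```

This works but hard-codes the column positions, which is why a schema-driven select is usually preferable.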

How would I go about doing that?

  • There is no such thing as copying one dataframe into another in Spark. Dataframes, like RDDs, are immutable; please spend some time understanding the importance of immutability in Spark, since it is both a basic and a crucial concept. The only way to combine values from two dataframes is through a join, although in this case you have neither a join condition nor common columns. Your problem is better expressed as transforming DF1 into the schema of DF2. One way to implement that is using struct, as nicely shown below by @mvasyliv Commented Mar 28, 2021 at 11:11

1 Answer

import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  ("Robert", "Williams", "android"),
  ("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")

// pack the flat columns into the two nested structs, then keep only those
val resDF = sourceDF
  .withColumn("header", struct(col("FirstName"), col("LastName")))
  .withColumn("body", struct(col("Device")))
  .select(col("header"), col("body"))

resDF.printSchema
//  root
//  |-- header: struct (nullable = false)
//  |    |-- FirstName: string (nullable = true)
//  |    |-- LastName: string (nullable = true)
//  |-- body: struct (nullable = false)
//  |    |-- Device: string (nullable = true)

resDF.show(false)
//  +------------------+---------+
//  |header            |body     |
//  +------------------+---------+
//  |[Robert, Williams]|[android]|
//  |[Maria, Sharpova] |[iphone] |
//  +------------------+---------+
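The hard-coded struct calls above cover the example, but the question also asks for a configurable version driven by the target schema. A minimal sketch of APPROACH #1 (the `nest` helper and its name are my own; it handles one level of nesting only and assumes every leaf field name exists as a flat column in the source):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.master("local[*]").appName("nest").getOrCreate()
import spark.implicits._

val sourceDF = Seq(
  ("Robert", "Williams", "android"),
  ("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")

val targetSchema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)
  ))),
  StructField("body", StructType(Array(
    StructField("Device", StringType)
  )))
))

// Build one column expression per top-level field of the target schema:
// struct fields are assembled from the matching flat source columns,
// plain fields are passed through unchanged.
def nest(source: DataFrame, target: StructType): DataFrame = {
  val cols = target.fields.map {
    case StructField(name, st: StructType, _, _) =>
      struct(st.fieldNames.map(col): _*).as(name)
    case StructField(name, _, _, _) =>
      col(name)
  }
  source.select(cols: _*)
}

val nestedDF = nest(sourceDF, targetSchema)
nestedDF.printSchema
```

Because the select list is derived from the StructType, swapping in a different target schema (e.g. loaded from a config file as JSON via `DataType.fromJson`) reshapes the output without code changes, and no case class is needed.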