Exploding nested Struct in Spark dataframe

Question

I'm working through a Databricks example. The schema for the dataframe looks like:

> parquetDF.printSchema
root
|-- department: struct (nullable = true)
|    |-- id: string (nullable = true)
|    |-- name: string (nullable = true)
|-- employees: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- firstName: string (nullable = true)
|    |    |-- lastName: string (nullable = true)
|    |    |-- email: string (nullable = true)
|    |    |-- salary: integer (nullable = true)

In the example, they show how to explode the employees column into 4 additional columns:

val explodeDF = parquetDF.explode($"employees") { 
case Row(employee: Seq[Row]) => employee.map{ employee =>
  val firstName = employee(0).asInstanceOf[String]
  val lastName = employee(1).asInstanceOf[String]
  val email = employee(2).asInstanceOf[String]
  val salary = employee(3).asInstanceOf[Int]
  Employee(firstName, lastName, email, salary)
 }
}.cache()
display(explodeDF)

How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:

val explodeDF = parquetDF.select("department.id","department.name")
display(explodeDF)

If I try:

val explodeDF = parquetDF.explode($"department") { 
  case Row(dept: Seq[String]) => dept.map{dept => 
  val id = dept(0) 
  val name = dept(1)
  } 
}.cache()
display(explodeDF)

I get the warning and error:

<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
            case Row(dept: Seq[String]) => dept.map{dept => 
                           ^
<console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
  val explodeDF = parquetDF.explode($"department") { 
                                   ^

DHARIN PAREKH · Accepted Answer · 2019-01-30 03:37:14Z

51

In my opinion the most elegant solution is to star expand a Struct using a select operator as shown below:

var explodedDf2 = explodedDf.select("department.*","*")

https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
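A minimal sketch of the star expansion above, assuming a local SparkSession and a one-row DataFrame shaped like the schema in the question. The object name, the helper names, and the employee values are made up for illustration; the department values are the ones that appear in the error message quoted in the comments further down. Dropping the original struct afterwards is optional and is shown only because the flattened columns make it redundant.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StarExpandSketch {
  // Hypothetical case classes mirroring the schema printed in the question.
  case class Department(id: String, name: String)
  case class Employee(firstName: String, lastName: String, email: String, salary: Int)
  case class Record(department: Department, employees: Seq[Employee])

  // One made-up record, used only to have something to flatten.
  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Record(
        Department("789012", "Mechanical Engineering"),
        Seq(Employee("Jane", "Doe", "jane@example.com", 100000))
      )
    ).toDF()
  }

  // "department.*" promotes every field of the struct to a top-level column;
  // "*" keeps the existing columns; drop removes the now-redundant struct.
  def flattenDepartment(df: DataFrame): DataFrame =
    df.select("department.*", "*").drop("department")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("star-expand-sketch")
      .getOrCreate()

    val flat = flattenDepartment(sampleDF(spark))
    flat.printSchema() // top-level columns: id, name, employees

    spark.stop()
  }
}
```

The row count is unchanged by this select; only the column layout differs.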

Community · Accepted Answer · 2017-05-23 12:10:04Z

25

You could use something like this:

var explodeDeptDF = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDeptDF = explodeDeptDF.withColumn("name", explodeDeptDF("department.name"))

I arrived at this with your help and with the help of some related questions.

answered Sep 1, 2016 at 15:54

gsamaras


5 Comments

Feynman27 Over a year ago

A stage failure: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 1403, 10.81.214.49): scala.MatchError: [[789012,Mechanical Engineering]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

gsamaras Over a year ago

@Feynman27 does this help? It seems to match your attempt. I think the problem with my answer is that employees also has an element level (it is an array of structs), while department does not.

Feynman27 Over a year ago

Yeah, the employees example creates new rows, whereas the department example should only create two new columns.

Tagar Over a year ago

Related question: stackoverflow.com/questions/30008127/…

Saddle Point Over a year ago

Can we do this for all nested columns with renaming at once? For example, department.id -> inner_id, department.name -> inner_name, ...
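On the last comment's question about renaming every nested field at once: one possible approach, sketched here under assumptions, is to read the struct's field names from the DataFrame schema and build an aliased column for each. The helper `flattenWithPrefix`, the object name, and the simplified sample data (a `site` column standing in for the rest of the row) are hypothetical and not taken from the thread. The sketch also assumes that no column or field name contains a dot.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object PrefixedFlattenSketch {
  // Hypothetical, simplified stand-in for the question's data.
  case class Department(id: String, name: String)
  case class Record(department: Department, site: String)

  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Record(Department("789012", "Mechanical Engineering"), "HQ")).toDF()
  }

  // Hypothetical helper: replaces one struct column with one top-level
  // column per field, named prefix + fieldName, followed by the other columns.
  def flattenWithPrefix(df: DataFrame, structCol: String, prefix: String): DataFrame = {
    val fieldNames: Seq[String] = df.schema(structCol).dataType match {
      case st: StructType => st.fieldNames.toSeq
      case other =>
        throw new IllegalArgumentException(s"$structCol has type $other, not a struct")
    }
    // e.g. department.id -> inner_id
    val renamed: Seq[Column] =
      fieldNames.map(f => col(s"$structCol.$f").as(s"$prefix$f"))
    val others: Seq[Column] =
      df.columns.toSeq.filter(_ != structCol).map(c => col(c))
    df.select(renamed ++ others: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("prefixed-flatten-sketch")
      .getOrCreate()

    val flat = flattenWithPrefix(sampleDF(spark), "department", "inner_")
    flat.printSchema() // top-level columns: inner_id, inner_name, site

    spark.stop()
  }
}
```

Calling the helper once per struct column would cover several nested columns; nested structs inside structs are not handled by this sketch.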

Feynman27 · Accepted Answer · 2016-09-01 16:24:14Z

3

This seems to work (though maybe not the most elegant solution).

var explodeDF2 = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDF2 = explodeDF2.withColumn("name", explodeDF2("department.name"))
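The comments on the previous answer point out that the employees example creates new rows, while the department case should only add columns. A minimal sketch of that difference follows. It uses the built-in `explode` function from `org.apache.spark.sql.functions` rather than the `DataFrame.explode` method shown in the question, and the object name, helper names, and employee values are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object RowsVersusColumnsSketch {
  // Hypothetical case classes mirroring the schema printed in the question.
  case class Department(id: String, name: String)
  case class Employee(firstName: String, lastName: String, email: String, salary: Int)
  case class Record(department: Department, employees: Seq[Employee])

  // One made-up record holding two employees.
  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Record(
        Department("789012", "Mechanical Engineering"),
        Seq(
          Employee("Jane", "Doe", "jane@example.com", 100000),
          Employee("John", "Roe", "john@example.com", 90000)
        )
      )
    ).toDF()
  }

  // Array of structs: explode emits one output row per array element,
  // then "employee.*" spreads that element's fields into columns.
  def explodeEmployees(df: DataFrame): DataFrame =
    df.select(col("department"), explode(col("employees")).as("employee"))
      .select(col("department"), col("employee.*"))

  // Plain struct: field access adds columns and leaves the row count alone.
  def addDepartmentColumns(df: DataFrame): DataFrame =
    df.withColumn("id", col("department.id"))
      .withColumn("name", col("department.name"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rows-versus-columns-sketch")
      .getOrCreate()

    val df = sampleDF(spark)
    explodeEmployees(df).show(false)     // two rows, one per employee
    addDepartmentColumns(df).show(false) // still one row, with id and name added

    spark.stop()
  }
}
```

Note that the `explode` function omits rows whose array is null or empty; `explode_outer` keeps them.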

1 Comment

Davos Over a year ago

You could chain the calls:

val explodeDF2 = explodeDF.withColumn("id", explodeDF("department.id")).withColumn("name", explodeDF("department.name"))
