
I have the following spark dataset with a nested struct type:

-- _1: struct (nullable = false)
 |    |-- _1: struct (nullable = false)
 |    |    |-- _1: struct (nullable = false)
 |    |    |    |-- ni_number: string (nullable = true)
 |    |    |    |-- national_registration_number: string (nullable = true)
 |    |    |    |-- id_issuing_country: string (nullable = true)
 |    |    |    |-- doc_type_name: string (nullable = true)
 |    |    |    |-- brand: string (nullable = true)
 |    |    |    |-- company_name: string (nullable = true)
 |    |    |-- _2: struct (nullable = true)
 |    |    |    |-- municipality: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- brand_name: string (nullable = true)
 |    |    |-- puk: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- customer_servicesegment: string (nullable = true)
 |    |-- customer_category: string (nullable = true)

My aim here is to do some flattening at the bottom of the struct type and end up with this target schema:

-- _1: struct (nullable = false)
|    |-- _1: struct (nullable = false)
|    |    |-- _1: struct (nullable = false)
|    |    |    |-- ni_number: string (nullable = true)
|    |    |    |-- national_registration_number: string (nullable = true)
|    |    |    |-- id_issuing_country: string (nullable = true)
|    |    |    |-- doc_type_name: string (nullable = true)
|    |    |    |-- brand: string (nullable = true)
|    |    |    |-- company_name: string (nullable = true)
|    |    |-- _2: struct (nullable = true)
|    |    |    |-- municipality: string (nullable = true)
|    |    |    |-- country: string (nullable = true)
|    |-- _2: struct (nullable = true)
|    |    |-- brand_name: string (nullable = true)
|    |    |-- puk: string (nullable = true)
|    |-- _3: struct (nullable = true)
|    |    |-- customer_servicesegment: string (nullable = true)
|    |    |-- customer_category: string (nullable = true)

The struct holding the columns (customer_servicesegment, customer_category) should end up at the same level as the one holding (brand_name, puk).

I believe the explode utility from Spark SQL could be used here, but I don't know where to apply it.

Any help with this would be appreciated.
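
To make this reproducible, here is a sketch of how a dataset with that exact shape could be built from nested tuples of case classes. The case class names and values below are made up purely for illustration; in spark-shell the SparkSession setup can be skipped.

import org.apache.spark.sql.SparkSession

case class Doc(ni_number: String, national_registration_number: String, id_issuing_country: String,
               doc_type_name: String, brand: String, company_name: String)
case class Address(municipality: String, country: String)
case class Sim(brand_name: String, puk: String)
case class Customer(customer_servicesegment: String, customer_category: String)

val spark = SparkSession.builder.master("local[*]").appName("flatten-structs").getOrCreate()
import spark.implicits._

// One row of nested tuples: (((Doc, Address), Sim), Customer) gives the schema shown above
val df = Seq(
  (((Doc("NI-001", "NRN-001", "BE", "passport", "brandX", "Acme Ltd"),
     Address("Brussels", "BE")),
    Sim("brandX", "12345678")),
   Customer("residential", "B2C"))
).toDF()

df.printSchema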

1 Answer

If you have Spark 3.1+, you can use the withField method on Column to update the struct _1 like this:

import org.apache.spark.sql.functions.col
val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")

This adds the column _2 as a new field named _3 inside the struct _1, then drops the top-level column _2.
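
You can verify the result against the target schema (for instance with the sample data sketched in the question):

df2.printSchema
// the root struct _2 now appears as _1._3, at the same level as the struct holding brand_name and puk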


For older versions, you need to reconstruct the struct column _1:

import org.apache.spark.sql.functions.{col, struct}

val df2 = df.withColumn(
  "_1",
  // rebuild _1 from its original fields and append the top-level _2 as a new field _3
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")
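
As a quick sanity check, the moved columns should now be reachable under _1._3 with either approach:

df2.select(col("_1._3.customer_servicesegment"), col("_1._2.puk")).show()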

2 Comments

I have Spark 2.4 @blackbishop
@scalacode then you'll need to recreate the struct _1 when updating it. Please see my edit.
