
I have the following spark dataset with a nested struct type:

-- _1: struct (nullable = false)
 |    |-- _1: struct (nullable = false)
 |    |    |-- _1: struct (nullable = false)
 |    |    |    |-- ni_number: string (nullable = true)
 |    |    |    |-- national_registration_number: string (nullable = true)
 |    |    |    |-- id_issuing_country: string (nullable = true)
 |    |    |    |-- doc_type_name: string (nullable = true)
 |    |    |    |-- brand: string (nullable = true)
 |    |    |    |-- company_name: string (nullable = true)
 |    |    |-- _2: struct (nullable = true)
 |    |    |    |-- municipality: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- brand_name: string (nullable = true)
 |    |    |-- puk: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- customer_servicesegment: string (nullable = true)
 |    |-- customer_category: string (nullable = true)

My aim here is to do some flattening at the bottom of the struct type and end up with this target schema:

-- _1: struct (nullable = false)
|    |-- _1: struct (nullable = false)
|    |    |-- _1: struct (nullable = false)
|    |    |    |-- ni_number: string (nullable = true)
|    |    |    |-- national_registration_number: string (nullable = true)
|    |    |    |-- id_issuing_country: string (nullable = true)
|    |    |    |-- doc_type_name: string (nullable = true)
|    |    |    |-- brand: string (nullable = true)
|    |    |    |-- company_name: string (nullable = true)
|    |    |-- _2: struct (nullable = true)
|    |    |    |-- municipality: string (nullable = true)
|    |    |    |-- country: string (nullable = true)
|    |-- _2: struct (nullable = true)
|    |    |-- brand_name: string (nullable = true)
|    |    |-- puk: string (nullable = true)
|    |-- _3: struct (nullable = true)
|    |    |-- customer_servicesegment: string (nullable = true)
|    |    |-- customer_category: string (nullable = true)

The struct holding the columns (customer_servicesegment, customer_category) should end up at the same level as the one holding (brand_name, puk).

I believe the explode utility from Spark SQL could be used here, but I don't know where to apply it.

Any help with this would be appreciated.
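
To make this reproducible, here is a sketch of how a dataset with that exact shape could be built from nested tuples of case classes. The case class names and values below are made up purely for illustration; in spark-shell the SparkSession setup can be skipped.

import org.apache.spark.sql.SparkSession

case class Doc(ni_number: String, national_registration_number: String, id_issuing_country: String,
               doc_type_name: String, brand: String, company_name: String)
case class Address(municipality: String, country: String)
case class Sim(brand_name: String, puk: String)
case class Customer(customer_servicesegment: String, customer_category: String)

val spark = SparkSession.builder.master("local[*]").appName("flatten-structs").getOrCreate()
import spark.implicits._

// One row of nested tuples: (((Doc, Address), Sim), Customer) gives the schema shown above
val df = Seq(
  (((Doc("NI-001", "NRN-001", "BE", "passport", "brandX", "Acme Ltd"),
     Address("Brussels", "BE")),
    Sim("brandX", "12345678")),
   Customer("residential", "B2C"))
).toDF()

df.printSchema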

1 Answer

If you have Spark 3.1+, you can use the withField method on Column to update the struct _1 like this:

import org.apache.spark.sql.functions.col
val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")

This adds the column _2 as a new field named _3 inside the struct _1, then drops the top-level column _2.
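
You can verify the result against the target schema (for instance with the sample data sketched in the question):

df2.printSchema
// the root struct _2 now appears as _1._3, at the same level as the struct holding brand_name and puk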


For older versions, you need to reconstruct the struct column _1:

import org.apache.spark.sql.functions.{col, struct}

val df2 = df.withColumn(
  "_1",
  // rebuild _1 from its original fields and append the top-level _2 as a new field _3
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")
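
As a quick sanity check, the moved columns should now be reachable under _1._3 with either approach:

df2.select(col("_1._3.customer_servicesegment"), col("_1._2.puk")).show()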

2 Comments

I have Spark 2.4 @blackbishop
@scalacode then you'll need to recreate the struct _1 when updating it. Please see my edit.
