
I have a dataframe as

+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+
root
 |-- family_name: string (nullable = true)
 |-- id: string (nullable = true)

I wish to convert the column family_name to an array of named structs:

`family_name` array<struct<f_name:string,l_name:string,status:string,ph_no:string>>
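For reference, a sketch of that target schema written out as a Spark StructType (field names taken from the DDL above; nullability settings here are an assumption):

```scala
import org.apache.spark.sql.types._

// array<struct<f_name:string,l_name:string,status:string,ph_no:string>>
val famSchema = ArrayType(
  StructType(Seq(
    StructField("f_name", StringType),
    StructField("l_name", StringType),
    StructField("status", StringType),
    StructField("ph_no", StringType)
  ))
)
```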

I'm able to convert family_name to an array of strings as shown below:

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.{ArrayType, StringType}
import spark.implicits._

val sch = ArrayType(ArrayType(StringType))

val fam_array = data
  .withColumn("family_name_clean", regexp_replace($"family_name", "\\[\\[", "["))
  .withColumn("family_name_clean_clean1", regexp_replace($"family_name_clean", "\\]\\]", "]"))
  .withColumn("ar", toArray($"family_name_clean_clean1")) // toArray: custom UDF (definition not shown) that splits the string into an array
  //.withColumn("ar1", from_json($"ar", sch))

fam_array.show(false)
fam_array.printSchema()

+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|family_name                                                    |id |family_name_clean                                             |family_name_clean_clean1                                     |ar                                                                     |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]|[[John,  Doe,  Married,  999-999-9999], [Jane,  Doe,  Married, Wife, ]]|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|[Tom, Riddle, Single, 888-888-8888]]                          |[Tom, Riddle, Single, 888-888-8888]                          |[[Tom,  Riddle,  Single,  888-888-8888]]                               |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
root
 |-- family_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- family_name_clean: string (nullable = true)
 |-- family_name_clean_clean1: string (nullable = true)
 |-- ar: array (nullable = true)
 |    |-- element: string (containsNull = true)

 

sch is a schema variable of the desired type.

How do I convert the column ar to array<struct<>>?

EDIT:

I'm using Spark 2.3.2

1 Answer

To create an array of structs from an array of arrays of strings, you can use the struct function to build a struct from a list of columns, combined with the element_at function to extract the element at a specific index of an array.

To solve your specific problem, as you correctly stated you need to do two things:

  • First, transform your string to an array of arrays of strings
  • Then, use this array of arrays of strings to build your struct

In Spark 3.0 and greater

In Spark 3.0 and later, we can perform all those steps using Spark built-in functions.

For the first step, I would do as follows:

  • first, remove [[ and ]] from the family_name string using the regexp_replace function
  • then, create the first array level by splitting this string with the split function
  • then, create the second array level by splitting each element of the previous array using the transform and split functions

And for the second step, use the struct function to build a struct, picking elements from the arrays with the element_at function.

Thus, the complete code for Spark 3.0 and greater would be as follows, with data as the input dataframe:

import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform}

val result = data
  .withColumn(
    "family_name", 
    transform( 
      split( // first level split
        regexp_replace(col("family_name"), "\\[\\[|]]", ""), // remove [[ and ]]
        "],\\["
      ), 
      x => split(x, ",") // split for each element in first level array
    )
  )
  .withColumn("family_name", transform(col("family_name"), x => struct(
    element_at(x, 1).as("f_name"), // element_at index starts at 1
    element_at(x, 2).as("l_name"),
    element_at(x, 3).as("status"),
    element_at(x, -1).as("ph_no") // negative index: last element of the array
  )))

In Spark 2.X

Using Spark 2.X, we have to rely on a user-defined function (UDF). First, we need to define a case class that represents our struct:

case class FamilyName(
  f_name: String, 
  l_name: String, 
  status: String, 
  ph_no: String
)

Then, we define our user-defined function and apply it to our input dataframe:

import org.apache.spark.sql.functions.{col, udf}

val extract_array = udf((familyName: String) => familyName
  .replaceAll("\\[\\[|]]", "") // remove leading [[ and trailing ]]
  .split("],\\[")              // split into one string per family member
  .map(member => {
    val fields = member.split(",", -1) // limit -1 keeps trailing empty fields
    FamilyName(
      f_name = fields(0),
      l_name = fields(1),
      status = fields(2),
      ph_no = fields(fields.length - 1)
    )
  })
)
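The parsing logic inside the UDF is plain Scala, so it can be sanity-checked without a Spark session. A minimal sketch (the helper name parseFamilyNames is mine, not from the answer):

```scala
case class FamilyName(
  f_name: String,
  l_name: String,
  status: String,
  ph_no: String
)

// Same string-parsing logic as the UDF body above
def parseFamilyNames(familyName: String): Array[FamilyName] =
  familyName
    .replaceAll("\\[\\[|]]", "") // remove leading [[ and trailing ]]
    .split("],\\[")              // one string per family member
    .map { member =>
      val fields = member.split(",", -1) // limit -1 keeps trailing empty fields
      FamilyName(
        f_name = fields(0),
        l_name = fields(1),
        status = fields(2),
        ph_no = fields(fields.length - 1)
      )
    }

val parsed = parseFamilyNames("[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]")
// parsed(0) == FamilyName("John", " Doe", " Married", " 999-999-9999")
// parsed(1).ph_no == "" (the trailing empty field is kept by the -1 limit)
```

Note that the leading spaces after each comma are preserved, which is why they also show up in the result dataframe below.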

val result = data.withColumn("family_name", extract_array(col("family_name")))

Result

If you have the following data dataframe:

+---------------------------------------------------------------+---+
|family_name                                                    |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]]                          |id2|
+---------------------------------------------------------------+---+

You get the following result dataframe:

+-----------------------------------------------------------------+---+
|family_name                                                      |id |
+-----------------------------------------------------------------+---+
|[{John,  Doe,  Married,  999-999-9999}, {Jane,  Doe,  Married, }]|id1|
|[{Tom,  Riddle,  Single,  888-888-8888}]                         |id2|
+-----------------------------------------------------------------+---+

having the following schema:

root
 |-- family_name: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- f_name: string (nullable = true)
 |    |    |-- l_name: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- ph_no: string (nullable = true)
 |-- id: string (nullable = true)

3 Comments

Perfect! Thank you for this elegant solution and explanation. It makes perfect sense to me. It's just that I'm using Spark 2.3.2 (I will add this as an edit to my OP).
Can you please show how I can achieve the same in Spark 2.3.X?
I added the solution for Spark 2.X
