1

I am trying to iterate over an array of array as a column in Spark dataframe. Looking for the best way to do this.

Schema:

root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- mammal: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)

Currently I am using this logic. This only gets the first array.

df.select(
   col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)

Pseudo Logic:

col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")

Desired Example Output (flattened elements as string)

llama, sheep, rabbit, hare

2 Answers 2

1

Not obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of array of the innermost structs. Unfortunately, this array of array prevents you from being able to drill further down with something like Animal.Species.mammal.description, so you need to flatten it first, then use getField().

If I understand your schema correctly, the following JSON should be a valid input:

{
  "Animal": {
    "Species": [
      {
        "mammal": [
          { "description": "llama" },
          { "description": "sheep" }
        ]
      },
      {
        "mammal": [
          { "description": "rabbit" },
          { "description": "hare" }
        ]
      }
    ]
  }
}
val df = spark.read.json("data.json")
df.printSchema
// root
//  |-- Animal: struct (nullable = true)
//  |    |-- Species: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- mammal: array (nullable = true)
//  |    |    |    |    |-- element: struct (containsNull = true)
//  |    |    |    |    |    |-- description: string (nullable = true)

df.select("Animal.Species.mammal").show(false)
// +----------------------------------------+
// |mammal                                  |
// +----------------------------------------+
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// +----------------------------------------+

df.select(flatten(col("Animal.Species.mammal"))).show(false)
// +------------------------------------+
// |flatten(Animal.Species.mammal)      |
// +------------------------------------+
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// +------------------------------------+

This is now an array of structs and you can use getField("description") to obtain the array of interest:

df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// +--------------------------------------------------------+
// |flatten(Animal.Species.mammal AS mammal#173).description|
// +--------------------------------------------------------+
// |[llama, sheep, rabbit, hare]                            |
// +--------------------------------------------------------+

Finally, array_join with separator ", " can be used to obtain the desired string:

df.select(
  array_join(
    flatten(col("Animal.Species.mammal")).getField("description"),
    ", "
  ) as "animals"
).show(false)
// +--------------------------+
// |animals                   |
// +--------------------------+
// |llama, sheep, rabbit, hare|
// +--------------------------+
Sign up to request clarification or add additional context in comments.

1 Comment

This is an elegant and simple solution! Awesome!
1

You can apply explode two times: first time on Animal.Species and second time on the result of the first time:

import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
  .withColumn("tmp", explode(col("tmp.mammal")))
  .select("tmp.description")
  .show()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.