It isn't obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of arrays of the innermost structs. Unfortunately, this nesting prevents you from drilling further down with something like Animal.Species.mammal.description, so you need to flatten it first and then use getField().
If I understand your schema correctly, the following JSON should be a valid input:
{
  "Animal": {
    "Species": [
      {
        "mammal": [
          { "description": "llama" },
          { "description": "sheep" }
        ]
      },
      {
        "mammal": [
          { "description": "rabbit" },
          { "description": "hare" }
        ]
      }
    ]
  }
}
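import org.apache.spark.sql.functions.{array_join, col, flatten}
// multiLine is needed because the JSON document above spans several lines
val df = spark.read.option("multiLine", true).json("data.json")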
df.printSchema
// root
//  |-- Animal: struct (nullable = true)
//  |    |-- Species: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- mammal: array (nullable = true)
//  |    |    |    |    |-- element: struct (containsNull = true)
//  |    |    |    |    |    |-- description: string (nullable = true)
df.select("Animal.Species.mammal").show(false)
// +----------------------------------------+
// |mammal                                  |
// +----------------------------------------+
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// +----------------------------------------+
df.select(flatten(col("Animal.Species.mammal"))).show(false)
// +------------------------------------+
// |flatten(Animal.Species.mammal)      |
// +------------------------------------+
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// +------------------------------------+
This is now an array of structs and you can use getField("description") to obtain the array of interest:
df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// +--------------------------------------------------------+
// |flatten(Animal.Species.mammal AS mammal#173).description|
// +--------------------------------------------------------+
// |[llama, sheep, rabbit, hare]                            |
// +--------------------------------------------------------+
Finally, array_join with separator ", " can be used to obtain the desired string:
df.select(
  array_join(
    flatten(col("Animal.Species.mammal")).getField("description"),
    ", "
  ) as "animals"
).show(false)
// +--------------------------+
// |animals                   |
// +--------------------------+
// |llama, sheep, rabbit, hare|
// +--------------------------+
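For intuition only, the same three steps (select through the outer array, flatten, then project the field and join) can be mimicked on plain Scala collections with no Spark involved. This is a rough sketch: the case classes Mammal, Species and Animal are hypothetical stand-ins for the JSON schema above, not anything Spark provides, and the snippet is written script-style (Scala REPL or Scala 3 top-level definitions).

```scala
// Hypothetical case classes mirroring the JSON schema above (not part of Spark).
final case class Mammal(description: String)
final case class Species(mammal: Seq[Mammal])
final case class Animal(species: Seq[Species])

val animal = Animal(Seq(
  Species(Seq(Mammal("llama"), Mammal("sheep"))),
  Species(Seq(Mammal("rabbit"), Mammal("hare")))
))

// Selecting "through" the outer sequence yields a sequence of sequences,
// analogous to what Animal.Species.mammal returns in Spark.
val nested: Seq[Seq[Mammal]] = animal.species.map(_.mammal)

// flatten removes one level of nesting, like Spark's flatten function.
val flat: Seq[Mammal] = nested.flatten

// Projecting the field and joining plays the role of getField + array_join.
val animals: String = flat.map(_.description).mkString(", ")
```

Here animals ends up as "llama, sheep, rabbit, hare", matching the Spark result above.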