Explode Array-Nested Array Spark SQL

Question

I have a JSON structure with multiple nested arrays, like this:

{
    "id": 10,
    "packages": [
        {
            "packageId": 1010,
            "clusters": [
                {
                    "fieldClusterId": 101010,
                    "fieldDefinitions": [
                        {
                            "fieldId": 101011112999,
                            "fieldName": "EntityId"
                        }
                    ]
                }
            ]
        }
    ]
}

I'm using Spark SQL to flatten the nested arrays into something like this:

id	packageId	fieldClusterId	fieldId	fieldName
10	1010	101010	101011112999	EntityId

The query ends up being a fairly ugly Spark SQL CTE with multiple steps:

%sql
with cte as (
  -- one row per package
  select
    id,
    explode(packages) as packages_exploded
  from temp),
cte2 as (
  -- one row per cluster within each package
  select
    id,
    packages_exploded.packageId,
    explode(packages_exploded.clusters) as col
  from cte),
cte3 as (
  -- one row per field definition within each cluster
  select
    id,
    packageId,
    col.fieldClusterId,
    explode(col.fieldDefinitions) as col
  from cte2)
select
    id,
    packageId,
    fieldClusterId,
    col.*
from cte3

Is there a nicer syntax for this multi-level explosion?

Yes, the process is metadata driven and requires a Spark SQL query.
– Dumbledore__, Commented Jun 10, 2021 at 20:46

Kafels · Accepted Answer · 2021-06-10 21:02:12Z

This is how I would implement it:

SELECT id,
       packageId,
       fieldClusterId,
       inline(fieldDefinitions)
FROM
  (SELECT id,
          packageId,
          inline(clusters)
   FROM
     (SELECT id,
             inline(packages)
      FROM TABLE_NAME))
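
The query above relies on inline, a built-in Spark SQL generator that turns an array of structs into one row per element and one column per struct field. That is why packageId and clusters become selectable by name in the next level out. A minimal sketch of that behavior on a literal array (the values and the clusterCount field are made up for illustration and are not from the question):

```sql
-- Illustration only: a hand-built array of two structs.
-- inline() should emit one row per struct, with columns named
-- after the struct fields (packageId, clusterCount).
SELECT inline(
  array(
    named_struct('packageId', 1010, 'clusterCount', 1),
    named_struct('packageId', 2020, 'clusterCount', 3)
  )
);
```

Note that Spark allows only one generator such as inline or explode per SELECT clause, which is why each level of nesting gets its own subquery here.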


2 Comments

Dumbledore__ Over a year ago

I'm hoping for a cleaner syntax than a multi-stage CTE or a succession of subqueries. Do you think it's possible to do this without either of these complexities?

Kafels Over a year ago

It is not possible, because you're exploding arrays from a nested object. If packages, clusters and fieldDefinitions were array columns in a flat structure, you could use inline(arrays_zip(packages, clusters, fieldDefinitions)), but that's not the case.
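
For readers who still want a single flat query: Spark SQL also supports the Hive-style LATERAL VIEW clause, and several of them can be chained in one FROM clause, each one referring to the alias produced by the previous one. The following is only a sketch under assumptions: the view is named temp as in the question, the CREATE statement merely rebuilds the sample row so the query can be tried in isolation, and the aliases p, c, f, pkg, clus and fdef are invented for this example.

```sql
-- Assumption: rebuild the question's sample row as a temporary view named temp.
CREATE OR REPLACE TEMPORARY VIEW temp AS
SELECT
  10 AS id,
  array(
    named_struct(
      'packageId', 1010,
      'clusters', array(
        named_struct(
          'fieldClusterId', 101010,
          'fieldDefinitions', array(
            named_struct('fieldId', 101011112999, 'fieldName', 'EntityId')
          )
        )
      )
    )
  ) AS packages;

-- One SELECT, three chained generators; each LATERAL VIEW explodes
-- an array taken from the struct produced by the previous one.
SELECT
  id,
  pkg.packageId,
  clus.fieldClusterId,
  fdef.fieldId,
  fdef.fieldName
FROM temp
LATERAL VIEW explode(packages) p AS pkg
LATERAL VIEW explode(pkg.clusters) c AS clus
LATERAL VIEW explode(clus.fieldDefinitions) f AS fdef;
```

Plain LATERAL VIEW drops parent rows whose array is empty or null; LATERAL VIEW OUTER keeps them with nulls instead, if that matters for the metadata-driven process.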
