
I have a JSON structure with multiple nested arrays, like this:

{
    "id": 10,
    "packages": [
        {
            "packageId": 1010,
            "clusters": [
                {
                    "fieldClusterId": 101010,
                    "fieldDefinitions": [
                        {
                            "fieldId": 101011112999,
                            "fieldName": "EntityId"
                        }
                    ]
                }
            ]
        }
    ]
}

I'm using Spark SQL to flatten the nested arrays into something like this:

id packageId fieldClusterId fieldId fieldName
10 1010 101010 101011112999 EntityId

The query ends up being a fairly ugly Spark SQL CTE with multiple steps:

%sql
-- Level 1: one row per package
with cte as (
  select
    id,
    explode(packages) as packages_exploded
  from temp),
-- Level 2: one row per cluster
cte2 as (
  select
    id,
    packages_exploded.packageId,
    explode(packages_exploded.clusters) as col
  from cte),
-- Level 3: one row per field definition
cte3 as (
  select
    id,
    packageId,
    col.fieldClusterId,
    explode(col.fieldDefinitions) as col
  from cte2)
select
    id,
    packageId,
    fieldClusterId,
    col.*
from cte3

Is there a nicer syntax for exploding multiple levels of nested arrays like this?

  • Is it required to solve it only using SQL? Commented Jun 10, 2021 at 20:44
  • Yes, the process is metadata-driven and requires a Spark SQL query. Commented Jun 10, 2021 at 20:46

1 Answer

This is how I would implement it:

SELECT id,
       packageId,
       fieldClusterId,
       inline(fieldDefinitions)
FROM
  (SELECT id,
          packageId,
          inline(clusters)
   FROM
     (SELECT id,
             inline(packages)
      FROM TABLE_NAME))

2 Comments

I'm hoping for a cleaner syntax than a multi-stage CTE or a succession of subqueries. Do you think it's possible to do this without either of those?
It is not possible, because you're exploding arrays from a nested object. If packages, clusters and fieldDefinitions were sibling array columns in a flat structure, you could use inline(arrays_zip(packages, clusters, fieldDefinitions)), but that's not the case here.
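
For reference, the flat case described in the last comment could look roughly like the sketch below. It is only an illustration: the inline table `flat` and its array columns `a` and `b` are hypothetical and are not part of the question's data. It assumes a Spark version that provides `arrays_zip` (2.4 or later).

```sql
-- Hypothetical flat table: a and b are sibling array columns of equal length,
-- unlike the nested packages -> clusters -> fieldDefinitions in the question.
WITH flat AS (
  SELECT 10 AS id,
         array(1, 2)     AS a,
         array('x', 'y') AS b
)
-- arrays_zip pairs the arrays element by element into an array of structs;
-- inline then turns each struct into one output row.
SELECT id,
       inline(arrays_zip(a, b))
FROM flat
```

This yields one row per array position, pairing the first elements with each other and then the second elements. It works only because the arrays sit side by side and line up by index, which is not true when one array is nested inside the elements of another.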
