
How can I explode nested JSON data when the structs/arrays in the schema have no names?

For example:

root
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- data: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- age: long (nullable = true)
 |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |-- statistic: struct (nullable = true)
 |    |    |    |    |    |    |    |-- a: long (nullable = true)
 |    |    |    |    |    |    |    |-- b: long (nullable = true)


The schema, as shown by a JSON viewer in Notepad++:

items
 -[0]: object
   -street: [array]
     -[0]: object
       -statistic: [object]
1. I tried to load the data into a dataframe (using the multiline option), register it as a temp view, and query it: spark.sql("select explode(items) as new_item from TempView").show(1, True). This returns an array, but not in the tabular form I expected.

2. explode on its own also didn't work. Could you please help me reach "statistic", given that the node objects have no names to explode (only indexes like [0])? I want to load the statistic data into a table.

  • Does this give any results: df.select(F.col('items')[0]['street'][0]['data'][0]['statistic']).show()? (expanded in the sketch after these comments) Commented Dec 23, 2020 at 8:59
  • @mck I am getting the value, but how can I do this for all keys, like the explode function does? Commented Dec 23, 2020 at 9:23
  • Could you show a sample JSON in your question? Commented Dec 23, 2020 at 10:09
  • @mck Sorry, I will not be able to provide data due to data privacy. Commented Dec 23, 2020 at 12:11
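
Expanding on the indexing suggestion in the first comment, here is a minimal sketch (assuming the DataFrame is named df and pyspark.sql.functions is imported as F) of what positional access can and cannot do:

from pyspark.sql import functions as F

# [...] indexing works for both array positions and struct fields,
# but [0] only ever reads the FIRST element of each array:
stat = F.col('items')[0]['street'][0]['data'][0]['statistic']
df.select(stat.alias('statistic')).select('statistic.*').show()

# To get every element ("all keys", all rows) rather than just index 0,
# each array has to be exploded, as the answers below demonstrate.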

2 Answers


Since you didn't upload a sample file, I can only guess what your file looks like. See if this is what you want:

from pyspark.sql import functions as F

df = spark.read.option('multiline', 'true').json('test.json')

df2 = (df.select(F.explode('items').alias('items'))    # one row per items element
         .select('items.*')                            # flatten the struct -> street
         .select(F.explode('street').alias('street'))  # one row per street element
         .select('street.*')                           # flatten -> data
         .select(F.explode('data').alias('data'))      # one row per data element
         .select('data.*')                             # flatten -> age, name, statistic
         .select('*', 'statistic.*')                   # pull a and b out of the struct
         .drop('statistic'))

df2.show()
+---+----+---+---+
|age|name|  a|  b|
+---+----+---+---+
| 24|John|  1|  2|
| 25|Mary|  2|  3|
+---+----+---+---+

JSON file:

{"items":
    [{"street":
        [{"data":
            [{"statistic": {"a": 1, "b": 2}, "name": "John", "age": 24},
             {"statistic": {"a": 2, "b": 3}, "name": "Mary", "age": 25}
            ]
        }]
    }]
}

Schema:

df.printSchema()
root
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- data: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- age: long (nullable = true)
 |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |-- statistic: struct (nullable = true)
 |    |    |    |    |    |    |    |-- a: long (nullable = true)
 |    |    |    |    |    |    |    |-- b: long (nullable = true)
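
For what it's worth, a sketch of an alternative using Spark SQL's inline generator, which turns an array of structs into one row per element and one column per field, collapsing each explode-then-flatten pair above into a single step (same df as above):

df3 = (df.selectExpr('inline(items)')    # -> street
         .selectExpr('inline(street)')   # -> data
         .selectExpr('inline(data)')     # -> age, name, statistic
         .select('*', 'statistic.*')
         .drop('statistic'))
df3.show()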

4 Comments

This worked well, but one concern: I have some more fields alongside "statistic" (not under it) which are not nested (just key-value pairs). I want those in my table columns as well, along with "a" and "b". I've updated the schema accordingly, FYI, please check.
What if I have a bunch of arrays alongside "statistic", say a "details" column? How can I put it all together? (see the sketch after these comments)
@Pikun95 Then adapt the answer accordingly by adding more explodes and select asterisks. You should be able to try it yourself :)
Yes, for sure. I will try to work it out and practice.
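
Picking up the hypothetical "details" array from the comment above (the details column and its contents are assumptions, not part of the original data), a sketch of how the answer could be extended; note that each extra explode multiplies the row count:

# Hypothetical: each "data" element also carries a "details" array of
# structs next to "statistic".
df3 = (df.select(F.explode('items').alias('items'))
         .select('items.*')
         .select(F.explode('street').alias('street'))
         .select('street.*')
         .select(F.explode('data').alias('data'))
         .select('data.*')                            # age, name, statistic, details
         .select('*', 'statistic.*')
         .drop('statistic')
         .withColumn('detail', F.explode('details'))  # one row per details element
         .drop('details')
         .select('*', 'detail.*')                     # flatten each detail's fields
         .drop('detail'))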

This can be implemented in Scala as below.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReadJson {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\Software\\Hadoop")
    val spark = SparkSession
      .builder()
      .appName("Testing")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val jsonDF = spark.read.options(Map("multiline" -> "true")).json("<local_or_hdfs_path>/sample.json")

    // Explode each nested array in turn, flattening the resulting struct
    // with .select("col.*") before exploding the next level down.
    val extractedDF = jsonDF.select(explode($"items").alias("items")).select($"items.*")
      .select(explode($"street").alias("street")).select($"street.*")
      .select(explode($"data").alias("data")).select($"data.*")
      .select("*", "statistic.*").drop($"statistic")

    extractedDF.show(false)
  }
}

