
How can I explode nested JSON data when the structs/arrays in the schema have no names?

For example:

root
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- data: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- age: long (nullable = true)
 |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |-- statistic: struct (nullable = true)
 |    |    |    |    |    |    |    |-- a: long (nullable = true)
 |    |    |    |    |    |    |    |-- b: long (nullable = true)


The schema, as shown by a JSON viewer in Notepad++:

items
 -[0]: object
   -street: [array]
     -[0]: object
       -statistic: [object]
1. I tried to load the data into a dataframe (using the multiline option), register it as a temp view, and query it: spark.sql("select explode(items) as new_item from TempView").show(1, True). This returns an array, but not in the tabular form I expected.

2. explode on its own also didn't work. Could you please help me reach "statistic", given that the node objects have no names to explode (only indexes like [0])? I want to load the statistic data into a table.

  • Does this give any results: df.select(F.col('items')[0]['street'][0]['data'][0]['statistic']).show()? (expanded in the sketch after these comments) Commented Dec 23, 2020 at 8:59
  • @mck I am getting the value, but how can I do this for all keys, like the explode function does? Commented Dec 23, 2020 at 9:23
  • Could you show a sample JSON in your question? Commented Dec 23, 2020 at 10:09
  • @mck Sorry, I will not be able to provide data due to data privacy. Commented Dec 23, 2020 at 12:11
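
Expanding on the indexing suggestion in the first comment, here is a minimal sketch (assuming the DataFrame is named df and pyspark.sql.functions is imported as F) of what positional access can and cannot do:

from pyspark.sql import functions as F

# [...] indexing works for both array positions and struct fields,
# but [0] only ever reads the FIRST element of each array:
stat = F.col('items')[0]['street'][0]['data'][0]['statistic']
df.select(stat.alias('statistic')).select('statistic.*').show()

# To get every element ("all keys", all rows) rather than just index 0,
# each array has to be exploded, as the answers below demonstrate.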

2 Answers


Since you didn't upload a sample file, I can only guess what your file looks like. See if this is what you want:

from pyspark.sql import functions as F

df = spark.read.option('multiline', 'true').json('test.json')

df2 = (df.select(F.explode('items').alias('items'))    # one row per items element
         .select('items.*')                            # flatten the struct -> street
         .select(F.explode('street').alias('street'))  # one row per street element
         .select('street.*')                           # flatten -> data
         .select(F.explode('data').alias('data'))      # one row per data element
         .select('data.*')                             # flatten -> age, name, statistic
         .select('*', 'statistic.*')                   # pull a and b out of the struct
         .drop('statistic'))

df2.show()
+---+----+---+---+
|age|name|  a|  b|
+---+----+---+---+
| 24|John|  1|  2|
| 25|Mary|  2|  3|
+---+----+---+---+

JSON file:

{"items":
    [{"street":
        [{"data":
            [{"statistic": {"a": 1, "b": 2}, "name": "John", "age": 24},
             {"statistic": {"a": 2, "b": 3}, "name": "Mary", "age": 25}
            ]
        }]
    }]
}

Schema:

df.printSchema()
root
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- street: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- data: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- age: long (nullable = true)
 |    |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |    |-- statistic: struct (nullable = true)
 |    |    |    |    |    |    |    |-- a: long (nullable = true)
 |    |    |    |    |    |    |    |-- b: long (nullable = true)
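
For what it's worth, a sketch of an alternative using Spark SQL's inline generator, which turns an array of structs into one row per element and one column per field, collapsing each explode-then-flatten pair above into a single step (same df as above):

df3 = (df.selectExpr('inline(items)')    # -> street
         .selectExpr('inline(street)')   # -> data
         .selectExpr('inline(data)')     # -> age, name, statistic
         .select('*', 'statistic.*')
         .drop('statistic'))
df3.show()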

4 Comments

This worked well, but one concern: I have some more fields alongside "statistic" (not under it) which are not nested (just key-value pairs). I want those in my table columns as well, along with "a" and "b". I've updated the schema accordingly, FYI, please check.
What if I have a bunch of arrays alongside "statistic", say a "details" column? How can I put it all together? (see the sketch after these comments)
@Pikun95 Then adapt the answer accordingly by adding more explodes and select asterisks. You should be able to try it yourself :)
Yes, for sure. I will try to work it out and practice.
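
Picking up the hypothetical "details" array from the comment above (the details column and its contents are assumptions, not part of the original data), a sketch of how the answer could be extended; note that each extra explode multiplies the row count:

# Hypothetical: each "data" element also carries a "details" array of
# structs next to "statistic".
df3 = (df.select(F.explode('items').alias('items'))
         .select('items.*')
         .select(F.explode('street').alias('street'))
         .select('street.*')
         .select(F.explode('data').alias('data'))
         .select('data.*')                            # age, name, statistic, details
         .select('*', 'statistic.*')
         .drop('statistic')
         .withColumn('detail', F.explode('details'))  # one row per details element
         .drop('details')
         .select('*', 'detail.*')                     # flatten each detail's fields
         .drop('detail'))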

This can be implemented in Scala as below.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReadJson {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\Software\\Hadoop")
    val spark = SparkSession
      .builder()
      .appName("Testing")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val jsonDF = spark.read.options(Map("multiline" -> "true")).json("<local_or_hdfs_path>/sample.json")

    // Explode each nested array in turn, flattening the resulting struct
    // with .select("col.*") before exploding the next level down.
    val extractedDF = jsonDF.select(explode($"items").alias("items")).select($"items.*")
      .select(explode($"street").alias("street")).select($"street.*")
      .select(explode($"data").alias("data")).select($"data.*")
      .select("*", "statistic.*").drop($"statistic")

    extractedDF.show(false)
  }
}

