In the first file, nested1.json, the JSON element "image" is nested. The data is structured like this:
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
The resulting schema is correctly inferred by Spark:
val df1 = spark.read.json("/xxx/xxxx/xxxx/nested1.json")
df1.printSchema
root
|-- id: string (nullable = true)
|-- image: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- url: string (nullable = true)
| |-- width: long (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
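With this inferred schema the nested fields can be queried directly using dot notation, for example:
df1.select("id", "image.url", "image.width", "image.height").show()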
The second file, nested2.json, contains a mixture of nested and non-nested records (in the second line below, the element "image" is not nested):
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
{"id": "0002", "type": "donut", "name": "CupCake", "image": "images/0001.jpg"}
The resulting schema does not contain the nested structure:
val df2 = spark.read.json("/xxx/xxx/xxx/nested2.json")
df2.printSchema
root
|-- id: string (nullable = true)
|-- image: string (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
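To see how the row with the nested object is represented in that string column, the values can be inspected with:
df2.select("id", "image").show(false)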
Why is Spark not able to infer the nested schema when non-nested elements are present?
How can I process a JSON file containing a mixture of records like this using Spark and Scala?
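One idea I had is to keep "image" as the inferred string column and then parse the rows that hold a JSON object with from_json, falling back to the plain value otherwise. A rough sketch of that idea (the struct schema is copied from the first file; the column names image_struct and image_url are just ones I made up, and I am assuming the object rows survive inference as their raw JSON text):

import org.apache.spark.sql.functions.{coalesce, col, from_json}
import org.apache.spark.sql.types._

// Schema of the nested "image" object, taken from nested1.json
val imageSchema = StructType(Seq(
  StructField("url", StringType),
  StructField("width", LongType),
  StructField("height", LongType)
))

// Rows where "image" is a plain path should produce null from from_json;
// assumption: rows where "image" was an object come through as raw JSON text
val parsed = df2.withColumn("image_struct", from_json(col("image"), imageSchema))

// Fall back to the plain string when the value was not a nested object
val unified = parsed.withColumn("image_url", coalesce(col("image_struct.url"), col("image")))

unified.select("id", "name", "image_url", "image_struct.width", "image_struct.height").show(false)

Is something like this a reasonable way to handle such files, or is there a cleaner approach?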