In the first file, nested1.json, the JSON element "image" is nested. The data is structured like this:
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
The resulting schema is correctly inferred by Spark:
val df1 = spark.read.json("/xxx/xxxx/xxxx/nested1.json")
df1.printSchema
root
|-- id: string (nullable = true)
|-- image: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- url: string (nullable = true)
| |-- width: long (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
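With this inferred schema the nested fields can be queried directly using dot notation, for example:
df1.select("id", "image.url", "image.width", "image.height").show()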
The second file, nested2.json, contains a mixture of nested and non-nested records (in the second line below, the element "image" is not nested):
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
{"id": "0002", "type": "donut", "name": "CupCake", "image": "images/0001.jpg"}
The resulting schema does not contain the nested structure:
val df2 = spark.read.json("/xxx/xxx/xxx/nested2.json")
df2.printSchema
root
|-- id: string (nullable = true)
|-- image: string (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
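To see how the row with the nested object is represented in that string column, the values can be inspected with:
df2.select("id", "image").show(false)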
Why is Spark not able to infer the nested schema when non-nested elements are present?
How can I process a JSON file containing a mixture of records like this using Spark and Scala?
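One idea I had is to keep "image" as the inferred string column and then parse the rows that hold a JSON object with from_json, falling back to the plain value otherwise. A rough sketch of that idea (the struct schema is copied from the first file; the column names image_struct and image_url are just ones I made up, and I am assuming the object rows survive inference as their raw JSON text):

import org.apache.spark.sql.functions.{coalesce, col, from_json}
import org.apache.spark.sql.types._

// Schema of the nested "image" object, taken from nested1.json
val imageSchema = StructType(Seq(
  StructField("url", StringType),
  StructField("width", LongType),
  StructField("height", LongType)
))

// Rows where "image" is a plain path should produce null from from_json;
// assumption: rows where "image" was an object come through as raw JSON text
val parsed = df2.withColumn("image_struct", from_json(col("image"), imageSchema))

// Fall back to the plain string when the value was not a nested object
val unified = parsed.withColumn("image_url", coalesce(col("image_struct.url"), col("image")))

unified.select("id", "name", "image_url", "image_struct.width", "image_struct.height").show(false)

Is something like this a reasonable way to handle such files, or is there a cleaner approach?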