I am using Spark 2.1 and Zeppelin 0.7 to do the following (this is inspired by the Databricks tutorial: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html).
I have created the following schema:
import org.apache.spark.sql.types._

val jsonSchema = new StructType()
  .add("Records", ArrayType(new StructType()
    .add("Id", IntegerType)
    .add("eventDt", StringType)
    .add("appId", StringType)
    .add("userId", StringType)
    .add("eventName", StringType)
    .add("eventValues", StringType)
  ))
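Just as a sanity check (not part of the pipeline), printing the schema tree confirms the nesting I intend, one Records array of structs:

// inspect the schema shape
jsonSchema.printTreeString()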
to read in the following JSON 'array' file, which I have in my 'inputPath' directory:
{
  "Records": [{
    "Id": 9550,
    "eventDt": "1491810477700",
    "appId": "dandb01",
    "userId": "985580",
    "eventName": "OG: HR: SELECT",
    "eventValues": "985087"
  },
  ... other records
  ]
}
val rawRecords = spark.read.schema(jsonSchema).json(inputPath)
I then want to explode these records to get at the individual events:
val events = rawRecords.select(explode($"Records").as("record"))
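For completeness, this assumes the usual imports are in scope; Zeppelin normally provides the spark session and the implicits automatically, but written out explicitly they would be:

// assumed imports for explode and the $"..." column syntax
import org.apache.spark.sql.functions.explode
import spark.implicits._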
But rawRecords.show() and events.show() both show only null values.
Any idea what I am doing wrong? I know that in the past I would have had to use JSON Lines (JSONL) for this, but the Databricks tutorial suggests that the latest version of Spark should now support JSON arrays.
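For reference, a quick diagnostic I can run against the same inputPath (letting Spark infer the schema instead of supplying mine) to see how the file is actually being parsed:

// let Spark infer the schema and inspect what it parsed from the file
val inferred = spark.read.json(inputPath)
inferred.printSchema()
inferred.show(5, truncate = false)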