I'm trying to extract data using spark.sql() for performance, but I have a deeply nested JSON file that I'm having trouble getting the data out of.
Here is what the schema of the JSON looks like:
root
|-- httpStatus: long (nullable = true)
|-- httpStatusMessage: string (nullable = true)
|-- response: struct (nullable = true)
| |-- body: struct (nullable = true)
| | |-- dataProviders: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- dataProviderId: long (nullable = true)
| | | | |-- drivers: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- driverFirstName: string (nullable = true)
| | | | | | |-- driverId: long (nullable = true)
| | | | | | |-- driverLastName: string (nullable = true)
| | | | | | |-- driverRef: string (nullable = true)
| | | | | | |-- totalDistance: double (nullable = true)
| | | | | | |-- vehicles: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- deviceId: long (nullable = true)
| | | | | | | | |-- deviceRef: string (nullable = true)
| | | | | | | | |-- trips: array (nullable = true)
| | | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | | |-- averageSpeed: double (nullable = true)
| | | | | | | | | | |-- tripDistanceTravelled: double (nullable = true)
| | | | | | | | | | |-- tripDuration: double (nullable = true)
| | | | | | | | | | |-- tripId: string (nullable = true)
| | | | | | | | | | |-- tripStart: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | | | |-- tripStop: struct (nullable = true)
| | | | | | | | | | | |-- heading: double (nullable = true)
| | | | | | | | | | | |-- latitude: double (nullable = true)
| | | | | | | | | | | |-- longitude: double (nullable = true)
| | | | | | | | | | | |-- mileage: double (nullable = true)
| | | | | | | | | | | |-- speed: double (nullable = true)
| | | | | | | | | | | |-- timestamp: string (nullable = true)
| | | | | | | | |-- vehicleId: long (nullable = true)
| | | | | | | | |-- vehicleRef: string (nullable = true)
| |-- header: struct (nullable = true)
| | |-- accelUnit: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- distanceUnit: string (nullable = true)
| | |-- fleetId: long (nullable = true)
| | |-- fleetName: string (nullable = true)
| | |-- gpsUnit: string (nullable = true)
| | |-- speedUnit: string (nullable = true)
|-- timestamp: string (nullable = true)
I've been attempting to explode these fields to reach the most deeply nested ones, but I'm having trouble getting past the ArrayType columns.
Here is a sample of my code:
json_df = spark.read.json('/user/myuser/drivers_directory/driverRates.json')
json_df.printSchema()
json_df.show()
+----------+-----------------+--------------------+-------------------+
|httpStatus|httpStatusMessage| response| timestamp|
+----------+-----------------+--------------------+-------------------+
| 200| success|[[[[14, [[Eric, 1...|2020-11-11T19:46:01|
+----------+-----------------+--------------------+-------------------+
body_df = json_df.select('response.*')  # keep the DataFrame; .show() returns None
body_df.select('body.*').show()
+--------------------+
| dataProviders|
+--------------------+
|[[14, [[Eric, 100...|
+--------------------+
json_df.select('response.*').select('body.*').select('dataProviders.dataProviderId').show()
+--------------+
|dataProviderId|
+--------------+
| [14]|
+--------------+
However, doing this for every field is tedious, and I'm concerned it is bad for performance.
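For reference, the chained selects above can probably be collapsed into one dotted path, since struct fields are reachable with dot notation. This is only a sketch: the helper name `select_provider_ids` is hypothetical, and it assumes the DataFrame has the schema printed above.

```python
def select_provider_ids(df):
    """Select dataProviderId through one dotted path instead of chained selects.

    Assumes `df` has the schema printed above. Struct fields are reached with
    dots; because dataProviders is an array, the result is still an array
    column, as in the chained version.
    """
    return df.select("response.body.dataProviders.dataProviderId")


# Usage (json_df is the DataFrame read earlier):
# select_provider_ids(json_df).show()
```

This still leaves the arrays unexploded, which is the part I'm stuck on.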
I've been trying to use spark.sql() to get everything out in one query, but I'm getting errors related to StructType and ArrayType.
I want something like:
json_df.createOrReplaceTempView('driver_dictionary')
final_driver_df = spark.sql("""select
  httpStatus as status
, httpStatusMessage as message
, timestamp as time
from driver_dictionary
lateral view explode(response) as r
""")
The problem I'm running into is exploding the body and the data underneath it. Depending on which field I pass to LATERAL VIEW explode, I get either StructType errors or ArrayType errors. Any assistance would be greatly appreciated.
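To make the goal concrete, here is a rough sketch of the shape of query I'm after, not a confirmed solution. It assumes that struct fields (response, body, tripStart, tripStop) are reached with dot notation and that only the array fields (dataProviders, drivers, vehicles, trips) go through LATERAL VIEW explode. The aliases dp_view, dp, d_view, d, v_view, v, t_view, t and the helper name `flatten_trips` are made up for illustration.

```python
# One LATERAL VIEW per array level; structs are accessed with dots.
FLATTEN_TRIPS_SQL = """
SELECT
    httpStatus                AS status,
    httpStatusMessage         AS message,
    `timestamp`               AS `time`,
    dp.dataProviderId,
    d.driverId,
    d.driverFirstName,
    d.driverLastName,
    v.vehicleId,
    t.tripId,
    t.tripDistanceTravelled,
    t.tripStart.`timestamp`   AS tripStartTime,
    t.tripStop.`timestamp`    AS tripStopTime
FROM driver_dictionary
LATERAL VIEW explode(response.body.dataProviders) dp_view AS dp
LATERAL VIEW explode(dp.drivers) d_view AS d
LATERAL VIEW explode(d.vehicles) v_view AS v
LATERAL VIEW explode(v.trips) t_view AS t
"""


def flatten_trips(spark, json_df):
    """Register the DataFrame as a temp view and run the flattening query.

    `spark` is an existing SparkSession and `json_df` is the DataFrame read
    from the JSON file; the result should have one row per trip.
    """
    json_df.createOrReplaceTempView("driver_dictionary")
    return spark.sql(FLATTEN_TRIPS_SQL)


# Usage:
# final_driver_df = flatten_trips(spark, json_df)
# final_driver_df.show()
```

Note that explode drops rows whose array is null or empty; LATERAL VIEW OUTER explode would keep them with nulls instead, if that matters for this data.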