I've got a DataFrame in Azure Databricks using PySpark. I need to serialize it as JSON into one or more files. Those files will eventually be uploaded to Cosmos, so it's vital that the JSON be well-formed.
I know how to connect to Cosmos and serialize the data directly, but I'm required to create the JSON files now for upload to Cosmos at a later time.
I can't share data from my actual DataFrame, but the structure is complex: each row has embedded objects, and some of those have their own embedded objects and arrays of objects.
I assume the problem is with how I'm trying to serialize the data, not how I've transformed it. I've created this simple DataFrame, df, which I think will suffice as an example.
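For reference, here's a minimal sketch of how such a df can be constructed (the column names and values come from the table below; the rest is just my repro scaffolding):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal repro: a plain string column plus an array-of-strings column.
df = spark.createDataFrame(
    [("value1", ["a", "b", "c"]),
     ("value2", ["x", "y", "z"])],
    ["property1", "array1"],
)

df.show()  # produces the table below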
+---------+-------------+
|property1|       array1|
+---------+-------------+
|   value1|["a","b","c"]|
|   value2|["x","y","z"]|
+---------+-------------+
I serialize it to Azure Data Lake Storage Gen2 like so:
df.coalesce(1).write.json(outpath, lineSep=",")
The file then contains the JSON below. The rows are not elements of an array, and the last row has a trailing comma, so this isn't well-formed JSON and Cosmos won't accept it.
{"property1":"value1","array1":["a","b","c"]},
{"property1":"value2","array1":["x","y","z"]},
By contrast, this JSON uploads as expected:
[{"property1":"value1","array1":["a","b","c"]},
{"property1":"value2","array1":["x","y","z"]}]
I've successfully uploaded single JSON objects (i.e. objects not enclosed in []), so any solution that writes each DataFrame row to its own JSON file is a potential winner.
I've tried that by repartitioning, but there are always some files with multiple rows in them.
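The attempt looked roughly like this (a sketch; outpath is the same ADLS path as above):

# Ask for as many partitions as there are rows, hoping each output
# file would get exactly one row. In practice, some files still end
# up containing two or more rows.
df.repartition(df.count()).write.json(outpath)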