In Databricks/Spark/Python (Spark version 2.4.0 using pyspark), I'm getting a collection from MongoDB with a field that contains an array of different objects that can be nested. I'd like to convert this to some kind of schema/struct that I can select on.
I've tried many different approaches but can't find an elegant way to convert this to a schema/struct.
Simplified JSON:
{
  "id" : "abc123",
  "parent" : [
    {
      "field1" : "1"
    },
    {
      "field1" : "11"
    },
    {
      "field2" : "2",
      "field3" : {
        "field3a" : "3a",
        "field3b" : "3b"
      }
    },
    {
      "field4" : "4",
      "field5" : "5"
    },
    {
      "field4" : "44",
      "field5" : "55"
    }
  ]
}
The objects under parent can differ from document to document, so defining a specific schema that covers every case is overly complex. Also note that the same field can occur in multiple elements of one parent array.
Approach 1: Auto schema. Using spark.read.format("com.mongodb.spark.sql.DefaultSource") results in a parent field whose element struct is the union of every field seen across documents, so each element carries a lot of null values for the fields it doesn't define.
Approach 2: JSON functions. Databricks has a good article on Transforming Complex Data Types. It suggests that struct("*"), json_tuple, or another such function could apply here, but I couldn't find any combination that worked.
Approach 3: Dynamic schema. Using the schema below was somewhat successful, but it doesn't handle nested fields and it forces all field values to string.
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, MapType, StringType, StructType

# Read "parent" as a raw JSON string first...
schema = (StructType()
    .add("id", StringType())
    .add("parent", StringType())
)
df = get_my_mongodb_collection_with_schema_function(..., schema)

# ...then parse it into an array of string-to-string maps.
parent_schema = ArrayType(
    MapType(StringType(), StringType())
)
df = df.withColumn('parent', from_json(df['parent'], parent_schema))