
In Databricks/Spark/Python (Spark version 2.4.0 using pyspark), I'm getting a collection from MongoDB with a field that contains an array of different objects that can be nested. I'd like to convert this to some kind of schema/struct that I can select on.

I've tried many different approaches but can't find an elegant way to convert this to a schema/struct.

Simplified JSON:

{ 
    "id" : "abc123", 
    "parent" : [
        {
            "field1" : "1"
        },
        {
            "field1" : "11"
        }, 
        {
            "field2" : "2", 
            "field3" : {
                "field3a" : "3a", 
                "field3b" : "3b"
            }
        }, 
        {
            "field4" : "4", 
            "field5" : "5",
        },
        {
            "field4" : "44", 
            "field5" : "55",
        }
    ]
}

The objects under parent can differ from document to document, so defining an explicit schema that covers every case is impractical. Also note that the same field can occur multiple times within a single parent.
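Outside Spark, the irregular shape of parent is easy to see with plain Python (a minimal sketch using only the standard json module):

```python
import json

raw = """
{
    "id": "abc123",
    "parent": [
        {"field1": "1"},
        {"field1": "11"},
        {"field2": "2", "field3": {"field3a": "3a", "field3b": "3b"}},
        {"field4": "4", "field5": "5"},
        {"field4": "44", "field5": "55"}
    ]
}
"""

doc = json.loads(raw)

# Each array element carries a different set of keys, and the same
# key (field1) appears in more than one element.
key_sets = [sorted(obj.keys()) for obj in doc["parent"]]
print(key_sets)
# [['field1'], ['field1'], ['field2', 'field3'], ['field4', 'field5'], ['field4', 'field5']]
```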

Approach 1: Auto schema. Using spark.read.format("com.mongodb.spark.sql.DefaultSource") results in a parent field that has a mix of all fields with a lot of null values.

Approach 2: JSON functions. Databricks has a good article on Transforming Complex Data Types. It suggests that struct("*"), json_tuple, or another function could be used here, but I couldn't find any combination that worked.

Approach 3: Dynamic schema. Reading parent as a string and re-parsing it with from_json was somewhat successful, but this doesn't handle nested fields and forces all field values to strings.

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StringType, ArrayType, MapType

schema = (StructType()
  .add("id", StringType())
  .add("parent", StringType())
)

df = get_my_mongdb_collection_with_schema_function(..., schema)

parent_schema = ArrayType(
    MapType(StringType(), StringType())
)

df = df.withColumn('parent', from_json(df['parent'], parent_schema))
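The string-only limitation follows directly from the chosen type: MapType(StringType(), StringType()) can only hold string values, so anything nested has to be flattened to text. A rough pure-Python model of that constraint (not Spark's actual parser):

```python
import json

parent = [
    {"field1": "1"},
    {"field2": "2", "field3": {"field3a": "3a", "field3b": "3b"}},
]

# Modeling MapType(StringType(), StringType()): every value must be a
# string, so nested objects like field3 end up as serialized JSON text
# rather than addressable structs.
as_maps = [
    {k: v if isinstance(v, str) else json.dumps(v) for k, v in obj.items()}
    for obj in parent
]

print(type(as_maps[1]["field3"]))  # <class 'str'> -- the nesting is lost
```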

1 Answer


The get_json_object function generally achieves what's needed here. It would be ideal if all JSONPath operators were supported; however, it looks like only the following operators are supported (although I have found it difficult to confirm this).

$ Root object
. Child operator
[] Subscript operator for array
* Wildcard for []

When reading the data, a schema is specified to force the column containing json to type string.

from pyspark.sql.types import StructType, StringType

schema = (StructType()
  .add("id", StringType())
  .add("parent", StringType())
)

I was able to add a column to a DataFrame using withColumn, or simply by using get_json_object as part of a select, e.g.:

from pyspark.sql.functions import get_json_object

df = df.withColumn('field2', get_json_object(df['parent'], '$[2].field2'))

df = df.select(get_json_object(df['parent'], '$[2].field2').alias('field2'))

Casting could of course be added on top of this to get correctly typed columns.

Because my source JSON is an array, I'm accessing each object as an array element, so field2 lives in the third array element, i.e. index = 2. This approach feels brittle because the order of the data now matters. However, it's also possible to use a wildcard to select across all array elements, e.g. $[*].field2. The child operator can also be used to reach nested data, e.g. $[2].field3.field3a.

It's unclear how best to handle duplicate field names, but the following JSONPath will return an array of values:

$[*].field1

Return value:
["1", "11"]

Note that I have not considered/tested the performance impact of using get_json_object.
