I’m reading documents from DocDB (MongoDB) into Spark using the mongo-spark-connector.
One of the fields, fieldA, is a nested object. If fieldA is missing or empty in a document, I replace it with an empty string ("") in my query. This setup had been working fine, but I recently ran into an issue.
I was reading about 14,000 documents, and only 4 of them had no fieldA. The rest either had a full nested object or a smaller object with just a few fields. Because of this mix, Spark now throws:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(subField,StringType,true)) (value: BsonString{value=''})
Here’s the DocDB query I’m using:
"db_fieldA": {
  $cond: [
    {
      $or: [
        { $eq: [ { $ifNull: ["$fieldA", null] }, null ] },
        { $eq: [ { $size: { $objectToArray: "$fieldA" } }, 0 ] }
      ]
    },
    "",
    "$fieldA"
  ]
}
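For context, here's roughly how I load the data. The connection string is a placeholder and the wiring is simplified from my actual job, with the projection above wrapped in a $project stage and passed as the connector's aggregation pipeline:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("docdb-fieldA-read")
  // placeholder connection string / database / collection
  .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
  .getOrCreate()

// The $cond projection shown above, wrapped in a $project stage
val pipeline =
  """[{ "$project": { "db_fieldA": { "$cond": [
    |  { "$or": [
    |    { "$eq": [ { "$ifNull": ["$fieldA", null] }, null ] },
    |    { "$eq": [ { "$size": { "$objectToArray": "$fieldA" } }, 0 ] }
    |  ] },
    |  "",
    |  "$fieldA"
    |] } } }]""".stripMargin

val df = spark.read
  .format("mongo")
  .option("pipeline", pipeline) // schema is inferred by sampling documents
  .load()

As far as I can tell, the read itself is fine; it's the inferred schema for db_fieldA (a struct containing subField) clashing with the "" values that triggers the exception above.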
Examples of fieldA:
Full nested object:
{
  "symbol": "ABCD",
  "siteName": "example.com",
  "desc": "PURCHASE DEMO STORE #9999",
  "tags": [
    {
      "type": "subscription_fee",
      "pattern": "monthly",
      "contextEvents": [
        {
          "amount": 99.99,
          "date": "2025-01-15"
        }
      ]
    }
  ],
  "lineage": {
    "source": "system_test",
    "version": "1.0",
    "processedBy": "ETL-Dummy-Job",
    "timestamp": "2025-08-08T12:00:00Z"
  }
}
Smaller object (the case where I'm facing the issue):
{
  "subField": "some_value"
}
Problematic case (4 documents):
{}
Is there a way to make Spark always treat fieldA as StringType instead of StructType when it infers the schema?
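For illustration, something along these lines is the direction I'm thinking of, though I don't know whether the connector would then deliver the nested objects as JSON strings or just fail the cast in the other direction (other fields omitted; spark and pipeline are as in the snippet above):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Explicit schema so inference is skipped; I want db_fieldA kept as a plain string
val desiredSchema = StructType(Seq(
  StructField("db_fieldA", StringType, nullable = true)
))

val dfStringFieldA = spark.read
  .format("mongo")
  .option("pipeline", pipeline)  // same pipeline as above
  .schema(desiredSchema)         // does the connector honor StringType for a nested object here?
  .load()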
Environment:
- Spark 3.1.2
- Scala 2.12
- mongo-spark-connector_2.12