I want to read in an NDJSON file and apply a pre-defined schema to it (rather than letting Spark infer the schema). In general this works fine, but we are unable to mark certain fields as required.
Here's a minimal NDJSON file:
{"name": "foo", "id": 1}
{"name": "bar"}
And here's some sample code:
from pyspark.sql.types import StructType, StructField, StringType

babySchema = StructType([
    StructField("id", StringType(), False),
    StructField("name", StringType(), True)
])

df = spark.read \
    .schema(babySchema) \
    .option("mode", "permissive") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .json("/path/to/*.json")
df.show()
+----+----+
| id|name|
+----+----+
| 1| foo|
|null| bar|
+----+----+
Even though the id field is declared non-nullable (StructField("id", StringType(), False)), Spark happily processes the record and just sets the field to null.
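As far as I can tell, Spark silently relaxes a user-supplied schema to all-nullable when reading from a file source, so the False flag never survives the read. Printing the resulting schema should show this (the output below is what I'd expect, assuming that behaviour):

df.printSchema()
# root
#  |-- id: string (nullable = true)    <-- the False from babySchema is gone
#  |-- name: string (nullable = true)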
How do I enforce that nullability so that, ideally, the offending record ends up in the _corrupt_record column? I tried all three modes (permissive, failfast, dropmalformed) and saw no difference.
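For reference, the only workaround I've come up with is a manual post-read split rather than real schema enforcement, something along these lines (the bad_rows/good_rows names are just for illustration):

from pyspark.sql import functions as F

# Manually route rows that violate the intended NOT NULL constraint;
# this runs after the read, so nothing ever lands in _corrupt_record.
bad_rows = df.filter(F.col("id").isNull())
good_rows = df.filter(F.col("id").isNotNull())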