I have data that looks like this:
{"domain_userid":"a","g_id":"1"}
{"domain_userid":"b"}
{"domain_userid":"c","g_id":""}
I'm loading this into a DataFrame with
spark.read.schema(myschema).json("/my/json")
This results in a DataFrame like this:
+--------------------+--------+
|       domain_userid|    g_id|
+--------------------+--------+
|                   a|       1|
|                   b|    null|
|                   c|        |
+--------------------+--------+
What I'm looking for is:
+--------------------+--------+
|       domain_userid|    g_id|
+--------------------+--------+
|                   a|       1|
|                   b|    null|
|                   c|    null|
+--------------------+--------+
I know I could write a UDF to map empty strings to null, but my data has many columns (100+), so applying a UDF to each one seems like it could carry a performance penalty because many transformations are involved. Is there any flag/option on the JSON parser to just write null from the start?
regexp_replace instead of a UDF