
I have data that looks like this:

{"domain_userid":"a","g_id":"1"}
{"domain_userid":"b"}
{"domain_userid":"c","g_id":""}

I'm loading this into a DataFrame with

spark.read.schema(myschema).json("/my/json") 

This results in a DataFrame like this:

+--------------------+--------+
|       domain_userid|g_id    |
+--------------------+--------+
|a                   | 1      |
|b                   | null   |
|c                   |        |
+--------------------+--------+

What I'm looking for is

+--------------------+--------+
|       domain_userid|g_id    |
+--------------------+--------+
|a                   | 1      |
|b                   |    null|
|c                   |    null|
+--------------------+--------+

I know I could write a UDF to map empty strings to null, but my data has many columns (100+), so running that many transformations seems like it could carry a performance penalty. Is there any flag/option on the JSON parser to just write null from the start?

1 Comment

You can use regexp_replace instead of a UDF. Commented Mar 31, 2017 at 9:35
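Note that regexp_replace substitutes a replacement string, so it cannot produce a literal null by itself; a when/otherwise expression is the closer built-in fit. A minimal single-column sketch along those lines, using the g_id column from the question:

from pyspark.sql import functions as F

# Map empty strings in g_id to real nulls without a Python UDF.
df = df.withColumn("g_id", F.when(F.col("g_id") == "", F.lit(None)).otherwise(F.col("g_id")))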

2 Answers


It turns out that the CSV reader has such an option:

nullValue (default empty string): sets the string representation of a null value

However, this option has not been implemented for the JSON reader (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#json-org.apache.spark.sql.Dataset-).
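For comparison, this is how the option would be set with the CSV source (the empty string is already its default); myschema is the schema from the question and the CSV path here is hypothetical:

spark.read.option("nullValue", "").schema(myschema).csv("/my/data.csv")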


json file:

{"domain_userid":"","g_id":"1"}
{"domain_userid":"b"}
{"domain_userid":"c","g_id":""}

try this:

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')

# UDF that takes every column of a row and maps empty strings to None;
# list(...) materialises Python 3's lazy map for the ArrayType return value.
myfunc = f.UserDefinedFunction(lambda *args: list(map(lambda x: None if x == '' else x, args)),
                               returnType=ArrayType(StringType()))
cols = df.columns
# Apply the UDF, flatten the single array column back into rows, then rebuild the DataFrame.
df = df.select(myfunc(*cols)).rdd.flatMap(lambda x: x)
df = spark.createDataFrame(df, schema=cols)
df.show()

and output:

+-------------+----+
|domain_userid|g_id|
+-------------+----+
|         null|   1|
|            b|null|
|            c|null|
+-------------+----+

1 Comment

This will work, but it does not answer my question. I asked if there exists a parsing flag/option for treating empty strings as null, so I can solve this without writing a UDF. Also, this solution assumes that all my columns are of StringType, which is not the case (I parse using a schema), so you should use df.dtypes instead of df.columns.
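A schema-aware, UDF-free sketch of what this comment suggests, using df.dtypes so only string-typed columns are rewritten (the when/otherwise form is an assumption here, not part of the original answer):

from pyspark.sql import functions as F

# One expression per column: string columns get empty-string -> null,
# all other columns pass through unchanged.
exprs = [
    F.when(F.col(name) == "", F.lit(None)).otherwise(F.col(name)).alias(name)
    if dtype == "string" else F.col(name)
    for name, dtype in df.dtypes
]
df = df.select(*exprs)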
