
I have data that looks like this:

{"domain_userid":"a","g_id":"1"}
{"domain_userid":"b"}
{"domain_userid":"c","g_id":""}

I'm loading this into a DataFrame with

spark.read.schema(myschema).json("/my/json") 

This results in a DataFrame like this:

+--------------------+--------+
|       domain_userid|g_id    |
+--------------------+--------+
|a                   | 1      |
|b                   | null   |
|c                   |        |
+--------------------+--------+

What I'm looking for is

+--------------------+--------+
|       domain_userid|g_id    |
+--------------------+--------+
|a                   | 1      |
|b                   |    null|
|c                   |    null|
+--------------------+--------+

I know I could write a UDF to map empty strings to null, but my data has many columns (100+), so running that many transformations seems like it could carry a performance penalty. Is there any flag/option on the JSON parser to just write null from the start?

1 Comment

You can use regexp_replace instead of a UDF. Commented Mar 31, 2017 at 9:35
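Note that regexp_replace substitutes a replacement string, so it cannot produce a literal null by itself; a when/otherwise expression is the closer built-in fit. A minimal single-column sketch along those lines, using the g_id column from the question:

from pyspark.sql import functions as F

# Map empty strings in g_id to real nulls without a Python UDF.
df = df.withColumn("g_id", F.when(F.col("g_id") == "", F.lit(None)).otherwise(F.col("g_id")))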

2 Answers


It turns out that the CSV reader has such an option:

nullValue (default empty string): sets the string representation of a null value

However, this option has not been implemented for the JSON reader (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#json-org.apache.spark.sql.Dataset-).
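For comparison, this is how the option would be set with the CSV source (the empty string is already its default); myschema is the schema from the question and the CSV path here is hypothetical:

spark.read.option("nullValue", "").schema(myschema).csv("/my/data.csv")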


json file:

{"domain_userid":"","g_id":"1"}
{"domain_userid":"b"}
{"domain_userid":"c","g_id":""}

try this:

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')

# UDF that takes every column of a row and maps empty strings to None;
# list(...) materialises Python 3's lazy map for the ArrayType return value.
myfunc = f.UserDefinedFunction(lambda *args: list(map(lambda x: None if x == '' else x, args)),
                               returnType=ArrayType(StringType()))
cols = df.columns
# Apply the UDF, flatten the single array column back into rows, then rebuild the DataFrame.
df = df.select(myfunc(*cols)).rdd.flatMap(lambda x: x)
df = spark.createDataFrame(df, schema=cols)
df.show()

and output:

+-------------+----+
|domain_userid|g_id|
+-------------+----+
|         null|   1|
|            b|null|
|            c|null|
+-------------+----+

1 Comment

This will work, but it does not answer my question. I asked if there exists a parsing flag/option for treating empty strings as null, so I can solve this without writing a UDF. Also, this solution assumes that all my columns are of StringType, which is not the case (I parse using a schema), so you should use df.dtypes instead of df.columns.
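A schema-aware, UDF-free sketch of what this comment suggests, using df.dtypes so only string-typed columns are rewritten (the when/otherwise form is an assumption here, not part of the original answer):

from pyspark.sql import functions as F

# One expression per column: string columns get empty-string -> null,
# all other columns pass through unchanged.
exprs = [
    F.when(F.col(name) == "", F.lit(None)).otherwise(F.col(name)).alias(name)
    if dtype == "string" else F.col(name)
    for name, dtype in df.dtypes
]
df = df.select(*exprs)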
