- Read spark csv with empty values without converting to null doesn't answer this one because:
- That one's Scala and this is PySpark.
- Its Scala solution .option("nullValue", null) translates to PySpark's nullValue=None, which produces the wrong result, as listed below.
TL;DR -- How do I use "" as the empty value and nothing as NULL in a CSV file?
I need to represent empty strings in a CSV file that also contains some NULL values. I'm trying to use "" as the empty value and nothing as NULL. My expectation was that nullValue=None and emptyValue="" would do what I want, but both get interpreted as NULL.
I tried all combinations of nullValue and emptyValue options.
with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,""
3,str1
''')

for e, n in [('', None), ('', ''), (None, None), (None, '')]:
    print(f'e: "{e}", n: "{n}"')
    df = spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue=e, nullValue=n)
    df.show()
prints:
e: "", n: "None"
+---+-----+
| id| val|
+---+-----+
| 1| NULL|
| 2| NULL|
| 3| str1|
+---+-----+
e: "", n: ""
+---+-----+
| id| val|
+---+-----+
| 1| NULL|
| 2| NULL|
| 3| str1|
+---+-----+
e: "None", n: "None"
+---+-----+
| id| val|
+---+-----+
| 1| NULL|
| 2| NULL|
| 3| str1|
+---+-----+
e: "None", n: ""
+---+-----+
| id| val|
+---+-----+
| 1| NULL|
| 2| NULL|
| 3| str1|
+---+-----+
PS: It works in Scala, just not in Python, so I'm guessing it might have something to do with the fact that print("true" if "" else "false") prints "false" in Python.
spark.read
.option("header", "true")
.option("emptyValue", "")
.option("nullValue", null)
.csv("dbfs:/tmp/c.csv").show()
prints:
+---+-----+
| id| val|
+---+-----+
| 1| NULL|
| 2| |
| 3| str1|
+---+-----+
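The truthiness hunch in the PS can be sketched in plain Python. This is a hypothetical option handler, not Spark's actual code: if any layer between PySpark and the JVM guards options with a plain if value: check, then emptyValue='' and nullValue=None would be silently dropped, which would explain why those settings appear to have no effect.

```python
def set_option(options, key, value):
    # Hypothetical handler (not Spark's code): a truthiness guard
    # silently drops falsy values such as '' and None.
    if value:
        options[key] = value
    return options

opts = {}
set_option(opts, "emptyValue", "")     # dropped: '' is falsy
set_option(opts, "nullValue", None)    # dropped: None is falsy
set_option(opts, "nullValue", "null")  # kept: non-empty string
print(opts)  # {'nullValue': 'null'}
```

If something like this guard exists in the option-passing path, it would also explain why the same settings work from Scala, where an empty string is just a value, not a falsy one.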
Comments:
- "None" works but None does not. Also, shouldn't nullValue="None" produce Row(id='1', val="None") instead of val=None?
- The "similar" question you quoted is for Scala; I saw it before posting. I translated Scala's .option("nullValue", null) to PySpark's nullValue=None, which didn't work.
- Your explanation of the emptyValue option's meaning (on my other post) makes things clearer. (The emptyValue question in your other post was not by me.)
- .option("nullValue", null) translates to PySpark's nullValue=None, which didn't work. And nullValue="None" produces Row(id='1', val=None) instead of Row(id='1', val="None"), which is good for me but not correct if the nullValue option is meant to specify the string representation of a null value. E.g. emptyValue='emptyStr' produces Row(id='2', val='emptyStr'), so why should nullValue="None" produce Row(id='1', val=None) instead of Row(id='1', val="None")? That's the explanation I was asking for.
- Regarding emptyValue and nullValue: by default, both are set to "", but since a null value is possible for any type, it is tested before the empty value, which is only possible for string types. Therefore, empty strings are interpreted as null values by default. If you set nullValue to anything but "", like "null" or "none", empty strings will be read as empty strings and no longer as null values.
- Since both emptyValue and nullValue default to "", changing nullValue to something other than "" skips the null check for those values and lets them pass through as empty strings.
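The precedence described in those last comments can be sketched as a toy model in plain Python. This is not Spark's parser (which also accounts for quoting); it only illustrates the reported check order: null first, then empty.

```python
def interpret_field(text, null_value="", empty_value=""):
    # Toy model of the described precedence, not Spark's parser:
    # the null test runs first (null is possible for any type),
    # then the empty test (only possible for string columns).
    if text == null_value:
        return None
    if text == empty_value:
        return ""
    return text

# With the defaults (both ""), an empty field reads back as null:
print(interpret_field(""))                     # None
# A non-empty nullValue lets "" survive as an empty string:
print(interpret_field("", null_value="null"))  # ''
```

Under this model, the observed tables make sense: every combination tried in the question leaves nullValue effectively equal to "", so both the bare field and the quoted "" are claimed by the null check before the empty check ever runs.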