This is not a duplicate of the similar Scala question that was suggested:
  1. That one is Scala and this is PySpark.
  2. The Scala solution .option("nullValue", null) translates to PySpark's nullValue=None, which produces the wrong result, as shown below.

TL;DR: How do I get "" read as an empty string and nothing read as NULL from a CSV file?

I need to store empty strings in a CSV file that also contains some NULL values. I'm trying to use "" for an empty string and nothing for NULL. My expectation was that emptyValue="" and nullValue=None would do what I want, but both get interpreted as NULL.

I tried all combinations of the nullValue and emptyValue options:

with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,""
3,str1
''')

for e, n in [('', None), ('', ''), (None, None), (None, '')]:
    print(f'e: "{e}", n: "{n}"')
    df = spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue=e, nullValue=n).show()

prints:

e: "", n: "None"
+---+-----+
| id|  val|
+---+-----+
|  1| NULL|
|  2| NULL|
|  3| str1|
+---+-----+

e: "", n: ""
+---+-----+
| id|  val|
+---+-----+
|  1| NULL|
|  2| NULL|
|  3| str1|
+---+-----+

e: "None", n: "None"
+---+-----+
| id|  val|
+---+-----+
|  1| NULL|
|  2| NULL|
|  3| str1|
+---+-----+

e: "None", n: ""
+---+-----+
| id|  val|
+---+-----+
|  1| NULL|
|  2| NULL|
|  3| str1|
+---+-----+

PS: It works in Scala, just not in Python. So I'm guessing it might have something to do with the fact that print("true" if "" else "false") prints "false" in Python.

spark.read
    .option("header", "true")
    .option("emptyValue", "")
    .option("nullValue", null)
    .csv("dbfs:/tmp/c.csv").show()

prints:

+---+-----+
| id|  val|
+---+-----+
|  1| NULL|
|  2|     |
|  3| str1|
+---+-----+
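
To spell that guess out, here is a plain Python truthiness illustration; whether PySpark's option handling actually contains a check like this is only a guess, not something taken from its source:

# both "" and None are falsy, so a check like `if value:` cannot tell them
# apart, while a non-empty string such as "None" is truthy
for value in ("", None, "None"):
    print(repr(value), "->", "truthy" if value else "falsy")
# '' -> falsy
# None -> falsy
# 'None' -> truthy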

I've read:

  • @AdrianKlaver 1. If you post this as an answer I can accept it. Would appreciate explanation of why "None" works but None does not. Also shouldn't nullValue="None" produce Row(id='1', val="None"), instead of val=None? ----2. "Similar" question you quoted is for Scala, I saw it before posting, I translated scala's .option("nullValue", null) to pyspark's nullValue=None, which didn't work. ----3. Your explanation of emptyValue option's meaning (to my other post) makes things clearer. Commented Aug 1 at 17:34
  • 1) The explanation is in the answer to the linked question. 2) The language may be different, but the underlying process that Spark uses to determine the values is the same. 3) The explanation of emptyValue in your other post was not by me. Commented Aug 1 at 17:44
  • Scala's .option("nullValue", null) translates to pyspark's nullValue=None, which didn't work. AND nullValue="None" produces Row(id='1', val=None) instead of Row(id='1', val="None"), which is good for me but not correct if nullValue option is meant to specify the string repr of a null value. E.g. emptyValue='emptyStr' will produce Row(id='2', val='emptyStr'), so why should nullValue="None" produce Row(id='1', val=None) instead of Row(id='1', val="None"). That's the explanation I was asking for. Commented Aug 1 at 17:56
  • From the other question's answer: Two other options may be of interest to you though: emptyValue and nullValue. By default, they are both set to "", but since the null value is possible for any type, it is tested before the empty value, which is only possible for the string type. Therefore, empty strings are interpreted as null values by default. If you set nullValue to anything but "", like "null" or "none", empty strings will be read as empty strings and not as null values anymore. Commented Aug 1 at 18:03
  • So by default emptyValue and nullValue are both set to ""; by changing nullValue to something other than "", it skips over those values and lets them pass through as an empty string. Commented Aug 1 at 18:04
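
Putting those comments together, a minimal sketch of the workaround they describe (not verified here; the sentinel string is arbitrary and just has to be something that never occurs as real data):

# with nullValue set to a non-empty sentinel, a quoted "" no longer matches
# nullValue and falls through to emptyValue, while a truly missing field is
# still read as NULL
df = spark.read.csv(
    'dbfs:/tmp/c.csv',
    header=True,
    emptyValue='',       # quoted "" -> empty string
    nullValue='None',    # any non-empty sentinel that never appears in the data
)
df.show()
# expected, per the discussion above:
# +---+----+
# | id| val|
# +---+----+
# |  1|NULL|
# |  2|    |
# |  3|str1|
# +---+----+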

1 Answer


The thing to keep in mind is that CSV is not a properly designed, specced-out format intended for seamless data interchange. It happened to show up as an easy way to edit tabular data in a text file, and gained traction for being convenient to export to and from spreadsheets (at a time when the dominant spreadsheet format was a proprietary binary format, in contrast with today's archived XML files).

That said, unless you have a protocol in a layer above CSV itself that post-processes the values read from CSV and assigns special meaning to certain tokens, CSV can't convey the difference between a missing value and an empty value.

If you are in control of both producing the CSV file and reading the data back, the first, easy, and correct advice is simply not to use CSV at all, but instead a format with real specs: that could be Parquet files or even SQLite database files. Both preserve typing (unlike CSV, which always requires some level of guessing at a column's type on reading), and both support special values such as NULL.
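
As an illustration of that advice (a minimal sketch, assuming you control both sides; the path is made up), Parquet keeps NULL and the empty string distinct with no reader options at all:

# round-trip through Parquet: null and empty string stay distinct
data = [(1, None), (2, ""), (3, "str1")]
df = spark.createDataFrame(data, "id INT, val STRING")
df.write.mode("overwrite").parquet("dbfs:/tmp/c.parquet")

spark.read.parquet("dbfs:/tmp/c.parquet").show()
# row 1 comes back as NULL and row 2 as an empty string, with no guessing involved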

That said, it may be that Spark has some fine-tunable options for reading a file that can make "" behave differently from nothing between two commas (a missing value), but that is not part of any CSV "spec" or similar. You'd still be better off conveying this data in another format.

That said, the Python stdlib csv module, from Python 3.12 on (and not before), can behave the way you are expecting when reading, if you pass csv.QUOTE_NOTNULL as the quoting argument when creating a csv reader. However, that also means numbers are parsed as strings and, as I said before, have to be converted back to numbers in a layer above the CSV parsing:

with open("c.csv", "w") as f:
    f.write('''id,val
1,
2,""
3,str1
''')

import csv

print(list(csv.reader(open("c.csv"), quoting=csv.QUOTE_NOTNULL)))
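
To tie this back to the DataFrame the question works with, and to the "convert in a layer above" point, one possible bridge (an untested sketch, assuming Python 3.12+ on the driver and a local file) is to parse with QUOTE_NOTNULL yourself and hand the rows to Spark; the int() call is that extra conversion layer:

import csv

with open("c.csv", newline="") as f:
    reader = csv.DictReader(f, quoting=csv.QUOTE_NOTNULL)
    # the "layer above" CSV: convert id back to an int; val stays None / "" / "str1"
    rows = [(int(r["id"]), r["val"]) for r in reader]

df = spark.createDataFrame(rows, "id INT, val STRING")
df.show()
# id 1 should come back with val NULL, id 2 with an empty string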




2 Comments

It is 100% me. The first 3 paragraphs state that there is no easy way to produce the output the OP desired without another layer on top of CSV, and explain the historical motives for that, if you take care to read them; and then this is what happens. The TL;DR of the answer is: avoid CSV.
Thanks, but this is a very long not-an-answer answer. As Adrian pointed out, the code doesn't solve the problem listed in the OP. It doesn't even produce a DataFrame to begin with. There are a million alternatives to using spark.read.csv(), including opening the csv file in a notepad, reading it with your eyes and processing it with your brain! But none of these are an answer to the question in the OP. And "use parquet" is impractical here; this is a unit-test data file for feature testing. It's meant to be human readable and easily editable as more complex test cases are added.
