
According to the docs for the CSV options:

Property Name   Default                          Meaning
emptyValue      (for reading), "" (for writing)  Sets the string representation of an empty value.

But it doesn't seem to work:

with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='emptyStr').collect()

prints:

[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]

I expected val='' for id='2' (instead of val='emptyStr').

How do I use the emptyValue option? The aim is to be able to represent both NULL and empty strings in a CSV file.

Also see: How to read empty string as well as NULL values from a csv file in pyspark?


1 Answer


When reading, the emptyValue option converts empty strings in the CSV file (the quoted "" field in row 4 below) into the specified value in the DataFrame, not the other way around:

with open("c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
4,""
''')

# Using emptyValue
df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
df.show()

+---+------------------+
| id|               val|
+---+------------------+
|  1|              NULL|
|  2|          emptyStr|
|  3|              str1|
|  4|empty_string_value|
+---+------------------+

# Without using emptyValue
df = spark.read.csv('c.csv', header=True)
df.show()

+---+--------+
| id|     val|
+---+--------+
|  1|    NULL|
|  2|emptyStr|
|  3|    str1|
|  4|    NULL|
+---+--------+

Note that without emptyValue, both the bare comma in row 1 and the quoted "" in row 4 come back as NULL, so the two cases are indistinguishable. When writing back to CSV, Spark converts empty strings to the specified emptyValue. To demonstrate, first map the 'empty_string_value' marker back to a real empty string:

from pyspark.sql.functions import col, when

df = df.withColumn(
    "transformed_val",
    when(col("val") == "empty_string_value", "").otherwise(col("val"))
)
df.show()

+---+------------------+---------------+
| id|               val|transformed_val|
+---+------------------+---------------+
|  1|              NULL|           NULL|
|  2|          emptyStr|       emptyStr|
|  3|              str1|           str1|
|  4|empty_string_value|               |
+---+------------------+---------------+



Now write the DataFrame out with emptyValue set to a visible marker:

df.write \
    .mode("overwrite") \
    .option("header", True) \
    .option("emptyValue", "EMPTY") \
    .csv('csv_path')

with open("csv_path/part-XXX.csv", "r") as f:
    print(f.read())

id,val,transformed_val
1,,
2,emptyStr,emptyStr
3,str1,str1
4,empty_string_value,EMPTY
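
Note that the marker doesn't round-trip automatically: reading the written files back with default options returns EMPTY as an ordinary literal string. A quick sanity check (a sketch, reading the output directory written above):

# Reading back with defaults: bare commas become NULL again,
# and the EMPTY marker is now just an ordinary string
df2 = spark.read.csv('csv_path', header=True)
df2.show()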

If you want the behavior intended in the question, you can instead transform the "emptyStr" values into empty strings (or nulls) after reading:

with open("c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

df = spark.read.csv('c.csv', header=True)
df = df.withColumn(
    "val",
    when(col("val") == "emptyStr", "").otherwise(col("val"))
)
df.collect()

[Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]
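
Combining the two ideas: if the file distinguishes NULLs (bare commas) from empty strings (quoted ""), a single read with a sentinel emptyValue preserves both. A minimal sketch, assuming a hypothetical __EMPTY__ marker that never occurs in the real data, applied to the four-row file from the first example:

from pyspark.sql.functions import col, when

SENTINEL = '__EMPTY__'  # hypothetical marker; any token absent from the data works

# Quoted "" fields become the sentinel; bare commas stay NULL (as shown above)
df = spark.read.csv('c.csv', header=True, emptyValue=SENTINEL)

# Map the sentinel back to a genuine empty string in every column
df = df.select([
    when(col(c) == SENTINEL, '').otherwise(col(c)).alias(c)
    for c in df.columns
])

This should give val='' for the 4,"" row while keeping the 1, row as NULL.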

1 Comment

Pretty confusing feature, to put it mildly. Thanks! Any help with the other question linked at the bottom is appreciated: How to read empty string as well as NULL values from a csv file in pyspark?. What you suggested (withColumn(..when..)) is what I'm doing right now, but I'd call it a workaround. I thought it should be possible to read a CSV with both NULL and empty string values using PySpark (as can be done in Scala Spark).
