
According to the docs for the CSV options:

Property Name   Default                          Meaning
emptyValue      (for reading), "" (for writing)  Sets the string representation of an empty value.

But it doesn't seem to work:

with open("/dbfs/tmp/c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

spark.read.csv('dbfs:/tmp/c.csv', header=True, emptyValue='emptyStr').collect()

prints:

[Row(id='1', val=None), Row(id='2', val='emptyStr'), Row(id='3', val='str1')]

I expected val='' for id='2' (instead of val='emptyStr').

How do I use the emptyValue option? The aim is to be able to represent both NULL and empty strings in a CSV file.

Also see: How to read empty string as well as NULL values from a csv file in pyspark?


1 Answer


When reading, the emptyValue option converts empty strings in the CSV file (the quoted "" field in row 4 below) into the specified value in the DataFrame, not the other way around:

with open("c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
4,""
''')

# Using emptyValue
df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
df.show()

+---+------------------+
| id|               val|
+---+------------------+
|  1|              NULL|
|  2|          emptyStr|
|  3|              str1|
|  4|empty_string_value|
+---+------------------+

# Without using emptyValue
df = spark.read.csv('c.csv', header=True)
df.show()

+---+--------+
| id|     val|
+---+--------+
|  1|    NULL|
|  2|emptyStr|
|  3|    str1|
|  4|    NULL|
+---+--------+

Note that without emptyValue, both the bare comma in row 1 and the quoted "" in row 4 come back as NULL, so the two cases are indistinguishable. When writing back to CSV, Spark converts empty strings to the specified emptyValue. To demonstrate, first map the 'empty_string_value' marker back to a real empty string:

from pyspark.sql.functions import col, when

df = df.withColumn(
    "transformed_val",
    when(col("val") == "empty_string_value", "").otherwise(col("val"))
)
df.show()

+---+------------------+---------------+
| id|               val|transformed_val|
+---+------------------+---------------+
|  1|              NULL|           NULL|
|  2|          emptyStr|       emptyStr|
|  3|              str1|           str1|
|  4|empty_string_value|               |
+---+------------------+---------------+



Now write the DataFrame out with emptyValue set to a visible marker:

df.write \
    .mode("overwrite") \
    .option("header", True) \
    .option("emptyValue", "EMPTY") \
    .csv('csv_path')

with open("csv_path/part-XXX.csv", "r") as f:
    print(f.read())

id,val,transformed_val
1,,
2,emptyStr,emptyStr
3,str1,str1
4,empty_string_value,EMPTY
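
Note that the marker doesn't round-trip automatically: reading the written files back with default options returns EMPTY as an ordinary literal string. A quick sanity check (a sketch, reading the output directory written above):

# Reading back with defaults: bare commas become NULL again,
# and the EMPTY marker is now just an ordinary string
df2 = spark.read.csv('csv_path', header=True)
df2.show()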

If you want the behavior intended in the question, you can instead transform the "emptyStr" values into empty strings (or nulls) after reading:

with open("c.csv", "w") as f:
    f.write('''id,val
1,
2,emptyStr
3,str1
''')

df = spark.read.csv('c.csv', header=True)
df = df.withColumn(
    "val",
    when(col("val") == "emptyStr", "").otherwise(col("val"))
)
df.collect()

[Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]
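
Combining the two ideas: if the file distinguishes NULLs (bare commas) from empty strings (quoted ""), a single read with a sentinel emptyValue preserves both. A minimal sketch, assuming a hypothetical __EMPTY__ marker that never occurs in the real data, applied to the four-row file from the first example:

from pyspark.sql.functions import col, when

SENTINEL = '__EMPTY__'  # hypothetical marker; any token absent from the data works

# Quoted "" fields become the sentinel; bare commas stay NULL (as shown above)
df = spark.read.csv('c.csv', header=True, emptyValue=SENTINEL)

# Map the sentinel back to a genuine empty string in every column
df = df.select([
    when(col(c) == SENTINEL, '').otherwise(col(c)).alias(c)
    for c in df.columns
])

This should give val='' for the 4,"" row while keeping the 1, row as NULL.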

1 Comment

Pretty confusing feature, to put it mildly. Thanks! Any help with the other question linked at the bottom is appreciated: How to read empty string as well as NULL values from a csv file in pyspark?. What you suggested (withColumn(..when..)) is what I'm doing right now, but I'd call it a workaround. I thought it should be possible to read a CSV with both NULL and empty string values using PySpark (as can be done in Scala Spark).
