On read, the emptyValue option converts empty strings in the CSV file (i.e. quoted empty fields, "") into the value you specify, not the other way around:
with open("c.csv", "w") as f:
f.write('''id,val
1,
2,emptyStr
3,str1
4,""
''')
# Using emptyValue
df = spark.read.csv('c.csv', header=True, emptyValue='empty_string_value')
df.show()
+---+------------------+
| id| val|
+---+------------------+
| 1| NULL|
| 2| emptyStr|
| 3| str1|
| 4|empty_string_value|
+---+------------------+
# Without using emptyValue
df = spark.read.csv('c.csv', header=True)
df.show()
+---+--------+
| id| val|
+---+--------+
| 1| NULL|
| 2|emptyStr|
| 3| str1|
| 4| NULL|
+---+--------+
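As an aside, note why row 1 comes back NULL in both reads: unquoted empty fields match the reader's nullValue option (which defaults to the empty string), while emptyValue only applies to quoted "" fields. A minimal sketch of nullValue on the same c.csv, which maps a chosen file-side string to NULL on read:
df = spark.read.csv('c.csv', header=True, nullValue='emptyStr')
df.show()
# row 2's val now comes back as NULL instead of the literal string 'emptyStr'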
When writing back to CSV, Spark converts empty strings in the DataFrame into the emptyValue specified:
from pyspark.sql.functions import col, when

df = df.withColumn(
    "transformed_val",
    when(col("val") == "empty_string_value", "").otherwise(col("val"))
)
df.show()
+---+------------------+---------------+
| id| val|transformed_val|
+---+------------------+---------------+
| 1| NULL| NULL|
| 2| emptyStr| emptyStr|
| 3| str1| str1|
| 4|empty_string_value| |
+---+------------------+---------------+
df.write \
    .mode("overwrite") \
    .option("header", True) \
    .option("emptyValue", "EMPTY") \
    .csv('csv_path')
with open("csv_path/part-XXX.csv", "r") as f:
print(f.read())
id,val,transformed_val
1,,
2,emptyStr,emptyStr
3,str1,str1
4,empty_string_value,EMPTY
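For contrast, if you omit emptyValue on write, recent Spark versions default it to a quoted "", so empty strings stay distinguishable from nulls in the file (a sketch; csv_path_default is just an illustrative path):
df.write \
    .mode("overwrite") \
    .option("header", True) \
    .csv('csv_path_default')
# Expected file contents (sketch):
# id,val,transformed_val
# 1,,
# 2,emptyStr,emptyStr
# 3,str1,str1
# 4,empty_string_value,""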
If you want the behavior you're intending in the question, you could instead transform the "emptyStr" values into empty strings (or nulls) after reading:
with open("c.csv", "w") as f:
f.write('''id,val
1,
2,emptyStr
3,str1
''')
df = spark.read.csv('c.csv', header=True)
df = df.withColumn(
    "val",
    when(col("val") == "emptyStr", "").otherwise(col("val"))
)
df.collect()
[Row(id='1', val=None), Row(id='2', val=''), Row(id='3', val='str1')]
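Or, to map "emptyStr" to a true null rather than an empty string, a minimal variant of the same transform using lit(None):
from pyspark.sql.functions import col, lit, when

df = spark.read.csv('c.csv', header=True)
df = df.withColumn(
    "val",
    when(col("val") == "emptyStr", lit(None)).otherwise(col("val"))
)
df.collect()
# expected: [Row(id='1', val=None), Row(id='2', val=None), Row(id='3', val='str1')]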