What I want here is to replace a value in a specific column to null if it's empty String.
The reason is I am using org.apache.spark.sql.functions.coalesce to fill one of the Dataframe's column based on another columns, but I have noticed in some rows the value is empty String instead of null so the coalesce function doesn't work as expected.
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz"),
val resolvedDf = df.select(
df("a"),
df("b"),
lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)
In the above example, I expected to first fill resolved_id with column xx if it' not null and if it's null with column yy and so on. But since sometime column xx is filled with "" instead of null I get "" in 'resolved_id'.
I have tried to fix it with
resolvedDf.na.replace("resolved_id", Map("" -> null))
But based on the na.replace documentation it only works if both key and value are either Bolean or String or Double so I can not use null here.
I don't want to use UDF because of the performance issue, I just want to know is there any other trick to solve this issue?
One other way I can fix this is by using when but not sure about the performance
resolvedDf
.withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))