2

What I want here is to replace a value in a specific column to null if it's empty String.

The reason is I am using org.apache.spark.sql.functions.coalesce to fill one of the Dataframe's column based on another columns, but I have noticed in some rows the value is empty String instead of null so the coalesce function doesn't work as expected.

val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz"),

val resolvedDf = df.select(
   df("a"),
   df("b"),
   lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)

In the above example, I expected to first fill resolved_id with column xx if it' not null and if it's null with column yy and so on. But since sometime column xx is filled with "" instead of null I get "" in 'resolved_id'.

I have tried to fix it with

resolvedDf.na.replace("resolved_id", Map("" -> null))

But based on the na.replace documentation it only works if both key and value are either Bolean or String or Double so I can not use null here.

I don't want to use UDF because of the performance issue, I just want to know is there any other trick to solve this issue?

One other way I can fix this is by using when but not sure about the performance

resolvedDf
      .withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))
1

1 Answer 1

2

This is the right way with better performance
resolvedDf.withColumn("resolved_id", when($"resolved_id" =!= "", $"resolved_id"))

Basically no need to use otherwise method.

You can check sources::: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507

/**
   * Evaluates a list of conditions and returns one of multiple possible result expressions.
   * If otherwise is not defined at the end, null is returned for unmatched conditions.
   *
   * {{{
   *   // Example: encoding gender string column into integer.
   *
   *   // Scala:
   *   people.select(when(people("gender") === "male", 0)
   *     .when(people("gender") === "female", 1)
   *     .otherwise(2))
   *
   *   // Java:
   *   people.select(when(col("gender").equalTo("male"), 0)
   *     .when(col("gender").equalTo("female"), 1)
   *     .otherwise(2))
   * }}}
   *
   * @group expr_ops
   * @since 1.4.0
   */
  def when(condition: Column, value: Any): Column = this.expr match {
    case CaseWhen(branches, None) =>
      withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
    case CaseWhen(branches, Some(_)) =>
      throw new IllegalArgumentException(
        "when() cannot be applied once otherwise() is applied")
    case _ =>
      throw new IllegalArgumentException(
        "when() can only be applied on a Column previously generated by when() function")
  }
Sign up to request clarification or add additional context in comments.

3 Comments

My point here is that, there is no performance issue with when clause. So you can use it
You can check again. Now you don't need to use otherwise method. :)
=!= for not equal

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.