Scala Spark Replace empty String with NULL

Question

What I want is to replace a value in a specific column with null if it is an empty String.

The reason is that I am using org.apache.spark.sql.functions.coalesce to fill one of the DataFrame's columns based on other columns, but I have noticed that in some rows the value is an empty String instead of null, so the coalesce function doesn't work as expected.

val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz")

val resolvedDf = df.select(
   df("a"),
   df("b"),
   lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)

In the above example, I expected resolved_id to be filled from column xx if it is not null, from column yy if xx is null, and so on. But since column xx is sometimes filled with "" instead of null, I get "" in resolved_id.

I have tried to fix it with

resolvedDf.na.replace("resolved_id", Map("" -> null))

But according to the na.replace documentation, it only works if both key and value are Boolean, String, or Double, so I cannot use null here.

I don't want to use a UDF because of the performance cost. Is there any other trick to solve this issue?

One other way I can fix this is by using when, but I am not sure about the performance:

resolvedDf
  .withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))

Please look at this question: stackoverflow.com/questions/45615621/…
– Raman Mishra, Commented Sep 24, 2018 at 18:00

Harneet Singh · Accepted Answer · 2021-11-17 13:04:56Z


Using when is the right approach, and it performs well because it is a built-in column expression rather than a UDF:
resolvedDf.withColumn("resolved_id", when($"resolved_id" =!= "", $"resolved_id"))

There is no need to call the otherwise method: when it is omitted, rows that do not match the condition get null.
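To see the effect on actual rows, here is a minimal sketch. It assumes a local SparkSession and made-up sample data; the object name EmptyToNullDemo and the helper emptyToNull are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object EmptyToNullDemo {
  // Hypothetical helper: keep the value only when it is a non-empty string.
  // Rows that do not match (empty string, or already null) become null
  // because no otherwise() branch is given.
  def emptyToNull(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, when(col(column) =!= "", col(column)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-to-null")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: an empty string, a real value, and a null.
    val resolvedDf = Seq[(String, Option[String])](
      ("r1", Some("")),
      ("r2", Some("abc")),
      ("r3", None)
    ).toDF("a", "resolved_id")

    // r1 and r3 end up null in resolved_id; r2 keeps "abc".
    emptyToNull(resolvedDf, "resolved_id").show()

    spark.stop()
  }
}
```

A null input stays null here as well: comparing null with "" yields null, which when treats as an unmatched condition.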

You can check the source: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507

/**
   * Evaluates a list of conditions and returns one of multiple possible result expressions.
   * If otherwise is not defined at the end, null is returned for unmatched conditions.
   *
   * {{{
   *   // Example: encoding gender string column into integer.
   *
   *   // Scala:
   *   people.select(when(people("gender") === "male", 0)
   *     .when(people("gender") === "female", 1)
   *     .otherwise(2))
   *
   *   // Java:
   *   people.select(when(col("gender").equalTo("male"), 0)
   *     .when(col("gender").equalTo("female"), 1)
   *     .otherwise(2))
   * }}}
   *
   * @group expr_ops
   * @since 1.4.0
   */
  def when(condition: Column, value: Any): Column = this.expr match {
    case CaseWhen(branches, None) =>
      withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
    case CaseWhen(branches, Some(_)) =>
      throw new IllegalArgumentException(
        "when() cannot be applied once otherwise() is applied")
    case _ =>
      throw new IllegalArgumentException(
        "when() can only be applied on a Column previously generated by when() function")
  }
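
The question's underlying goal was for coalesce to fall through to the next column when an earlier one holds "". Replacing "" after the coalesce only turns the result into null, so one option is to apply the same when expression to each input column before coalescing. The following is a minimal sketch under stated assumptions: the columns xx, yy and zz are string-typed, the names CoalesceSkippingEmpty, nonEmpty and resolve are hypothetical, and col(name) stands in for the question's adjust(x), whose definition is not shown.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lower, when}

object CoalesceSkippingEmpty {
  // Hypothetical helper: turn "" into null so that coalesce can skip it.
  def nonEmpty(name: String): Column = when(col(name) =!= "", col(name))

  // First value among `columns` that is neither null nor empty, lowercased.
  def resolve(df: DataFrame, columns: Seq[String]): DataFrame =
    df.withColumn("resolved_id", lower(coalesce(columns.map(nonEmpty): _*)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("coalesce-skipping-empty")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data mixing empty strings and nulls.
    val df = Seq[(Option[String], Option[String], Option[String])](
      (Some(""), Some("Y1"), Some("z1")),   // xx is empty, so yy is used
      (None, Some(""), Some("Z2")),         // xx is null, yy is empty, so zz is used
      (Some("X3"), Some("y3"), None)        // xx is usable as is
    ).toDF("xx", "yy", "zz")

    resolve(df, Seq("xx", "yy", "zz")).show()

    spark.stop()
  }
}
```

If every listed column is null or empty for a row, resolved_id is null for that row.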

edited Nov 17, 2021 at 13:04

answered Sep 24, 2018 at 18:37


3 Comments

Harneet Singh Over a year ago

My point is that there is no performance issue with the when clause, so you can use it.

Harneet Singh Over a year ago

You can check again. Now you don't need to use the otherwise method. :)

brendon Over a year ago

=!= for not equal
