
I have this column inside myTable:

myColumn
[red, green]
[green, green, red]

I need to modify it so that red is replaced with 1 and green with 2:

myColumn
[1, 2]
[2, 2, 1]

In short, is there a way to apply a case clause to each element of the array, row-wise?

The closest I've gotten so far:

select replace(replace(to_json(myColumn), 'red', 1), 'green', 2)

On the other hand, if the column held plain strings, I could simply use:

select (
  case
    when myColumn='red' then 1
    when myColumn='green' then 2
  end
) from myTable;
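
For a reproducible setup (a sketch with assumed names: the view is called tmp, which is what the first answer below refers to), the sample data can be created in PySpark like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question, registered as the temporary view "tmp"
df = spark.createDataFrame(
    [(["red", "green"],), (["green", "green", "red"],)],
    ["myColumn"],
)
df.createOrReplaceTempView("tmp")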

4 Answers


Assuming the DataFrame has been registered as a temporary view named tmp, the following SQL statement gets the result.

sql = """
    select
        collect_list(
            case col
                when 'red' then 1
                when 'green' then 2
            end)
        myColumn
    from
        (select mid,explode(myColumn) col
        from
            (select monotonically_increasing_id() mid,myColumn
            from tmp)
        )
    group by mid
"""
df = spark.sql(sql)
df.show(truncate=False)


1 Comment

This would not preserve the order I am afraid.
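
A possible order-preserving variant (a sketch, not part of the original answer): posexplode keeps each element's position, so the rebuilt array can be sorted back into its original order. Column and view names are taken from the answer above.

sql = """
    select
        -- sort the collected (pos, mapped) pairs by position, then keep only the value
        transform(
            sort_array(collect_list(struct(pos, mapped))),
            x -> x.mapped) as myColumn
    from
        (select
            base.mid,
            t.pos,
            case t.col when 'red' then 1 when 'green' then 2 end as mapped
        from
            (select monotonically_increasing_id() as mid, myColumn from tmp) base
            lateral view posexplode(base.myColumn) t as pos, col)
    group by mid
"""
spark.sql(sql).show(truncate=False)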

In pure Spark SQL, you could convert your array into a string with concat_ws, make the substitutions with regexp_replace and then recreate the array with split.

select split(
    regexp_replace(
        regexp_replace(
            concat_ws(',', myColumn)
        , 'red', '1')
    , 'green', '2')
, ',') myColumn from df
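
One caveat worth noting (an aside, not part of the answer): split returns an array<string>, so the result contains '1' and '2' as strings rather than integers. If numeric values are needed, the elements can be cast afterwards, for example (view name df taken from the answer):

sql = """
    select
        transform(
            split(
                regexp_replace(
                    regexp_replace(concat_ws(',', myColumn), 'red', '1'),
                'green', '2'),
            ','),
            x -> cast(x as int)) as myColumn
    from df
"""
spark.sql(sql).show(truncate=False)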



You could perform a simple transform (Spark 3 onwards):

select transform(myColumn, value ->
  case value
    when 'red' then 1
    when 'green' then 2
  end) as myColumn
from myTable
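
The same expression works through the DataFrame API via expr (a sketch; the DataFrame name df and the column name are assumed):

from pyspark.sql import functions as F

df.withColumn(
    "myColumn",
    F.expr("transform(myColumn, value -> "
           "case value when 'red' then 1 when 'green' then 2 end)"),
).show(truncate=False)

On Spark 3.1 and later, the same thing can also be written with F.transform and F.when directly, without an SQL string.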



Let's create some sample data and a map containing the substitutions you want to make:

val df = Seq((1, Seq("red", "green")),
             (2, Seq("green", "green", "red")))
         .toDF("id", "myColumn")
val values = Map("red" -> "1", "green" -> "2")

The most straightforward way would be to define a UDF that does exactly what you want:

// Spark hands array columns to Scala UDFs as a Seq, so declare the parameter as Seq[String]
val replace = udf((x: Seq[String]) =>
    x.map(value => values.getOrElse(value, value)))
df.withColumn("myColumn", replace('myColumn)).show
+---+---------+
| id| myColumn|
+---+---------+
|  1|   [1, 2]|
|  2|[2, 2, 1]|
+---+---------+

Without UDFs, you could turn the array into a string with concat_ws, using separators that do not occur in your array, and then use string functions to make the edits:

val sep = ","
val replace = values
    .foldLeft(col("myColumn")){ case (column, (key, value)) =>
        regexp_replace(column, sep + key + sep, sep + value + sep) 
    }
df.withColumn("myColumn", concat(lit(sep), concat_ws(sep+sep, 'myColumn), lit(sep)))
  .withColumn("myColumn", regexp_replace(replace, "(^,)|(,$)", ""))
  .withColumn("myColumn", split('myColumn, sep+sep))
  .show

4 Comments

Is it possible to achieve the same purely by using Spark SQL?
By Spark SQL, do you mean the Spark SQL API or the actual SQL language?
The actual SQL language, my bad for being vague
No problem. I leave this here because it might help others. It has the advantage of being configurable solely with the values map. I added another answer with a similar approach in SQL.
