
I have a DataFrame with two columns: listA, stored as Seq[String], and valB, stored as String. I want to create a third column, valC, of Int type, whose value is 1 if valB is present in listA and 0 otherwise.

I tried doing the following:

val dfWithAdditionalColumn = df.withColumn("valC", when($"listA".contains($"valB"), 1).otherwise(0))

But Spark failed to execute this and gave the following error:

cannot resolve 'contains('listA', 'valB')' due to data type mismatch: argument 1 requires string type, however, 'listA' is of array type.;

How do I use an array-type column in a CASE expression?

Thanks, Devj


2 Answers


You should use array_contains:

import org.apache.spark.sql.functions.{expr, when}

df.withColumn("valC", when(expr("array_contains(listA, valB)"), 1).otherwise(0))
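If you prefer the typed Column API over an SQL expression string, the same check can be written directly (a minimal sketch; this assumes a Spark version where passing a Column as the value argument works, which holds because lit() passes an existing Column through unchanged):

```scala
import org.apache.spark.sql.functions.{array_contains, when}

// Same logic with Column arguments instead of an expression string.
df.withColumn("valC", when(array_contains($"listA", $"valB"), 1).otherwise(0))

// Or, since a Boolean casts directly to 0/1:
df.withColumn("valC", array_contains($"listA", $"valB").cast("int"))
```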



You can write a simple UDF that checks whether the element is present in the array:

val arrayContains = udf( (col1: Int, col2: Seq[Int]) => if(col2.contains(col1) ) 1 else 0 )

Then call it, passing the columns in the correct order (element first, array second):

df.withColumn("hasAInB", arrayContains($"a", $"b")).show

+---+---------+-------+
|  a|        b|hasAInB|
+---+---------+-------+
|  1|   [1, 2]|      1|
|  2|[2, 3, 4]|      1|
|  3|   [1, 4]|      0|
+---+---------+-------+
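Since the question's columns are Seq[String] and String rather than Int, the UDF body adapts directly. The per-row logic is plain Scala and can be checked on its own (a sketch; containsAsInt is a hypothetical name):

```scala
// Per-row logic of the UDF, adapted to the question's types:
// valB is a String and listA is a Seq[String].
val containsAsInt: (String, Seq[String]) => Int =
  (value, list) => if (list.contains(value)) 1 else 0

// In Spark this would be wrapped as:
//   val arrayContainsUdf = udf(containsAsInt)
//   df.withColumn("valC", arrayContainsUdf($"valB", $"listA"))

println(containsAsInt("a", Seq("a", "b")))  // 1
println(containsAsInt("c", Seq("a", "b")))  // 0
```

Note that a UDF is opaque to the Catalyst optimizer, so the built-in array_contains from the other answer is generally preferable.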

