Compare value in DF struct array spark

Question

I have the following problem to solve with Spark/Scala.

I have this DataFrame:

+--------------+--------------------+
|co_tipo_arquiv|          errorCodes|
+--------------+--------------------+
|            05|[10531, 20524, 10...|

It has this schema:

root
 |-- co_tipo_arquiv: string (nullable = true)
 |-- errorCodes: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to check whether any of the codes in my error list (list_erors) appear in the errorCodes column of the DataFrame.

val list_erors = List("10531","10144")

I tried this, but it doesn't work:

dfNire.filter(col("errorCodes").isin(list_erors)).show()

Daeho Ro · Accepted Answer · 2020-03-11 15:14:16Z

1

Spark 2.4+

You can use the array_intersect function with the array of errors.

val list_errors = Array("10531","10144")

df.withColumn("intersect", array_intersect(col("errors"), lit(list_errors))).show(false)

The result is as follows:

+---+---------------------+---------+
|id |errors               |intersect|
+---+---------------------+---------+
|05 |[10531, 20524, 11111]|[10531]  |
+---+---------------------+---------+

The column names here (id, errors) are just placeholders from my test; in the question they are co_tipo_arquiv and errorCodes.
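Since the question asks for a filter rather than the intersection itself, here is a minimal sketch using arrays_overlap, which is also a Spark 2.4+ function. This is a swapped-in alternative, not the approach shown above. The sample rows and the local SparkSession are assumptions for illustration only; the column names follow the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit}

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("overlap-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data shaped like the question's DataFrame.
val dfNire = Seq(
  ("05", Seq("10531", "20524", "11111")),
  ("06", Seq("30000"))
).toDF("co_tipo_arquiv", "errorCodes")

val list_errors = Seq("10531", "10144")

// Build an array column from the Scala list, one literal per code.
val errorsCol = array(list_errors.map(lit): _*)

// Keep rows whose errorCodes share at least one element with the list (Spark 2.4+).
val matched = dfNire.filter(arrays_overlap(col("errorCodes"), errorsCol))
matched.show(false)
```

With this sample data only the row with co_tipo_arquiv "05" should remain, because it is the only one containing a code from list_errors.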

edited Mar 11, 2020 at 15:14

answered Mar 11, 2020 at 14:22

Daeho Ro


2 Comments

Leonardo Gusmão Over a year ago

import org.apache.spark.sql.functions.{array_intersect, array, lit} does not compile for me: array_intersect is not found in the Spark functions. I'm using Scala 2.11.11 and Spark 2.3.0.

Daeho Ro Over a year ago

@LeonardoGusmão Oh I see, this function is only available in Spark 2.4+.

notNull · Accepted Answer · 2020-03-11 14:44:18Z

0

If you want to check whether the array contains any of the values in list_errors:

df.show()
//+------------------+
//|        errorCodes|
//+------------------+
//|[10531, 20254, 10]|
//|              [10]|
//+------------------+


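import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

// UDF factory: the returned UDF is true when the array column shares at least one value with s.
// A null array yields false instead of a NullPointerException.
def is_exists_any(s: Seq[String]): UserDefinedFunction = udf((c: Seq[String]) => c != null && c.intersect(s).nonEmpty)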

val list_errors = Seq("10531", "10144")

df.withColumn("is_exists",is_exists_any(list_errors)(col("errorCodes"))).filter(col("is_exists") === true).show()

//+------------------+---------+
//|        errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]|     true|
//+------------------+---------+

Another way to get the rows without a UDF is to use array_intersect (Spark 2.4+) and keep only the rows where the resulting array is not empty.

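// lit() may reject a Seq ("Unsupported literal type") depending on the Spark version,
// so build the array column from individual literals. =!= replaces the deprecated !==.
df.withColumn("is_exists", array_intersect(col("errorCodes"), array(list_errors.map(lit): _*))).
filter(size(col("is_exists")) =!= 0).
show()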
//+------------------+---------+
//|        errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]|  [10531]|
//+------------------+---------+
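For Spark versions before 2.4, where array_intersect is not available, a UDF-free filter can be sketched with array_contains, which as far as I know has been part of org.apache.spark.sql.functions since Spark 1.5. The idea is one array_contains test per code, combined with OR. The local SparkSession below is an assumption for illustration; the sample rows mirror the df shown above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col}

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("contains-sketch").getOrCreate()
import spark.implicits._

// Same sample rows as the df shown above.
val df = Seq(
  Seq("10531", "20254", "10"),
  Seq("10")
).toDF("errorCodes")

val list_errors = Seq("10531", "10144")

// One array_contains test per code, OR-ed together into a single predicate.
// Note: reduce throws on an empty list, so list_errors must be non-empty.
val anyMatch = list_errors
  .map(code => array_contains(col("errorCodes"), code))
  .reduce(_ || _)

val result = df.filter(anyMatch)
result.show()
```

This keeps the whole check inside Catalyst expressions, so no UDF serialization is involved, at the cost of a predicate that grows with the length of list_errors.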

edited Mar 11, 2020 at 14:44

answered Mar 11, 2020 at 14:38

notNull


1 Comment

Leonardo Gusmão Over a year ago

Nice Dude, it worked. Thank you. Brazil thanks you for your help. haha
