I have the following problem to solve with Spark/Scala.

I have this DataFrame:

+--------------+--------------------+
|co_tipo_arquiv|          errorCodes|
+--------------+--------------------+
|            05|[10531, 20524, 10...|
+--------------+--------------------+

with this schema:

root
 |-- co_tipo_arquiv: string (nullable = true)
 |-- errorCodes: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to check whether any of the codes in my error list (list_erors) appear in the errorCodes column of the DataFrame.

val list_erors = List("10531","10144")

I tried this, but it doesn't work:

dfNire.filter(col("errorCodes").isin(list_erors)).show()

2 Answers

Spark 2.4+

You can use the array_intersect function with the array of errors.

val list_errors = Array("10531","10144")

df.withColumn("intersect", array_intersect(col("errors"), lit(list_errors))).show(false)

The result is as follows:

+---+---------------------+---------+
|id |errors               |intersect|
+---+---------------------+---------+
|05 |[10531, 20524, 11111]|[10531]  |
+---+---------------------+---------+

The column names here (id, errors) are placeholders from my test; replace them with your own.
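
If only a yes/no check per row is needed, Spark 2.4+ also provides arrays_overlap, which returns true when two arrays share at least one non-null element. The following is a minimal sketch, assuming the column names from the question; the sample rows, the local SparkSession, and the names listErrors, errorsCol and matched are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit}

// Local session for illustration only; in spark-shell one already exists.
val spark = SparkSession.builder().master("local[1]").appName("overlap-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows shaped like the DataFrame in the question.
val dfNire = Seq(
  ("05", Seq("10531", "20524", "11111")),
  ("06", Seq("99999"))
).toDF("co_tipo_arquiv", "errorCodes")

val listErrors = Seq("10531", "10144")

// Build an array column from the Scala list, one literal per code.
val errorsCol = array(listErrors.map(lit): _*)

// Keep only the rows that share at least one code with the list (Spark 2.4+).
val matched = dfNire.filter(arrays_overlap(col("errorCodes"), errorsCol))
matched.show(false)
```

Compared with array_intersect, this avoids the extra column when the matching codes themselves are not needed.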

2 Comments

I tried import org.apache.spark.sql.functions.{array_intersect, array, lit}, but array_intersect is not found in the Spark functions. I'm using Scala 2.11.11 and Spark 2.3.0.
@LeonardoGusmão Oh I see, this function is only available in Spark 2.4+.

If you want to check whether the array contains any of the values in list_errors, you can use a UDF:

df.show()
//+------------------+
//|        errorCodes|
//+------------------+
//|[10531, 20254, 10]|
//|              [10]|
//+------------------+


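import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

// Returns a UDF that is true when the array column shares at least one value with s.
// The null check guards against rows where errorCodes is null (the column is nullable).
def is_exists_any(s: Seq[String]): UserDefinedFunction =
  udf((c: Seq[String]) => c != null && c.intersect(s).nonEmpty)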

val list_errors = Seq("10531", "10144")

df.withColumn("is_exists",is_exists_any(list_errors)(col("errorCodes"))).filter(col("is_exists") === true).show()

//+------------------+---------+
//|        errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]|     true|
//+------------------+---------+

Another way, without a UDF, is to use array_intersect (Spark 2.4+) and keep only the rows where the resulting array is not empty.

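// Build the array column from individual literals; lit() on a Scala Seq can fail
// with an unsupported literal type error.
df.withColumn("is_exists", array_intersect(col("errorCodes"), array(list_errors.map(lit): _*))).
  filter(size(col("is_exists")) > 0).
  show()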
//+------------------+---------+
//|        errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]|  [10531]|
//+------------------+---------+
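
On Spark versions before 2.4 (such as the 2.3.0 mentioned in the comments above), a UDF-free option is to combine one array_contains test per code into a single predicate. This is only a sketch under assumptions: the sample rows, the id column, the local SparkSession, and the names listErrors, anyMatch and matched are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, lit}

// Local session for illustration only; in spark-shell one already exists.
val spark = SparkSession.builder().master("local[1]").appName("contains-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows; "id" exists only to tell the rows apart.
val df = Seq(
  ("a", Seq("10531", "20254", "10")),
  ("b", Seq("10"))
).toDF("id", "errorCodes")

val listErrors = Seq("10531", "10144")

// One array_contains test per code, OR-ed together.
// Starting the fold from lit(false) keeps an empty list safe: nothing matches.
val anyMatch = listErrors
  .map(code => array_contains(col("errorCodes"), code))
  .foldLeft(lit(false))(_ || _)

val matched = df.filter(anyMatch)
matched.show(false)
```

The predicate grows with the number of codes, so for long lists the UDF above may be easier to read.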

1 Comment

Nice Dude, it worked. Thank you. Brazil thanks you for your help. haha
