In Spark, how can I efficiently check whether one array is contained in (is a subset of) another array?

Given the following example DataFrame (df), what are the options?

+------------+------------+
|look_in     |look_for    |
+------------+------------+
|[a, b, c]   |[a]         |
|[a, b, c]   |[d]         |
|[a, b, c]   |[a, b]      |
|[a, b, c]   |[c, d]      |
|[a, b, c]   |[a, b, c]   |
|[a, b, c]   |[a, NULL]   |
|[a, b, NULL]|[a, NULL]   |
|[a, b, NULL]|[a]         |
|[a, b, NULL]|[NULL]      |
|[a, b, c]   |NULL        |
|NULL        |[a]         |
|NULL        |NULL        |
|[a, b, c]   |[a, a]      |
|[a, a, a]   |[a]         |
|[a, a, a]   |[a, a, a]   |
|[a, a, a]   |[a, a, NULL]|
|[a, a, NULL]|[a, a, a]   |
|[a, a, NULL]|[a, a, NULL]|
+------------+------------+
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(['a', 'b', 'c'], ['a']),
     (['a', 'b', 'c'], ['d']),
     (['a', 'b', 'c'], ['a', 'b']),
     (['a', 'b', 'c'], ['c', 'd']),
     (['a', 'b', 'c'], ['a', 'b', 'c']),
     (['a', 'b', 'c'], ['a', None]),
     (['a', 'b', None], ['a', None]),
     (['a', 'b', None], ['a']),
     (['a', 'b', None], [None]),
     (['a', 'b', 'c'], None),
     (None, ['a']),
     (None, None),
     (['a', 'b', 'c'], ['a', 'a']),
     (['a', 'a', 'a'], ['a']),
     (['a', 'a', 'a'], ['a', 'a', 'a']),
     (['a', 'a', 'a'], ['a', 'a', None]),
     (['a', 'a', None], ['a', 'a', 'a']),
     (['a', 'a', None], ['a', 'a', None])],
    ['look_in', 'look_for'])
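
For reference, the table at the top can be reproduced from this DataFrame with:

df.show(truncate=False)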

1 Answer

forall, combined with array_contains, runs the containment check for every element of look_for.

Spark 3.1:

df = df.withColumn('check', F.forall('look_for', lambda x: F.array_contains('look_in', x)))

Spark 3.0:

df = df.withColumn('check', F.expr("forall(look_for, x -> array_contains(look_in, x))"))

Result:

+------------+------------+-----+
|     look_in|    look_for|check|
+------------+------------+-----+
|   [a, b, c]|         [a]| true|
|   [a, b, c]|         [d]|false|
|   [a, b, c]|      [a, b]| true|
|   [a, b, c]|      [c, d]|false|
|   [a, b, c]|   [a, b, c]| true|
|   [a, b, c]|   [a, null]| null|
|[a, b, null]|   [a, null]| null|
|[a, b, null]|         [a]| true|
|[a, b, null]|      [null]| null|
|   [a, b, c]|        null| null|
|        null|         [a]| null|
|        null|        null| null|
|   [a, b, c]|      [a, a]| true|
|   [a, a, a]|         [a]| true|
|   [a, a, a]|   [a, a, a]| true|
|   [a, a, a]|[a, a, null]| null|
|[a, a, null]|   [a, a, a]| true|
|[a, a, null]|[a, a, null]| null|
+------------+------------+-----+
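
The null rows come from array_contains, which returns NULL whenever the array column or the looked-up element is NULL. If a NULL element should instead count as contained when the other array also holds a NULL — an assumption about the desired semantics, not something the question settles — one possible null-safe variant for Spark 3.0+ swaps array_contains for exists with the null-safe equality operator <=>:

# treats NULL elements as equal, so look_for [a, NULL] against
# look_in [a, b, NULL] becomes true instead of null; rows where a
# whole column is NULL still yield null
df = df.withColumn('check', F.expr("forall(look_for, x -> exists(look_in, y -> y <=> x))"))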
2 Comments

Is there a way to achieve the same in PySpark 2.4?
You can try F.size(F.array_intersect('look_for', 'look_in')) == F.size('look_for'). But use it with care, as it behaves differently: for NULL inputs it always returns a result instead of NULL, and duplicates inside arrays produce different results than the script above.
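
A sketch of that Spark 2.4 workaround, using the columns from the example; the caveats above apply:

# array_intersect drops duplicates, so look_for ['a', 'a'] compares size 1
# against size 2 and returns false where the forall approach returns true;
# under the Spark 2.4 default (spark.sql.legacy.sizeOfNull=true), size(NULL)
# is -1, so NULL rows return a boolean instead of null
df = df.withColumn(
    'check',
    F.size(F.array_intersect('look_for', 'look_in')) == F.size('look_for')
)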
