In Spark, how can I efficiently check whether one array is contained in (is a subset of) another array?

Given the following example DataFrame (df), what are the options?

+------------+------------+
|look_in     |look_for    |
+------------+------------+
|[a, b, c]   |[a]         |
|[a, b, c]   |[d]         |
|[a, b, c]   |[a, b]      |
|[a, b, c]   |[c, d]      |
|[a, b, c]   |[a, b, c]   |
|[a, b, c]   |[a, NULL]   |
|[a, b, NULL]|[a, NULL]   |
|[a, b, NULL]|[a]         |
|[a, b, NULL]|[NULL]      |
|[a, b, c]   |NULL        |
|NULL        |[a]         |
|NULL        |NULL        |
|[a, b, c]   |[a, a]      |
|[a, a, a]   |[a]         |
|[a, a, a]   |[a, a, a]   |
|[a, a, a]   |[a, a, NULL]|
|[a, a, NULL]|[a, a, a]   |
|[a, a, NULL]|[a, a, NULL]|
+------------+------------+
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(['a', 'b', 'c'], ['a']),
     (['a', 'b', 'c'], ['d']),
     (['a', 'b', 'c'], ['a', 'b']),
     (['a', 'b', 'c'], ['c', 'd']),
     (['a', 'b', 'c'], ['a', 'b', 'c']),
     (['a', 'b', 'c'], ['a', None]),
     (['a', 'b', None], ['a', None]),
     (['a', 'b', None], ['a']),
     (['a', 'b', None], [None]),
     (['a', 'b', 'c'], None),
     (None, ['a']),
     (None, None),
     (['a', 'b', 'c'], ['a', 'a']),
     (['a', 'a', 'a'], ['a']),
     (['a', 'a', 'a'], ['a', 'a', 'a']),
     (['a', 'a', 'a'], ['a', 'a', None]),
     (['a', 'a', None], ['a', 'a', 'a']),
     (['a', 'a', None], ['a', 'a', None])],
    ['look_in', 'look_for'])
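
For reference, the table at the top can be reproduced from this DataFrame with:

df.show(truncate=False)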

1 Answer

forall, combined with array_contains, runs the containment check for every element of look_for.

Spark 3.1:

df = df.withColumn('check', F.forall('look_for', lambda x: F.array_contains('look_in', x)))

Spark 3.0:

df = df.withColumn('check', F.expr("forall(look_for, x -> array_contains(look_in, x))"))

Result:

+------------+------------+-----+
|     look_in|    look_for|check|
+------------+------------+-----+
|   [a, b, c]|         [a]| true|
|   [a, b, c]|         [d]|false|
|   [a, b, c]|      [a, b]| true|
|   [a, b, c]|      [c, d]|false|
|   [a, b, c]|   [a, b, c]| true|
|   [a, b, c]|   [a, null]| null|
|[a, b, null]|   [a, null]| null|
|[a, b, null]|         [a]| true|
|[a, b, null]|      [null]| null|
|   [a, b, c]|        null| null|
|        null|         [a]| null|
|        null|        null| null|
|   [a, b, c]|      [a, a]| true|
|   [a, a, a]|         [a]| true|
|   [a, a, a]|   [a, a, a]| true|
|   [a, a, a]|[a, a, null]| null|
|[a, a, null]|   [a, a, a]| true|
|[a, a, null]|[a, a, null]| null|
+------------+------------+-----+
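
The null rows come from array_contains, which returns NULL whenever the array column or the looked-up element is NULL. If a NULL element should instead count as contained when the other array also holds a NULL — an assumption about the desired semantics, not something the question settles — one possible null-safe variant for Spark 3.0+ swaps array_contains for exists with the null-safe equality operator <=>:

# treats NULL elements as equal, so look_for [a, NULL] against
# look_in [a, b, NULL] becomes true instead of null; rows where a
# whole column is NULL still yield null
df = df.withColumn('check', F.expr("forall(look_for, x -> exists(look_in, y -> y <=> x))"))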
2 Comments

Is there a way to achieve the same in PySpark 2.4?
You can try F.size(F.array_intersect('look_for', 'look_in')) == F.size('look_for'). But use it with care, as it behaves differently: for NULL inputs it always returns a result instead of NULL, and duplicates inside arrays produce different results than the script above.
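
A sketch of that Spark 2.4 workaround, using the columns from the example; the caveats above apply:

# array_intersect drops duplicates, so look_for ['a', 'a'] compares size 1
# against size 2 and returns false where the forall approach returns true;
# under the Spark 2.4 default (spark.sql.legacy.sizeOfNull=true), size(NULL)
# is -1, so NULL rows return a boolean instead of null
df = df.withColumn(
    'check',
    F.size(F.array_intersect('look_for', 'look_in')) == F.size('look_for')
)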
