
I want to check whether any value in this array:

list = ['dog', 'mouse', 'horse', 'bird']

appears in a PySpark dataframe column:

Text                                  isList
I like my two dogs                    True
I don't know if I want to have a cat  False
Anna sings like a bird                True
Horseland is a good place             True

I found that for multiple words people tend to use a pattern like dog|mouse|horse|bird, but I have many words and would prefer to pass them as an array. Could you help me, please?
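For reference, the alternation pattern mentioned above can be built from the array with a plain join. This pure-Python sketch (the names word_list and texts are illustrative, with sample sentences taken from the question) shows the same regex logic that rlike applies:

```python
import re

word_list = ['dog', 'mouse', 'horse', 'bird']

# Build the dog|mouse|horse|bird pattern from the list
pattern = "|".join(word_list)

# In PySpark this would be roughly:
#   df.withColumn("isList", F.lower(F.col("text")).rlike(pattern))
# Pure-Python check of the same regex logic:
texts = ["I like my two dogs", "I don't know if I want to have a cat"]
matches = [bool(re.search(pattern, t.lower())) for t in texts]
print(matches)  # [True, False]
```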

2 Answers


For Spark 3+, you can use the any aggregate function. Create a lateral array from your list and explode it, then group by the text column and apply any:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)

df1.show(truncate=False)
#+------------------------------------+------+
#|text                                |isList|
#+------------------------------------+------+
#|I like my two dogs                  |true  |
#|Anna sings like a bird              |true  |
#|I don't know if I want to have a cat|false |
#|Horseland is a good place           |true  |
#+------------------------------------+------+

The same logic with max:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)

If you want to check for exact matches instead, you can use the arrays_overlap function:

words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])

df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
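Note that exact token matching behaves differently from the rlike approaches above: arrays_overlap compares whole tokens, so "dogs" no longer matches "dog". A pure-Python sketch of the same set-overlap logic (sample sentences taken from the question):

```python
word_list = ['dog', 'mouse', 'horse', 'bird']

# arrays_overlap(split(text, " "), words) is true when the two token sets intersect
def overlaps(text):
    return bool(set(text.split(" ")) & set(word_list))

print(overlaps("I like my two dogs"))      # False: "dogs" is not an exact match for "dog"
print(overlaps("Anna sings like a bird"))  # True: token "bird" appears in both
```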



If you want to use arrays, you'll need to transform the array with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')

df2.show(20, truncate=False)
+------------------------------------+------+
|text                                |isList|
+------------------------------------+------+
|I like my two dogs                  |true  |
|I don't know if I want to have a cat|false |
|Anna sings like a bird              |true  |
|Horseland is a good place           |true  |
+------------------------------------+------+

A filter operation on the array is also possible; you then test whether the filtered array (of matching words) is non-empty:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')

If you fancy using aggregate, that's also possible:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')

Note that all three of these higher-order functions (transform, filter, aggregate) require Spark >= 2.4.
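An exists higher-order function (also Spark >= 2.4) would arguably read most directly, e.g. F.expr("exists(words, x -> lower(text) rlike x)"); it short-circuits like Python's any(). A pure-Python sketch of that logic (word_list and is_listed are illustrative names):

```python
import re

word_list = ['dog', 'mouse', 'horse', 'bird']

# exists(words, x -> lower(text) rlike x) is equivalent to any() over regex searches
def is_listed(text):
    return any(re.search(w, text.lower()) for w in word_list)

print(is_listed("Horseland is a good place"))             # True: "horseland" contains "horse"
print(is_listed("I don't know if I want to have a cat"))  # False
```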

