
I want to check whether any value in this array:

list = ['dog', 'mouse', 'horse', 'bird']

appears in a PySpark dataframe column:

Text                                  isList
I like my two dogs                    True
I don't know if I want to have a cat  False
Anna sings like a bird                True
Horseland is a good place             True

I found that for multiple words people tend to use a pattern like dog|mouse|horse|bird, but I have many words and would prefer to pass them as an array. Could you help me, please?
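For reference, the alternation pattern mentioned above can be built from the array with a plain join. This pure-Python sketch (the names word_list and texts are illustrative, with sample sentences taken from the question) shows the same regex logic that rlike applies:

```python
import re

word_list = ['dog', 'mouse', 'horse', 'bird']

# Build the dog|mouse|horse|bird pattern from the list
pattern = "|".join(word_list)

# In PySpark this would be roughly:
#   df.withColumn("isList", F.lower(F.col("text")).rlike(pattern))
# Pure-Python check of the same regex logic:
texts = ["I like my two dogs", "I don't know if I want to have a cat"]
matches = [bool(re.search(pattern, t.lower())) for t in texts]
print(matches)  # [True, False]
```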

2 Answers


For Spark 3+, you can use the any aggregate function. Create a lateral array from your list and explode it, then group by the text column and apply any:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)

df1.show(truncate=False)
#+------------------------------------+------+
#|text                                |isList|
#+------------------------------------+------+
#|I like my two dogs                  |true  |
#|Anna sings like a bird              |true  |
#|I don't know if I want to have a cat|false |
#|Horseland is a good place           |true  |
#+------------------------------------+------+

The same logic with max:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)

If you want to check for exact matches instead, you can use the arrays_overlap function:

words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])

df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
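Note that exact token matching behaves differently from the rlike approaches above: arrays_overlap compares whole tokens, so "dogs" no longer matches "dog". A pure-Python sketch of the same set-overlap logic (sample sentences taken from the question):

```python
word_list = ['dog', 'mouse', 'horse', 'bird']

# arrays_overlap(split(text, " "), words) is true when the two token sets intersect
def overlaps(text):
    return bool(set(text.split(" ")) & set(word_list))

print(overlaps("I like my two dogs"))      # False: "dogs" is not an exact match for "dog"
print(overlaps("Anna sings like a bird"))  # True: token "bird" appears in both
```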



If you want to use arrays, you'll need to transform the array with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')

df2.show(20, truncate=False)
+------------------------------------+------+
|text                                |isList|
+------------------------------------+------+
|I like my two dogs                  |true  |
|I don't know if I want to have a cat|false |
|Anna sings like a bird              |true  |
|Horseland is a good place           |true  |
+------------------------------------+------+

A filter operation on the array is also possible; you then test whether the filtered array (of matching words) is non-empty:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')

If you fancy using aggregate, that's also possible:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')

Note that all three of these higher-order functions (transform, filter, aggregate) require Spark >= 2.4.
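An exists higher-order function (also Spark >= 2.4) would arguably read most directly, e.g. F.expr("exists(words, x -> lower(text) rlike x)"); it short-circuits like Python's any(). A pure-Python sketch of that logic (word_list and is_listed are illustrative names):

```python
import re

word_list = ['dog', 'mouse', 'horse', 'bird']

# exists(words, x -> lower(text) rlike x) is equivalent to any() over regex searches
def is_listed(text):
    return any(re.search(w, text.lower()) for w in word_list)

print(is_listed("Horseland is a good place"))             # True: "horseland" contains "horse"
print(is_listed("I don't know if I want to have a cat"))  # False
```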

