
I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.

The schema looks like this:

root
 |-- name: string (nullable = true)
 |-- lastName: array (nullable = true)
 |    |-- element: string (containsNull = false)

I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', with that comparison also being case-insensitive (like I did for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Anyone have any ideas for a straightforward way to do this?
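For reference, here's a tiny example of building a DataFrame with this shape (illustrative data only, not my real dataset):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative rows only; the real data has the schema shown above.
dataframe = spark.createDataFrame(
    [("John", ["Smith", "Jones"]), ("Jane", ["Doe"])],
    ["name", "lastName"],
)
dataframe.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- lastName: array (nullable = true)
#  |    |-- element: string (containsNull = true)   <- containsNull may differ from my real schema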

2 Answers


You could consider working on the underlying RDD directly.

def my_filter(row):
    # Emit the row (once) if the name matches and any element of
    # lastName matches 'SMITH', both case-insensitively.
    if row.name.upper() == 'JOHN':
        if any(it.upper() == 'SMITH' for it in (row.lastName or [])):
            yield row

dataframe = dataframe.rdd.flatMap(my_filter).toDF()
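For example, on a small test DataFrame shaped like the one in the question (values assumed for illustration), the filtered result would look roughly like this:

dataframe.show()
# +----+--------------+
# |name|      lastName|
# +----+--------------+
# |John|[Smith, Jones]|
# +----+--------------+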



An update in 2019

Spark 2.4.0 introduced higher-order functions such as transform (see the official documentation), so combined with array_contains this can now be done in SQL.

For your problem, it should be

dataframe.filter('upper(name) = "JOHN" AND array_contains(transform(lastName, x -> upper(x)), "SMITH")')

It is better than the previous solution, which uses the RDD as a bridge, because DataFrame operations are much faster than RDD ones.
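If you'd rather stay in the DataFrame API than pass a SQL string, the same condition can go through expr (a sketch; the column names are taken from the question, and expr is used because transform has no Python function wrapper until Spark 3.1):

from pyspark.sql import functions as F

result = dataframe.filter(
    (F.upper(F.col("name")) == "JOHN")
    & F.expr('array_contains(transform(lastName, x -> upper(x)), "SMITH")')
)
result.show()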

