
NOTE: I'm working in Spark 2.4.4

I have the following dataset

col1

['{"key1": "val1"}','{"key2": "val2"}']
['{"key1": "val1"}','{"key2": "val3"}']

Essentially, I'd like to filter out any row whose array has no element with key2 equal to val2, leaving:

col1

['{"key1": "val1"}','{"key2": "val2"}']
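For context, a minimal way to reproduce this sample DataFrame (the name df is assumed throughout) is something like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# col1 is an array<string> of JSON strings, matching the example above
df = spark.createDataFrame(
    [
        (['{"key1": "val1"}', '{"key2": "val2"}'],),
        (['{"key1": "val1"}', '{"key2": "val3"}'],),
    ],
    ["col1"],
)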

In Trino SQL, I do it like this:

any_match(col1, x -> json_extract_scalar(x, '$.key2') = 'val2') 

But any_match isn't available in Spark 2.4.

My only idea is to explode the array and then filter with the following code, which isn't efficient.

df.filter(F.get_json_object(F.col("col1"), '$.key2') == 'val2')
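Spelled out, the explode-based workaround would look roughly like this (a sketch; the _id helper column is hypothetical, added only so the matching rows can be joined back to the original arrays):

from pyspark.sql import functions as F

# Tag each row, explode the array, filter on the JSON elements,
# then semi-join back on the id to recover the original rows.
with_id = df.withColumn("_id", F.monotonically_increasing_id())
exploded = with_id.withColumn("elem", F.explode("col1"))
ids = exploded.filter(F.get_json_object("elem", "$.key2") == "val2").select("_id")
result = with_id.join(ids, "_id", "left_semi").drop("_id")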

I'm wondering if I can do this without exploding in my version of Spark (2.4.4).

1 Answer


For Spark >= 2.4, you can use the exists higher-order function in Spark SQL.

from pyspark.sql import functions as F

# exists() is true if any element of col1 satisfies the lambda predicate
df = df.withColumn('flag', F.expr('exists(col1, x -> get_json_object(x, "$.key2") == "val2")')) \
    .filter(F.col('flag')).drop('flag')
df.show(truncate=False)
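As a small variation (not part of the original answer), the same exists predicate can be passed straight to filter without a helper column:

from pyspark.sql import functions as F

# Same exists + get_json_object predicate, applied directly in filter()
df_filtered = df.filter(F.expr('exists(col1, x -> get_json_object(x, "$.key2") == "val2")'))
df_filtered.show(truncate=False)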

2 Comments

Awesome! Worked like a charm. Just one small suggestion: you can just use filter(F.col('flag')).
I have revised the answer based on your suggestion. Can you accept the answer? Thank you!
