
NOTE: I'm working in Spark 2.4.4

I have the following dataset

col1

['{"key1": "val1"}','{"key2": "val2"}']
['{"key1": "val1"}','{"key2": "val3"}']

Essentially, I'd like to filter out any row whose array has no element with key2 equal to val2, leaving:

col1

['{"key1": "val1"}','{"key2": "val2"}']
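For context, a minimal way to reproduce this sample DataFrame (the name df is assumed throughout) is something like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# col1 is an array<string> of JSON strings, matching the example above
df = spark.createDataFrame(
    [
        (['{"key1": "val1"}', '{"key2": "val2"}'],),
        (['{"key1": "val1"}', '{"key2": "val3"}'],),
    ],
    ["col1"],
)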

In Trino SQL, I do it like this:

any_match(col1, x -> json_extract_scalar(x, '$.key2') = 'val2') 

But any_match isn't available in Spark 2.4.

My only idea is to explode the array and then filter with the following code, which isn't efficient.

df.filter(F.get_json_object(F.col("col1"), '$.key2') == 'val2')
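Spelled out, the explode-based workaround would look roughly like this (a sketch; the _id helper column is hypothetical, added only so the matching rows can be joined back to the original arrays):

from pyspark.sql import functions as F

# Tag each row, explode the array, filter on the JSON elements,
# then semi-join back on the id to recover the original rows.
with_id = df.withColumn("_id", F.monotonically_increasing_id())
exploded = with_id.withColumn("elem", F.explode("col1"))
ids = exploded.filter(F.get_json_object("elem", "$.key2") == "val2").select("_id")
result = with_id.join(ids, "_id", "left_semi").drop("_id")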

I'm wondering if I can do this without exploding in my version of Spark (2.4.4).

1 Answer


For Spark >= 2.4, you can use the exists higher-order function in Spark SQL.

from pyspark.sql import functions as F

# exists() is true if any element of col1 satisfies the lambda predicate
df = df.withColumn('flag', F.expr('exists(col1, x -> get_json_object(x, "$.key2") == "val2")')) \
    .filter(F.col('flag')).drop('flag')
df.show(truncate=False)
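As a small variation (not part of the original answer), the same exists predicate can be passed straight to filter without a helper column:

from pyspark.sql import functions as F

# Same exists + get_json_object predicate, applied directly in filter()
df_filtered = df.filter(F.expr('exists(col1, x -> get_json_object(x, "$.key2") == "val2")'))
df_filtered.show(truncate=False)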

2 Comments

Awesome! Worked like a charm. Just one small suggestion: you can just use filter(F.col('flag')).
I have revised the answer based on your suggestion. Can you accept the answer? Thank you!
