1

I have some streaming data, that can minimally be reduced like so:

{
    "data":[
        {
            "key":1,
            "val":"a"
        },
        {
            "key":2,
            "val":"b",
            "test":"bla"
        }
    ]
}

from which I need to access the "data" array which is a string of JSON format. And more specifically I need to find the "val" field in the JSON in it where "key"==2.

So far I have tried:

  • I know that I can access it like this:

    F.get_json_object(...,"$.data[1].val")
    

    but then if the JSON changes the order of objects in the data array, it will no longer work.

  • For JSON I could use:

    F.get_json_object(...,"$.data[?(@.key==2)].val")
    

    but this does not seem to work on Databricks.

  • I tried to dynamically create a struct from JSON string. But "Queries with streaming sources must be executed with writeStream.start()". But I do not want to write the stream anywhere jet since I am still at the preprocessing. Or how could I maybe work around this?

  • I tried to only define the Struct for the array as shown here, but since the elements in the array have varying structure, this does not work.

  • I tried to write a user defined function to access the data object and containing a JSON string which I would then parse like so:

    
    def parse_json(id,idName,keyName,jsonString):
      from json import loads
      data=loads(jsonString)
      res=[d[keyName] for d in data if d[idName]==id]
      return res[0]
    

    and tried to call it with jsonString=F.col("data") where "data" holds the string. But this gives me errors, saying it does not find the attribute I put into the id field.

2
  • Could you elaborate on "varying structure"? how much varying, do you always expect at least key and val exists inside data? Commented Aug 20, 2024 at 14:18
  • @Emma sure! there will always be a key field and my understanding is, that there will always be a val field in which i am interested. There might be a varying number (depending on the key; i.e. for each key there is a different structure) of other fields, in which - for the purpose of the question - i am not interested. however if an aswer is generic and could handle grabing these aswell it would be even better! Commented Aug 20, 2024 at 14:34

2 Answers 2

1

One approach is to convert the stringified JSON to array of struct type then get the value you want.

Even though structure varies, if you have some schema that is somewhat stable, you can build a schema that includes what you want to decompose.

For example, below schema will parse test field if exists, and you will get NULL when it doesn't exist (when key = 1). Also, if you are not interested in test field, you can omit the StructField and test is ignored.

schema = StructType([
    StructField('data', ArrayType(StructType([
        StructField("key", IntegerType()),
        StructField("val", StringType()),
        StructField("test", StringType()),
        # add more field that you are interested in
    ])))
])

Use this schema in from_json then extract the field you want.

df = (df.withColumn('data', F.from_json('data', schema))
      .withColumn('data', F.filter(F.col('data').data, lambda x: x.key == 2)[0].val))

If there are no key = 2, you will get NULL without getting a hard crash.

Sign up to request clarification or add additional context in comments.

2 Comments

This was just the answer I was looking for. What exactly happens, if my schema declares elements, that are not present? These elements the have value None?
yup, that's right.
1

I have tried the below approch:

import json
def get_val_for_key(json_str, key_val):
    data = json.loads(json_str)
    for item in data['data']:
        if item.get('key') == key_val:
            return item.get('val')
    return None
get_val_for_key_udf = udf(get_val_for_key, StringType())
result_df = df.withColumn("desired_val", get_val_for_key_udf(F.col("json_column"), lit(2)))
display(result_df)

Results:

json_column desired_val
{"data":[{"key":1,"val":"a"},{"key":2,"val":"b","test":"bla"}]} b

In the above code parsing the JSON string, dynamically searching for the "val" where "key" == 2, and handling different JSON structures or key orders.

1 Comment

I will wait with accepting for a couple days in case there would be a more native approach... i was hoping there would be. otherwise your approach works perfectly fine!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.