I have some streaming data that can be minimally reduced to:
{
  "data": [
    {
      "key": 1,
      "val": "a"
    },
    {
      "key": 2,
      "val": "b",
      "test": "bla"
    }
  ]
}
From this I need to access the "data" array, which arrives as a JSON-formatted string. More specifically, I need to find the "val" field of the element in that array where "key" == 2.
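To make the goal concrete, this is roughly what the lookup would look like in plain Python, outside of Spark. It is only a sketch: `raw` and `find_val` are names made up for illustration, and it assumes the whole message is available as a single JSON string.

```python
import json

# Assumed input: the whole message as one JSON string (same content as the sample above).
raw = '{"data":[{"key":1,"val":"a"},{"key":2,"val":"b","test":"bla"}]}'


def find_val(raw_json, wanted_key):
    """Return "val" of the first element of "data" whose "key" matches, else None."""
    for element in json.loads(raw_json)["data"]:
        # .get() tolerates elements that lack a field, since the structure varies.
        if element.get("key") == wanted_key:
            return element.get("val")
    return None


print(find_val(raw, 2))  # prints: b
```

Because the match is done on the value of "key" and not on the position in the array, the order of the elements does not matter.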
So far I have tried:
I know that I can access it like this: F.get_json_object(..., "$.data[1].val"), but if the order of the objects in the data array changes, this will no longer work.
For JSON I could use: F.get_json_object(..., "$.data[?(@.key==2)].val"), but this does not seem to work on Databricks.
I tried to dynamically create a struct from the JSON string, but this fails with "Queries with streaming sources must be executed with writeStream.start()". I do not want to write the stream anywhere yet, since I am still at the preprocessing stage. Or how could I work around this?
I tried to only define the Struct for the array as shown here, but since the elements in the array have varying structure, this does not work.
I tried to write a user-defined function that accesses the data object containing a JSON string, which I would then parse like so:
    def parse_json(id, idName, keyName, jsonString):
        from json import loads
        data = loads(jsonString)
        res = [d[keyName] for d in data if d[idName] == id]
        return res[0]
and tried to call it with jsonString=F.col("data"), where "data" holds the string. But this gives me errors saying it does not find the attribute I put into the id field.
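One possible cause of that error, and this is a guess, not something confirmed here: a function wrapped with F.udf takes columns as arguments, and a plain string passed to it is interpreted as a column name. Below is a minimal sketch of wrapping the function and passing the constants via F.lit. It assumes the "data" column holds the array itself as a JSON string and that "val" is a string; `parse_json_udf` and `df` are hypothetical names.

```python
import json


def parse_json(id_value, id_name, key_name, json_string):
    """Return key_name from the first object whose id_name equals id_value, else None."""
    if json_string is None:
        return None
    for d in json.loads(json_string):
        # .get() avoids a KeyError when an element lacks one of the fields.
        if d.get(id_name) == id_value:
            return d.get(key_name)
    return None


try:
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the plain function; StringType assumes "val" is always a string.
    parse_json_udf = F.udf(parse_json, StringType())
except ImportError:
    parse_json_udf = None  # PySpark not installed; the plain function still works.

# Hypothetical usage on a DataFrame `df`; constants are wrapped in F.lit:
# df.withColumn("val", parse_json_udf(F.lit(2), F.lit("key"), F.lit("val"), F.col("data")))
```

Returning None when nothing matches, instead of indexing res[0], keeps a single message without the wanted key from failing the whole query.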
key and val exist inside data? key field, and my understanding is that there will always be a val field, in which I am interested. There might be a varying number of other fields (depending on the key, i.e. for each key there is a different structure), in which, for the purpose of this question, I am not interested. However, if an answer is generic and could handle grabbing these as well, it would be even better!