I would like to parse and get the value of specific key from the PySpark SQL dataframe with the below format
I could able to achieve this with UDF but it takes almost 20 mins to process 40 columns with the JSON size of 100MB. Tried explode as well but it gives seperate rows for each array element. but i need only the specific value of the key in a given array of struct.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get a specific key values
def getValueFunc(searcharray, searchkey):
for val in searcharray:
if val["key"] == searchkey:
if val["value"]["string_value"] is not None:
actual = val["value"]["string_value"]
return actual
elif val["value"]["int_value"] is not None:
actual = val["value"]["int_value"]
return str(actual)
else:
return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the name rank udf template
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))