I am trying to read a JSON file and parse the 'jsonString' field, including its nested array, into a PySpark dataframe.
Here are the contents of the JSON file:
[{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"},
{"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\"modifiedBy\":\"value14\"}","transactionId": "value15", "tableName": "X1"},
{"jsonString": "{\"uid\":\"value21\",\"adUsername\":\"value23\",\"modifiedBy\":\"value24\"}","transactionId": "value25", "tableName": "X2"}]
I am able to parse the contents of the 'jsonString' string and select the required columns using the logic below:
from pyspark.sql.functions import explode, array, get_json_object
df = spark.read.json('path.json', multiLine=True)
df = df.withColumn('courseCertifications', explode(array(get_json_object(df['jsonString'], '$.courseCertifications'))))
Now I need to parse the field "courseType" out of "courseCertifications" and create one row per instance.
I am using the logic below to get "courseType":
df = df.withColumn('new',get_json_object(df.courseCertifications, '$[*].courseType'))
I am able to get the contents of "courseType", but only as a string, as shown below:
[Row(new=u'["TRAINING","TRAINING"]')]
My end goal is to create a dataframe with the columns transactionId, jsonString.uid, jsonString.adUsername, jsonString.courseCertifications.uid, and jsonString.courseCertifications.courseType. I need to retain all rows, creating one row per array instance of courseCertifications.uid/courseCertifications.courseType.