I need to extract objects from an array, where there's more than one object in that array I need to repeat for every id and if the field is null then I want to append an empty string for that record.
I will explain the method I'm trying to follow.
The dataset can come either looking like this:
{
"id":{
"$oid":1234
},
"orderDelivery":{
"deliveryFee":3500,
"deliveryInstructions":null,
"deliveryIssueIds":null
}
}
OR like that:
{
"id":{
"$oid":1233
},
"orderDelivery":{
"deliveryFee":3500,
"deliveryInstructions":[
{
"createdOn":1635762179390,
"deliveryMessage":"delivered"
},
{
"createdOn":1735762179390,
"deliveryMessage":"not door"
}
],
"deliveryIssueIds":null
}
}
if the deliveryInstructions is null then I want the table to look like this:
| id | createdOn | deliveryMessage |
|---|---|---|
| 1234 |
When the deliveryInstructions is not null then I want it to repeat for that id when I create a table:
| id | createdOn | deliveryMessage |
|---|---|---|
| 1233 | 163576217939 | delivered |
| 1233 | 1735762179390 | not door |
I tried implementing it like this but I didn't win:
if df[deliveryInstructions].isnull():
df_out.withColumn("CreatedOn", "").withColumn("deliveryMessage","")
else:
df2 = df.eplode(deliveryInstructions)
df_out.withColumn("createdOn", df2[CreatedOn])
df_out.withColumn("deliveryMessage", df2[deliveryMessage])
I am reading in df like this:
df= spark.read.option("multiline", "true").json(inputfile)
df_out has been defined like this, there are more fields in json but I am interested in adding the deliveryInstructions to df_out when it is null and when is it not as explained, I just need to add instructions since I find them complex:
df_out = df.select(
df["id"]["$oid"].cast(StringType()).alias("order_id")
)
Can anyone assist?