
I need to extract objects from an array. Where there is more than one object in that array, I need to repeat the row for every id, and if the field is null I want to append an empty string for that record.

I will explain the method I'm trying to follow.

The dataset can come either looking like this:

{
    "id": {
        "$oid": 1234
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": null,
        "deliveryIssueIds": null
    }
}

or like this:

{
    "id": {
        "$oid": 1233
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": [
            {
                "createdOn": 1635762179390,
                "deliveryMessage": "delivered"
            },
            {
                "createdOn": 1735762179390,
                "deliveryMessage": "not door"
            }
        ],
        "deliveryIssueIds": null
    }
}

If deliveryInstructions is null, then I want the table to look like this:

id    createdOn      deliveryMessage
1234

When deliveryInstructions is not null, I want the row to repeat for that id when I create the table:

id    createdOn      deliveryMessage
1233  1635762179390  delivered
1233  1735762179390  not door
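The expansion rule above can be sketched in plain Python (stdlib only, outside Spark) to make the intent concrete. The record shapes mirror the two sample documents; `flatten` is a hypothetical helper name, not part of any library:

```python
def flatten(record):
    """Expand one order record into (id, createdOn, deliveryMessage) rows.

    A null deliveryInstructions yields a single row with empty strings;
    a list of instructions yields one row per instruction.
    """
    oid = record["id"]["$oid"]
    instructions = record["orderDelivery"]["deliveryInstructions"]
    if instructions is None:
        return [(oid, "", "")]
    return [(oid, i["createdOn"], i["deliveryMessage"]) for i in instructions]

rows = flatten({"id": {"$oid": 1234},
                "orderDelivery": {"deliveryFee": 3500,
                                  "deliveryInstructions": None,
                                  "deliveryIssueIds": None}})
# rows == [(1234, "", "")]
```

The same null-vs-list branching is what the Spark solution has to express, but per column rather than per record.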

I tried implementing it like this, but it didn't work:

from pyspark.sql import functions as F

if df["orderDelivery.deliveryInstructions"].isNull():
    df_out = df_out.withColumn("createdOn", F.lit("")).withColumn("deliveryMessage", F.lit(""))
else:
    df2 = df.select(F.explode("orderDelivery.deliveryInstructions").alias("instr"))
    df_out = df_out.withColumn("createdOn", df2["instr.createdOn"])
    df_out = df_out.withColumn("deliveryMessage", df2["instr.deliveryMessage"])

I am reading in df like this:

df = spark.read.option("multiline", "true").json(inputfile)

df_out has been defined like this. There are more fields in the JSON, but I am interested in adding deliveryInstructions to df_out both when it is null and when it is not, as explained above; the instructions are the part I find complex:

    from pyspark.sql.types import StringType

    df_out = df.select(
        df["id"]["$oid"].cast(StringType()).alias("order_id")
    )

Can anyone assist?

1 Answer
a.json
{"id":{    "$oid":1234},"orderDelivery":{    "deliveryFee":3500,    "deliveryInstructions":null,    "deliveryIssueIds":null}}
{"id":{    "$oid":1233},"orderDelivery":{    "deliveryFee":3500,    "deliveryInstructions":[        {            "createdOn":1635762179390,            "deliveryMessage":"delivered"        },        {            "createdOn":1735762179390,            "deliveryMessage":"not door"        }    ],    "deliveryIssueIds":null  }}
Select the nested elements:
from pyspark.sql import functions as F

(df
    .select(
        F.col('id.$oid').alias('id'),
        F.col('orderDelivery.deliveryFee'),
        # I used explode_outer instead of explode to keep the nulls
        F.explode_outer(F.col('orderDelivery.deliveryInstructions')).alias('deliveryInstructions'), 
        F.col('orderDelivery.deliveryIssueIds'),
    )
    .select(
        F.col('id'),
        F.col('deliveryFee'),
        F.col('deliveryInstructions.*'),
        F.col('deliveryIssueIds'),
    )
    .show(10, False)
)

+----+-----------+-------------+---------------+----------------+
|id  |deliveryFee|createdOn    |deliveryMessage|deliveryIssueIds|
+----+-----------+-------------+---------------+----------------+
|1234|3500       |null         |null           |null            |
|1233|3500       |1635762179390|delivered      |null            |
|1233|3500       |1735762179390|not door       |null            |
+----+-----------+-------------+---------------+----------------+
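One gap remains relative to the question: the null case above shows `null` rather than an empty string. Here is a plain-Python sketch of that last post-processing step (`blank_nulls` is a hypothetical helper name); in Spark itself the equivalent would be something like `F.coalesce(F.col("deliveryMessage"), F.lit(""))` per string column, which is my assumption, not part of the answer above:

```python
def blank_nulls(row):
    """Replace None with "" in each output row, mirroring what
    F.coalesce(column, F.lit("")) would do column-by-column in Spark."""
    return tuple("" if v is None else v for v in row)

# Rows shaped like the answer's output: (id, deliveryFee, createdOn, deliveryMessage, deliveryIssueIds)
rows = [
    (1234, 3500, None, None, None),
    (1233, 3500, 1635762179390, "delivered", None),
    (1233, 3500, 1735762179390, "not door", None),
]
cleaned = [blank_nulls(r) for r in rows]
# cleaned[0] == (1234, 3500, "", "", "")
```

Note that `coalesce` requires compatible types, so a numeric column such as createdOn would need a cast to string before it can fall back to `""`.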