
I need to extract objects from an array. Where there is more than one object in that array, I need to repeat the row for every id, and if the field is null I want to append an empty string for that record.

I will explain the method I'm trying to follow.

The dataset can come either looking like this:

{
    "id": {
        "$oid": 1234
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": null,
        "deliveryIssueIds": null
    }
}

or like this:

{
    "id": {
        "$oid": 1233
    },
    "orderDelivery": {
        "deliveryFee": 3500,
        "deliveryInstructions": [
            {
                "createdOn": 1635762179390,
                "deliveryMessage": "delivered"
            },
            {
                "createdOn": 1735762179390,
                "deliveryMessage": "not door"
            }
        ],
        "deliveryIssueIds": null
    }
}

If deliveryInstructions is null, then I want the table to look like this:

id    createdOn      deliveryMessage
1234

When deliveryInstructions is not null, I want the row to repeat for that id when I create the table:

id    createdOn      deliveryMessage
1233  1635762179390  delivered
1233  1735762179390  not door
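The expansion rule above can be sketched in plain Python (stdlib only, outside Spark) to make the intent concrete. The record shapes mirror the two sample documents; `flatten` is a hypothetical helper name, not part of any library:

```python
def flatten(record):
    """Expand one order record into (id, createdOn, deliveryMessage) rows.

    A null deliveryInstructions yields a single row with empty strings;
    a list of instructions yields one row per instruction.
    """
    oid = record["id"]["$oid"]
    instructions = record["orderDelivery"]["deliveryInstructions"]
    if instructions is None:
        return [(oid, "", "")]
    return [(oid, i["createdOn"], i["deliveryMessage"]) for i in instructions]

rows = flatten({"id": {"$oid": 1234},
                "orderDelivery": {"deliveryFee": 3500,
                                  "deliveryInstructions": None,
                                  "deliveryIssueIds": None}})
# rows == [(1234, "", "")]
```

The same null-vs-list branching is what the Spark solution has to express, but per column rather than per record.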

I tried implementing it like this, but it didn't work:

from pyspark.sql import functions as F

if df["orderDelivery.deliveryInstructions"].isNull():
    df_out = df_out.withColumn("createdOn", F.lit("")).withColumn("deliveryMessage", F.lit(""))
else:
    df2 = df.select(F.explode("orderDelivery.deliveryInstructions").alias("instr"))
    df_out = df_out.withColumn("createdOn", df2["instr.createdOn"])
    df_out = df_out.withColumn("deliveryMessage", df2["instr.deliveryMessage"])

I am reading in df like this:

df = spark.read.option("multiline", "true").json(inputfile)

df_out has been defined like this. There are more fields in the JSON, but I am interested in adding deliveryInstructions to df_out both when it is null and when it is not, as explained above; the instructions are the part I find complex:

    from pyspark.sql.types import StringType

    df_out = df.select(
        df["id"]["$oid"].cast(StringType()).alias("order_id")
    )

Can anyone assist?

1 Answer
a.json
{"id":{    "$oid":1234},"orderDelivery":{    "deliveryFee":3500,    "deliveryInstructions":null,    "deliveryIssueIds":null}}
{"id":{    "$oid":1233},"orderDelivery":{    "deliveryFee":3500,    "deliveryInstructions":[        {            "createdOn":1635762179390,            "deliveryMessage":"delivered"        },        {            "createdOn":1735762179390,            "deliveryMessage":"not door"        }    ],    "deliveryIssueIds":null  }}
Select the nested elements:
from pyspark.sql import functions as F

(df
    .select(
        F.col('id.$oid').alias('id'),
        F.col('orderDelivery.deliveryFee'),
        # I used explode_outer instead of explode to keep the nulls
        F.explode_outer(F.col('orderDelivery.deliveryInstructions')).alias('deliveryInstructions'), 
        F.col('orderDelivery.deliveryIssueIds'),
    )
    .select(
        F.col('id'),
        F.col('deliveryFee'),
        F.col('deliveryInstructions.*'),
        F.col('deliveryIssueIds'),
    )
    .show(10, False)
)

+----+-----------+-------------+---------------+----------------+
|id  |deliveryFee|createdOn    |deliveryMessage|deliveryIssueIds|
+----+-----------+-------------+---------------+----------------+
|1234|3500       |null         |null           |null            |
|1233|3500       |1635762179390|delivered      |null            |
|1233|3500       |1735762179390|not door       |null            |
+----+-----------+-------------+---------------+----------------+
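One gap remains relative to the question: the null case above shows `null` rather than an empty string. Here is a plain-Python sketch of that last post-processing step (`blank_nulls` is a hypothetical helper name); in Spark itself the equivalent would be something like `F.coalesce(F.col("deliveryMessage"), F.lit(""))` per string column, which is my assumption, not part of the answer above:

```python
def blank_nulls(row):
    """Replace None with "" in each output row, mirroring what
    F.coalesce(column, F.lit("")) would do column-by-column in Spark."""
    return tuple("" if v is None else v for v in row)

# Rows shaped like the answer's output: (id, deliveryFee, createdOn, deliveryMessage, deliveryIssueIds)
rows = [
    (1234, 3500, None, None, None),
    (1233, 3500, 1635762179390, "delivered", None),
    (1233, 3500, 1735762179390, "not door", None),
]
cleaned = [blank_nulls(r) for r in rows]
# cleaned[0] == (1234, 3500, "", "", "")
```

Note that `coalesce` requires compatible types, so a numeric column such as createdOn would need a cast to string before it can fall back to `""`.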