
I have a dataframe with the following schema using pyspark:

|-- suborders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- trackingStatusHistory: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- trackingStatusUpdatedAt: string (nullable = true)
 |    |    |    |    |-- trackingStatus: string (nullable = true)

What I want to do is create a new deliveredAt field inside each element of the suborders array, based on the following conditions.

I need to find the entry in the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If such an entry exists, the new deliveredAt field should receive the date from trackingStatusHistory.trackingStatusUpdatedAt; if it doesn't exist, it should receive null.
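For reference, a minimal DataFrame matching this schema can be built like the sketch below (the sample values and timestamps are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schema = (
    "suborders array<struct<"
    "trackingStatusHistory: array<struct<"
    "trackingStatusUpdatedAt: string, trackingStatus: string>>>>"
)

# one row: the first suborder reaches 'delivered', the second one does not
data = [
    (
        [
            ([("2021-03-01T10:00:00", "shipped"),
              ("2021-03-03T15:30:00", "delivered")],),
            ([("2021-03-02T09:00:00", "shipped")],),
        ],
    )
]

df = spark.createDataFrame(data, schema)
df.printSchema()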

How can I do this using pyspark?

1 Answer


You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the trackingStatusHistory sub-array and taking the delivery date, like this:

import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    # for each suborder struct, keep the trackingStatusUpdatedAt of the first
    # history entry whose trackingStatus is 'delivered'; indexing with [0]
    # yields null when the filter finds no match
    F.expr("""transform(
                suborders,
                x -> struct(
                        filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                        x.trackingStatusHistory as trackingStatusHistory
                        )
                )
    """)
)
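To inspect the result, you can flatten the array afterwards; something along these lines (just an illustrative check):

(df
 .select(F.explode("suborders").alias("s"))
 .select("s.deliveredAt")
 .show(truncate=False))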

3 Comments

Can you help me understand why you need to use this "expr" function and the triple quotes? I tried to look it up on the internet, but I didn't find anything like it. Also, in the question I omitted several fields inside suborders, and they all disappeared with the transform. Is there an easy way to keep all the fields?
@Christian the triple quotes are a "multi-line string"; they're just used to write a string across multiple lines, as above. expr is used because before Spark 3.1, functions like transform and filter were not available in the DataFrame API. But if you have Spark 3.1+, you can use transform and filter with Python lambda functions (see the sketch after these comments).
@Christian for your second question: to keep the fields, you need to recreate the whole inner struct and specify each field you want to keep. You can see my other posts: here or here
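As a rough sketch of the Spark 3.1+ approach mentioned above (assuming the same schema as the question; Column.withField keeps the existing fields of each suborder struct and only adds deliveredAt):

import pyspark.sql.functions as F

# Spark 3.1+: transform/filter as Python lambdas; withField preserves
# the other fields of each suborder struct while adding deliveredAt
df = df.withColumn(
    "suborders",
    F.transform(
        "suborders",
        lambda x: x.withField(
            "deliveredAt",
            F.filter(
                x["trackingStatusHistory"],
                lambda y: y["trackingStatus"] == "delivered",
            )[0]["trackingStatusUpdatedAt"],
        ),
    ),
)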
