
I have a dataframe with the following schema using pyspark:

|-- suborders: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- trackingStatusHistory: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- trackingStatusUpdatedAt: string (nullable = true)
 |    |    |    |    |-- trackingStatus: string (nullable = true)

What I want to do is create a new deliveredAt field inside each element of the suborders array, based on the following conditions.

I need to find the entry in the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If such an entry exists, the new deliveredAt field should receive the date from trackingStatusHistory.trackingStatusUpdatedAt; if it doesn't exist, it should receive null.
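For reference, a minimal DataFrame matching this schema can be built like the sketch below (the sample values and timestamps are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

schema = (
    "suborders array<struct<"
    "trackingStatusHistory: array<struct<"
    "trackingStatusUpdatedAt: string, trackingStatus: string>>>>"
)

# one row: the first suborder reaches 'delivered', the second one does not
data = [
    (
        [
            ([("2021-03-01T10:00:00", "shipped"),
              ("2021-03-03T15:30:00", "delivered")],),
            ([("2021-03-02T09:00:00", "shipped")],),
        ],
    )
]

df = spark.createDataFrame(data, schema)
df.printSchema()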

How can I do this using pyspark?

1 Answer


You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the trackingStatusHistory sub-array and taking the delivery date, like this:

import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    # for each suborder struct, keep the trackingStatusUpdatedAt of the first
    # history entry whose trackingStatus is 'delivered'; indexing with [0]
    # yields null when the filter finds no match
    F.expr("""transform(
                suborders,
                x -> struct(
                        filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                        x.trackingStatusHistory as trackingStatusHistory
                        )
                )
    """)
)
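To inspect the result, you can flatten the array afterwards; something along these lines (just an illustrative check):

(df
 .select(F.explode("suborders").alias("s"))
 .select("s.deliveredAt")
 .show(truncate=False))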

3 Comments

Can you help me understand why you need to use this "expr" function and the triple quotes? I tried to look it up on the internet, but I didn't find anything like it. Also, in the question I omitted several fields inside suborders, and they all disappeared with the transform. Is there an easy way to keep all the fields?
@Christian the triple quotes are a "multi-line string"; they're just used to write a string across multiple lines, as above. expr is used because before Spark 3.1, functions like transform and filter were not available in the DataFrame API. But if you have Spark 3.1+, you can use transform and filter with Python lambda functions (see the sketch after these comments).
@Christian for your second question: to keep the fields, you need to recreate the whole inner struct and specify each field you want to keep. You can see my other posts: here or here
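As a rough sketch of the Spark 3.1+ approach mentioned above (assuming the same schema as the question; Column.withField keeps the existing fields of each suborder struct and only adds deliveredAt):

import pyspark.sql.functions as F

# Spark 3.1+: transform/filter as Python lambdas; withField preserves
# the other fields of each suborder struct while adding deliveredAt
df = df.withColumn(
    "suborders",
    F.transform(
        "suborders",
        lambda x: x.withField(
            "deliveredAt",
            F.filter(
                x["trackingStatusHistory"],
                lambda y: y["trackingStatus"] == "delivered",
            )[0]["trackingStatusUpdatedAt"],
        ),
    ),
)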
