
Schema of the dataframe:

  root
    |-- parentColumn: array
    |    |-- element: struct
    |    |    |-- colA: string
    |    |    |-- colB: string
    |    |    |-- colTimestamp: string

The values inside the dataframe look like this:

"parentColumn": [
        {
            "colA": "TestA",
            "colB": "TestB",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        },
        {
            "colA": "TestA",
            "colB": "TestB",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        }
    ]

df.withColumn("parentColumn", ?)

Here I want to convert every colTimestamp inside the array to UTC format. I've seen many examples of updating values inside an array, but I can't find a way to update a dict inside an array.

  • Can you share a sample of what your actual df looks like? The sample code you've provided generates 3 distinct columns. Commented Sep 1, 2022 at 10:28
  • Hi @samkart, I have updated the schema of the dataframe and how the data looks inside it. We are reading JSON from a file and creating the dataframe, so I wasn't sure how to replicate the same. Commented Sep 1, 2022 at 10:48

2 Answers


If you're on Spark 3.1+, you can use the transform function with withField within a lambda function.

# use the legacy parser, which tolerates the trailing fractional seconds in the input strings
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')
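
For reference, func here is the usual pyspark.sql.functions import, and a data_sdf matching the output shown below might be created roughly like this (a sketch; the asker builds theirs from a JSON file):

from pyspark.sql import functions as func

# sample data matching the question (the real dataframe is read from a JSON file)
data_sdf = spark.createDataFrame(
    [(1, [('test', 'testB', '2020-08-17T03:28:44.986000'),
          ('UNREAD', 'USER', '2020-08-17T03:28:44.986000')])],
    'id int, parent_col array<struct<col_a: string, col_b: string, col_ts: string>>'
)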

data_sdf. \
    withColumn('parent_col_new', 
               func.transform('parent_col', 
                              lambda x: x.withField('col_ts', 
                                                    func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss")
                                                    )
                              )
               ). \
    show(truncate=False)

# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
# |id |parent_col                                                                             |parent_col_new                                                           |
# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
# |1  |[{test, testB, 2020-08-17T03:28:44.986000}, {UNREAD, USER, 2020-08-17T03:28:44.986000}]|[{test, testB, 2020-08-17 03:28:44}, {UNREAD, USER, 2020-08-17 03:28:44}]|
# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+

# root
#  |-- id: integer (nullable = false)
#  |-- parent_col: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- col_a: string (nullable = true)
#  |    |    |-- col_b: string (nullable = true)
#  |    |    |-- col_ts: string (nullable = true)
#  |-- parent_col_new: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- col_a: string (nullable = true)
#  |    |    |-- col_b: string (nullable = true)
#  |    |    |-- col_ts: timestamp (nullable = true)

If withField and/or transform isn't available in your Spark version, you can use expr and recreate the struct. It'll produce the same output.

data_sdf. \
    withColumn('parent_col_new',
               func.expr('''
                         transform(parent_col, 
                                   x -> struct(x.col_a as col_a, 
                                               x.col_b as col_b, 
                                               to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss") as col_ts
                                               )
                                   )
                         ''')
               ). \
    show(truncate=False)
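
Note that to_timestamp only parses the string; if the goal is an actual shift to UTC (as the question asks), to_utc_timestamp can be chained onto the parse. A sketch, assuming the source data is in GMT+2 as in the other answer:

data_sdf. \
    withColumn('parent_col_utc',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    func.to_utc_timestamp(
                                                        func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss"),
                                                        'GMT+2'  # assumed source zone; replace with yours
                                                        )
                                                    )
                              )
               ). \
    show(truncate=False)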

2 Comments

Hi @samkart, in the first example, if we want to update the other two parameters as well, how can we do it? For example, if I want to trim the other two columns, how can that be done?
@Yadav - you can chain withField just like withColumn; see the sketch below.
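
A sketch of that chaining on the same data_sdf, trimming the other two fields while reformatting the timestamp:

data_sdf. \
    withColumn('parent_col_new',
               func.transform('parent_col',
                              lambda x: x.withField('col_a', func.trim(x.col_a)).
                                          withField('col_b', func.trim(x.col_b)).
                                          withField('col_ts', func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss"))
                              )
               )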

You can use the transform function to apply a function to each element of the array. Then, since you don't have many fields, you can recreate the struct like this (Scala):

import org.apache.spark.sql.functions._
import spark.implicits._  // enables the 'parentColumn symbol syntax

df.withColumn("parentColumn", transform('parentColumn, x => struct(
    x.getField("colA") as "colA",
    x.getField("colB") as "colB",
    to_utc_timestamp(x.getField("colTimestamp"), "GMT+2") as "colTimestamp"
)))

2 Comments

I'm new to Spark; can you explain why there is an open single quote before parentColumn?
It simply means the column parentColumn. You can use this notation the same way you use col("parentColumn") or $"parentColumn".
