
I have a PySpark DataFrame with an array of structs, containing two columns (colorcode and name). I want to add a new column to the struct, newcol.

This question answered "how to add a column to a nested struct", but I'm failing to transfer it to my case, where the struct is further nested inside an array. I can't seem to reference/recreate the array-struct schema.

My schema:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)

What it should become:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- newcol: string (nullable = true)

How do I transfer the solution to my nested struct?

Reproducible code to get a df of the above schema (imports included; Dep is LongType and ABC is StringType so the types match the printed schema):

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType
)

data = [
    ("10", [{"Dep": 10, "ABC": "1"}, {"Dep": 10, "ABC": "1"}]),
    ("20", [{"Dep": 20, "ABC": "1"}, {"Dep": 20, "ABC": "1"}]),
    ("30", [{"Dep": 30, "ABC": "1"}, {"Dep": 30, "ABC": "1"}]),
    ("40", [{"Dep": 40, "ABC": "1"}, {"Dep": 40, "ABC": "1"}]),
]
myschema = StructType([
    StructField("Id", StringType(), True),
    StructField("values", ArrayType(
        StructType([
            StructField("Dep", LongType(), True),
            StructField("ABC", StringType(), True),
        ])
    )),
])
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)
  • Please add some reproducible code Commented Mar 31, 2022 at 6:41
  • what is your spark version? Commented Mar 31, 2022 at 6:49
  • pyspark 3.2.1, I am working on a reproducible code Commented Mar 31, 2022 at 6:49

2 Answers


For Spark version >= 3.1, you can use the transform function together with the withField method to achieve this.

transform applies the provided function to each element of the array (here, each struct(Dep, ABC) in the values column). withField adds or replaces a field in a StructType by name.

from pyspark.sql import functions as F

df = df.withColumn('values', F.transform('values', lambda x: x.withField('newcol', F.lit(1))))

3 Comments

try: df = df.withColumn('values', F.transform('values', lambda x: x.withField('Dep', x['Dep'].cast('int'))))
This looks neater df.withColumn("values",F.expr("transform(values, x -> struct(cast((x.Dep) as integer) as Dep, x.ABC))"))
It depends on personal habits and familiarity. At the beginning, I was used to answering questions using spark sql expressions, but I found that many people are more used to the dataframe API.

Another way of doing it is with SQL expressions.

from pyspark.sql import functions as F

df = df.withColumn("values", F.expr("transform(values, x -> struct(COALESCE('1') as newcol, x.Dep, x.ABC))"))

