I have a Spark DataFrame (df) with the columns name, id, project, start_date, and status.
When the to_json function is used in the aggregation below, the payload column ends up with datatype array<string>. How do I convert the array<string> to array<struct<project:string, start_date:date, status:string>>? This conversion is needed so the column can be read from Redshift Spectrum.
df_gp = df.groupBy(F.col('name'), F.col('id')).agg(
    F.collect_list(
        F.to_json(F.struct('project', 'start_date', 'status'))
    ).alias('payload')
)
I then followed the steps given in this documentation:
import json
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               StringType, DateType)

def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["project"], item["start_date"], item["status"])

json_schema = ArrayType(StructType([
    StructField('project', StringType(), nullable=True),
    StructField('start_date', DateType(), nullable=True),
    StructField('status', StringType(), nullable=True)
]))

udf_parse_json = F.udf(lambda s: parse_json(s), json_schema)
df_new = df_gp.select(df_gp.name, df_gp.id, udf_parse_json(df_gp.payload).alias("payload"))
# works and shows the intended schema
df_new.schema
# the following fails
df_new.show(truncate=False)
It throws this error:
TypeError: the JSON object must be str, bytes or bytearray, not 'generator'
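For what it's worth, this exact TypeError can be reproduced in plain Python, independent of Spark, whenever json.loads is handed a generator rather than a string (minimal hypothetical repro):

```python
import json

# parse_json above is a generator function, so calling it yields a generator
# object; json.loads refuses anything that is not str/bytes/bytearray.
gen = (s for s in ['{"project": "p1"}'])
try:
    json.loads(gen)
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # mirrors the message in the traceback above
print(error_message)
```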
How do I fix this?