I have a Spark DataFrame (df) with the columns name, id, project, start_date, and status.
When the to_json function is used in the aggregation below, the payload column ends up with datatype array<string>. How do I convert the array<string> to array<struct<project:string, start_date:date, status:string>>? This conversion is needed so the column can be read from Redshift Spectrum.
df_gp = df.groupBy(F.col('name'), F.col('id')).agg(
    F.collect_list(
        F.to_json(F.struct('project', 'start_date', 'status'))
    ).alias('payload')
)
I then followed the steps given in this documentation:
import json
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               StringType, DateType)

def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["project"], item["start_date"], item["status"])

json_schema = ArrayType(StructType([
    StructField('project', StringType(), nullable=True),
    StructField('start_date', DateType(), nullable=True),
    StructField('status', StringType(), nullable=True)
]))

udf_parse_json = F.udf(lambda s: parse_json(s), json_schema)
df_new = df_gp.select(df_gp.name, df_gp.id, udf_parse_json(df_gp.payload).alias("payload"))
# works and shows the intended schema
df_new.schema
# the following fails
df_new.show(truncate=False)
It throws this error:
TypeError: the JSON object must be str, bytes or bytearray, not 'generator'
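For what it's worth, this exact TypeError can be reproduced in plain Python, independent of Spark, whenever json.loads is handed a generator rather than a string (minimal hypothetical repro):

```python
import json

# parse_json above is a generator function, so calling it yields a generator
# object; json.loads refuses anything that is not str/bytes/bytearray.
gen = (s for s in ['{"project": "p1"}'])
try:
    json.loads(gen)
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # mirrors the message in the traceback above
print(error_message)
```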
How do I fix this?