I have a DataFrame with the following structure:
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: string (nullable = true)
Sample data that I am dealing with:
+--------+---------+-----------------------------------------------------------------------------------------------------+
|pk      |sk       |tags                                                                                                 |
+--------+---------+-----------------------------------------------------------------------------------------------------+
|123-8aab|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]|
|125-5afs|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"}]                                  |
|562-8608|4c4d-464a|[{"type": "version","value": "2"},{"type":"xyz","value": "version"}]                                 |
|793-439b|4c4d-464a|[{"type": "version","value": "2"}]                                                                   |
+--------+---------+-----------------------------------------------------------------------------------------------------+
The tags column holds JSON stored as a string, and I am struggling to parse it into the correct structure. What I have so far:
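from pyspark.sql.functions import col, from_json
# Infer a schema from the JSON strings in the tags column
tags_schema = spark.read.json(df_component.select('tags').rdd.map(lambda row: row[0])).schema
# Parse the tags string column using the inferred schema
df_component = df_component.withColumn('tags', from_json(col('tags'), tags_schema))
df_component.printSchema()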
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: struct (nullable = true)
|    |-- type: string (nullable = true)
|    |-- value: string (nullable = true)
After running the above code, most of the tags values come back as null. Sample output:
+--------+---------+------------+
|pk      |sk       |tags        |
+--------+---------+------------+
|123-8aab|464a-af2f|null        |
|125-5afs|464a-af2f|null        |
|562-8608|4c4d-464a|null        |
|793-439b|4c4d-464a|[version, 2]|
+--------+---------+------------+
Any help would be appreciated.
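One possible explanation, sketched here with only Python's standard json module rather than Spark: every tags value in the sample is a JSON array of objects, while the inferred schema printed above is a single struct (type, value). The only row that parsed is the one whose array has exactly one element, which would be consistent with that mismatch. The names samples, describe and single_element below are illustrative and not part of the original code.

```python
import json

# Sample `tags` strings copied from the table in the question.
samples = {
    "123-8aab": '[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]',
    "125-5afs": '[{"type": "version","value": "2"},{"type": "version","value": "1"}]',
    "562-8608": '[{"type": "version","value": "2"},{"type":"xyz","value": "version"}]',
    "793-439b": '[{"type": "version","value": "2"}]',
}


def describe(tags_json):
    """Return (top-level Python type name, element count) for one tags string."""
    parsed = json.loads(tags_json)
    return type(parsed).__name__, len(parsed)


shapes = {pk: describe(tags) for pk, tags in samples.items()}
for pk, (kind, count) in shapes.items():
    print(pk, kind, count)

# Rows whose array holds exactly one object: the only ones a
# single-struct schema could plausibly represent.
single_element = [pk for pk, (_, count) in shapes.items() if count == 1]
print(single_element)
```

If this is indeed the cause, wrapping the inferred struct in an array type, for example from_json(col('tags'), ArrayType(tags_schema)) with ArrayType imported from pyspark.sql.types, may yield array<struct<type,value>> instead of nulls. This is an assumption that has not been run against the actual data.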
tags_schema and also the output of spark.read.json(df_component.select('tags').rdd.map(lambda row: row)).schema?