Query JSON data column using Spark DataFrames

Question

I have a DataFrame with the following structure:

root
 |-- pk: string (nullable = true)
 |-- sk: string (nullable = true)
 |-- tags: string (nullable = true)

Sample data that I am dealing with:

+--------+---------+-----------------------------------------------------------------------------------------------------+
|pk      |sk       |tags                                                                                                 |
+--------+---------+-----------------------------------------------------------------------------------------------------+
|123-8aab|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]|
|125-5afs|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"}]                                  |
|562-8608|4c4d-464a|[{"type": "version","value": "2"},{"type":"xyz","value": "version"}]                                 |
|793-439b|4c4d-464a|[{"type": "version","value": "2"}]                                                                   |
+--------+---------+-----------------------------------------------------------------------------------------------------+

The tags column holds JSON strings, and I am struggling to parse it correctly. This is what I have so far:

from pyspark.sql.functions import col, from_json

# Infer a schema from the JSON strings in tags, then parse the column with it
tags_schema = spark.read.json(df_component.select('tags').rdd.map(lambda row: row[0])).schema
df_component = df_component.withColumn('tags', from_json(col('tags'), tags_schema))
df_component.printSchema()

root
 |-- pk: string (nullable = true)
 |-- sk: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- value: string (nullable = true)

After running the above code, most of the values come back as null. Sample output:

+--------+---------+------------+
|pk      |sk       |tags        |
+--------+---------+------------+
|123-8aab|464a-af2f|null        |
|125-5afs|464a-af2f|null        |
|562-8608|4c4d-464a|null        |
|793-439b|4c4d-464a|[version, 2]|
+--------+---------+------------+

Any help would be appreciated.

Can you share the output of printing tags_schema, and also the output of spark.read.json(df_component.select('tags').rdd.map(lambda row: row)).schema?
– ggordon, Commented Apr 14, 2021 at 2:09

ggordon · Accepted Answer · 2021-04-14 02:07:41Z


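Your schema identifies tags as a struct, but the data inside tags is an array of structs. In your output, only the row whose array holds a single element was parsed; the rows with more than one element came back as null.

Try using the following as the tags schema: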

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# From the example, tags is an array of structs, each struct having a type and a value
tags_schema = ArrayType(StructType([
    StructField("type", StringType(), True),
    StructField("value", StringType(), True),
]), True)

# Parse into a new column so the original JSON string stays available for comparison
df_component = df_component.withColumn("tagdata", from_json(col("tags"), tags_schema))
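To confirm the shape of the data before choosing a schema, it can help to inspect a raw tags value outside Spark. The snippet below is a minimal sketch using only the standard json module; the helper name top_level_kind and the two sample strings (copied from the question) are illustrative assumptions, not part of the original code.

```python
import json

# Two raw tags values copied from the sample data in the question
raw_tags = [
    '[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]',
    '[{"type": "version","value": "2"}]',
]


def top_level_kind(json_text):
    """Return 'array', 'object', or 'other' for the top-level JSON value."""
    parsed = json.loads(json_text)
    if isinstance(parsed, list):
        return "array"
    if isinstance(parsed, dict):
        return "object"
    return "other"


# Every value is a JSON array, so the Spark schema should be ArrayType(StructType(...))
kinds = [top_level_kind(text) for text in raw_tags]
# Number of elements in each array
lengths = [len(json.loads(text)) for text in raw_tags]
print(kinds, lengths)
```

Since every value is a top-level array, a plain struct schema cannot describe it, which matches the nulls seen in the question.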

Debugging conversion

# Show the original tags column next to the parsed tagdata column, without truncating long values
df_component.show(truncate=False)


1 Comment

Gunjan Khandelwal Over a year ago

Thanks for your reply @ggordon. I also found this blog post, which was quite helpful as well. It does the same thing: it explicitly defines the schema. kontext.tech/column/spark/284/…
