I have a DataFrame with the following structure:

root
 |-- pk: string (nullable = true)
 |-- sk: string (nullable = true)
 |-- tags: string (nullable = true)

Sample data that I am dealing with:

+--------+---------+-----------------------------------------------------------------------------------------------------+
|pk      |sk       |tags                                                                                                 |
+--------+---------+-----------------------------------------------------------------------------------------------------+
|123-8aab|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]|
|125-5afs|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"}]                                  |
|562-8608|4c4d-464a|[{"type": "version","value": "2"},{"type":"xyz","value": "version"}]                                 |
|793-439b|4c4d-464a|[{"type": "version","value": "2"}]                                                                   |
+--------+---------+-----------------------------------------------------------------------------------------------------+

The tags column holds JSON strings, and I am struggling to parse it into the correct structure. What I have so far:

tags_schema = spark.read.json(df_component.select('tags').rdd.map(lambda row: row[0])).schema
df_component = df_component.withColumn('tags', from_json(col('tags'), tags_schema))
df_component.printSchema()

root
 |-- pk: string (nullable = true)
 |-- sk: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- value: string (nullable = true)

After running the above code, most of the values come back as null. Sample output:

+--------+---------+------------+
|pk      |sk       |tags        |
+--------+---------+------------+
|123-8aab|464a-af2f|null        |
|125-5afs|464a-af2f|null        |
|562-8608|4c4d-464a|null        |
|793-439b|4c4d-464a|[version, 2]|
+--------+---------+------------+

Any help would be appreciated.

  • Can you share the output of printing tags_schema, and also the output of spark.read.json(df_component.select('tags').rdd.map(lambda row: row)).schema? Commented Apr 14, 2021 at 2:09

1 Answer

Your schema identifies tags as a struct, but the data inside tags is an array of structs.
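
A quick way to confirm the shape outside Spark is to parse one raw value with Python's standard json module and look at the top-level type. This is only a sketch: the string is copied from the first sample row in the question, and the names sample and parsed are chosen here for illustration.

```python
import json

# One raw value of the tags column, copied from the first sample row
sample = '[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]'

parsed = json.loads(sample)

# The top level is a list of objects, which corresponds to an array of structs in Spark
print(type(parsed).__name__)     # list
print(type(parsed[0]).__name__)  # dict
print(sorted(parsed[0].keys()))  # ['type', 'value']
```

Since the top level is a list rather than a single object, the schema passed to from_json needs an ArrayType wrapper around the struct.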

Try using the following as the tags schema:

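from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# from the example you have an array of structs with each struct having type and value
# (StructType takes only the list of fields; nullability is set on StructField and ArrayType)
tags_schema = ArrayType(StructType([
    StructField("type", StringType(), True),
    StructField("value", StringType(), True),
]), True)

# parse into a new column so the original string stays available for comparison
df_component = df_component.withColumn("tagdata", from_json(col('tags'), tags_schema))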

Debugging conversion

# show original and new column
df_component.show()


1 Comment

Thanks for your reply @ggordon. I also found this blog post, which was quite helpful. It does the same thing: it defines the schema explicitly. kontext.tech/column/spark/284/…
