9

I have data as follows:

{
    "Id": "01d3050e",
    "Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
    "LastUpdated": 1581530000000,
    "LastUpdatedBy": "System"
}

Using AWS Glue, I want to relationalize the "Properties" column, but since its datatype is string this can't be done directly. Converting it to a struct might work, based on this blog post:

https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/

>>> df.show
<bound method DataFrame.show of DataFrame[Id: string, LastUpdated: bigint, LastUpdatedBy: string, Properties: string]>
>>> df.show()
+--------+-------------+-------------+--------------------+
|      Id|  LastUpdated|LastUpdatedBy|          Properties|
+--------+-------------+-------------+--------------------+
|01d3050e|1581530000000|       System|{"choices":null,"...|
+--------+-------------+-------------+--------------------+

How can I un-nest the "Properties" column to break it into "choices", "object", "database" and "timestamp" columns, using the Relationalize transform or any UDF in PySpark?

2
  • Have you tried docs.aws.amazon.com/glue/latest/dg/… ? Commented Feb 19, 2020 at 5:48
  • Tried it, but it doesn't seem to help. `sparkdf.printSchema()` gives `root |-- Id: string |-- LastUpdated: long |-- LastUpdatedBy: string |-- Properties: string`; after `sdfc = UnnestFrame.apply(frame=sparkdf)`, `sdfc.show()` still prints `{"Id": "01d3050e", "LastUpdated": 1581530000000, "LastUpdatedBy": "System", "Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"demodb\",\"timestamp\":\"1581534117303\"}"}` and `sdfc.printSchema()` shows the same schema, with Properties still a string. Commented Feb 19, 2020 at 6:07

3 Answers

13

Use from_json since the column Properties is a JSON string.

If the schema is the same for all your records, you can convert it to a struct type by defining the schema like this:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("choices", StringType(), True),
                     StructField("object", StringType(), True),
                     StructField("database", StringType(), True),
                     StructField("timestamp", StringType(), True)])

df.withColumn("Properties", from_json(col("Properties"), schema)).show(truncate=False)

#+--------+-------------+-------------+---------------------------+
#|Id      |LastUpdated  |LastUpdatedBy|Properties                 |
#+--------+-------------+-------------+---------------------------+
#|01d3050e|1581530000000|System       |[, demo, pg, 1581534117303]|
#+--------+-------------+-------------+---------------------------+
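
To get the separate columns the question asks for, you can then expand the struct with a select on Properties.* (a minimal sketch, reusing the schema defined above):

df.withColumn("Properties", from_json(col("Properties"), schema)) \
  .select("Id", "LastUpdated", "LastUpdatedBy", "Properties.*") \
  .show(truncate=False)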

However, if the schema can change from one row to another, I'd suggest converting it to a MapType instead:

df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))).show(truncate=False)

#+--------+-------------+-------------+------------------------------------------------------------------------+
#|Id      |LastUpdated  |LastUpdatedBy|Properties                                                              |
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|01d3050e|1581530000000|System       |[choices ->, object -> demo, database -> pg, timestamp -> 1581534117303]|
#+--------+-------------+-------------+------------------------------------------------------------------------+

You can then access elements of the map using element_at (Spark 2.4+).
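
For example, a short sketch that pulls the map entries out into their own columns (the column aliases are mine, not part of the original answer):

from pyspark.sql.functions import element_at

df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))) \
  .select("Id", "LastUpdated", "LastUpdatedBy",
          element_at(col("Properties"), "choices").alias("choices"),
          element_at(col("Properties"), "object").alias("object"),
          element_at(col("Properties"), "database").alias("database"),
          element_at(col("Properties"), "timestamp").alias("timestamp")) \
  .show(truncate=False)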



2

Creating your dataframe:

from pyspark.sql import functions as F

data = [["01d3050e",
         "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
         1581530000000,
         "System"]]
df = spark.createDataFrame(data, ['Id', 'Properties', 'LastUpdated', 'LastUpdatedBy'])
df.show(truncate=False)

+--------+----------------------------------------------------------------------------+-------------+-------------+
|Id      |Properties                                                                  |LastUpdated  |LastUpdatedBy|
+--------+----------------------------------------------------------------------------+-------------+-------------+
|01d3050e|{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}|1581530000000|System       |
+--------+----------------------------------------------------------------------------+-------------+-------------+

Use inbuilt regex, split, and element_at:

There is no need for a UDF; the built-in functions are adequate and well optimized for big data tasks.

df.withColumn("Properties", F.split(F.regexp_replace(F.regexp_replace((F.regexp_replace("Properties",'\{|}',"")),'\:',','),'\"|"',"").cast("string"),','))\
.withColumn("choices", F.element_at("Properties",2))\
.withColumn("object", F.element_at("Properties",4))\
.withColumn("database",F.element_at("Properties",6))\
.withColumn("timestamp",F.element_at("Properties",8).cast('long')).drop("Properties").show()


+--------+-------------+-------------+-------+------+--------+-------------+
|      Id|  LastUpdated|LastUpdatedBy|choices|object|database|    timestamp|
+--------+-------------+-------------+-------+------+--------+-------------+
|01d3050e|1581530000000|       System|   null|  demo|      pg|1581534117303|
+--------+-------------+-------------+-------+------+--------+-------------+


root
 |-- Id: string (nullable = true)
 |-- LastUpdated: long (nullable = true)
 |-- LastUpdatedBy: string (nullable = true)
 |-- choices: string (nullable = true)
 |-- object: string (nullable = true)
 |-- database: string (nullable = true)
 |-- timestamp: long (nullable = true)


1

Since I was using the AWS Glue service, I ended up using the "Unbox" class to unbox the string field in the DynamicFrame. It worked well for my use case.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html

unbox = Unbox.apply(frame = dynamic_dframe, path = "Properties", format="json")
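
Unbox parses the JSON string into a nested struct inside the DynamicFrame. As a rough sketch of how the fields could then be pulled out into columns (the toDF()/select step is my assumption, not part of the original answer):

unboxed = Unbox.apply(frame=dynamic_dframe, path="Properties", format="json")

# Convert the DynamicFrame to a Spark DataFrame and promote the nested fields
flat_df = unboxed.toDF().select(
    "Id",
    "LastUpdated",
    "LastUpdatedBy",
    "Properties.choices",
    "Properties.object",
    "Properties.database",
    "Properties.timestamp",
)
flat_df.show(truncate=False)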
