I'm looking for a help, how to parse json string with multiple keys to json struct, see required output.
Answer below shows how to transform JSON string with one Id :
jstr1 = '{"id_1": \[{"a": 1, "b": 2}, {"a": 3, "b": 4}\]}'How to parse and transform json string from spark data frame rows in pyspark
How to transform thousands of Ids in jstr1, jstr2, when number of Ids per JSON string change in each string.
Current Code:
jstr1 = """
{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
"""
jstr2 = """
{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}],
"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
"id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
"""
schema = "map<string, array<struct<a:int,b:int>>>"
df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) \
.withColumn('json', F.from_json(F.col('json'), schema))
output = df.withColumn("id", F.map_keys("json").getItem(0)) \
.withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)
Current output:
+-------------------+----+
|json |id |
+-------------------+----+
|[[1, 2], [3, 4]] |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+
Required output:
+---------------------+------+
| json | id |
+---------------------+------+
|[[[1, 2], [3, 4]]] | id_1 |
|[[[5, 6], [7, 8]]] | id_2 |
|[[[9,10], [11,12]]] | id_3 |
|[[[13,14], [15,16]]] | id_4 |
|[[[17,18], [19,20]]] | id_5 |
+---------------------+------+
# NOTE: There is a large number of Ids in each JSON string
# so hard coded getItem(0), getItem(1) ... is not valid solution
...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+