Spark: How to transform JSON string with multiple keys, from data frame rows?

Question

I'm looking for help parsing a JSON string that contains multiple keys into a JSON struct; see the required output below.

The answer to the question below shows how to transform a JSON string with a single Id:

jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'
How to parse and transform json string from spark data frame rows in pyspark

How can I transform thousands of Ids in jstr1 and jstr2 when the number of Ids varies from one JSON string to the next?

Current Code:

from pyspark.sql import Row
import pyspark.sql.functions as F

jstr1 = """
        {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], 
        "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
              """
jstr2 = """
        {"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}], 
         "id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
         "id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
          """

schema = "map<string, array<struct<a:int,b:int>>>"

df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) \
    .withColumn('json', F.from_json(F.col('json'), schema))

output = df.withColumn("id", F.map_keys("json").getItem(0)) \
            .withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)

Current output:

+-------------------+----+
|json               |id  |
+-------------------+----+
|[[1, 2], [3, 4]]   |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+

Required output:

+---------------------+------+
|         json        |  id  |
+---------------------+------+
|[[[1, 2], [3, 4]]]   | id_1 |
|[[[5, 6], [7, 8]]]   | id_2 |
|[[[9,10], [11,12]]]  | id_3 |
|[[[12,14], [15,16]]] | id_4 |
|[[[17,18], [19,10]]] | id_5 |
+---------------------+------+

# NOTE: Each JSON string contains a large number of Ids,
# so hard-coding getItem(0), getItem(1), ... is not a viable solution
                      ...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+

mck · Accepted Answer · 2021-01-30 17:16:35Z

Exploding the map column will do the job. explode turns each key/value pair of the map into its own row, however many keys each JSON string has:

import pyspark.sql.functions as F

df.select(F.explode('json').alias('id', 'json')).show()
+----+--------------------+
|  id|                json|
+----+--------------------+
|id_1|    [[1, 2], [3, 4]]|
|id_2|    [[5, 6], [7, 8]]|
|id_3| [[9, 10], [11, 12]]|
|id_4|[[12, 14], [15, 16]]|
|id_5|[[17, 18], [19, 10]]|
+----+--------------------+
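As a rough illustration of what explode does with a map column, here is a plain-Python sketch that needs no Spark installation. It only models the row-level behaviour: each key/value pair in each parsed JSON string becomes one output row. The helper name `explode_map` is made up for this example and is not part of the Spark API; the input strings are taken from the question, with jstr2 shortened to a single key.

```python
import json

# JSON strings modelled on the question's input, one per input row.
rows = [
    '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],'
    ' "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}',
    '{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}]}',
]

def explode_map(json_strings):
    """Hypothetical helper: yield one (key, value) pair per map entry."""
    for s in json_strings:
        # json.loads keeps the keys in the order they appear in the string.
        for key, value in json.loads(s).items():
            yield key, value

exploded = list(explode_map(rows))
for key, value in exploded:
    print(key, value)
```

The number of keys per string never appears in the code, which is why no hard-coded getItem(i) is needed.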

To achieve the other desired output from your previous question, you can explode one more time. This time you explode the array column that came from the map's values, then expand the struct fields with 'json.*'.

df.select(
    F.explode('json').alias('id', 'json')
).select(
    'id', F.explode('json').alias('json')
).select(
    'id', 'json.*'
).show()
+----+---+---+
|  id|  a|  b|
+----+---+---+
|id_1|  1|  2|
|id_1|  3|  4|
|id_2|  5|  6|
|id_2|  7|  8|
|id_3|  9| 10|
|id_3| 11| 12|
|id_4| 12| 14|
|id_4| 15| 16|
|id_5| 17| 18|
|id_5| 19| 10|
+----+---+---+
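The three chained selects can be read as two nested loops followed by a field lookup. The plain-Python sketch below mirrors that structure on one of the question's JSON strings. It is only a model of the logic, not Spark code, and `flatten` is a name invented for this example.

```python
import json

# Second JSON string from the question, reduced to two keys.
jstr = (
    '{"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],'
    ' "id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}'
)

def flatten(json_string):
    """Hypothetical helper: return one (id, a, b) tuple per struct."""
    parsed = json.loads(json_string)
    return [
        (key, item["a"], item["b"])       # like selecting 'id', 'json.*'
        for key, items in parsed.items()  # like the first explode (map)
        for item in items                 # like the second explode (array)
    ]

flat = flatten(jstr)
for row in flat:
    print(row)
```

Each key appears once per element of its array, matching the repeated id values in the table above.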
