I'm looking for help parsing a JSON string with multiple keys into a JSON struct; see the required output below.

The answer below shows how to transform a JSON string with one Id:

How can I transform thousands of Ids in jstr1 and jstr2 when the number of Ids changes from one JSON string to the next?

Current Code:

from pyspark.sql import Row
import pyspark.sql.functions as F

jstr1 = """
        {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], 
        "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
              """
jstr2 = """
        {"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}], 
         "id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
         "id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
          """

schema = "map<string, array<struct<a:int,b:int>>>"

df = sqlContext.createDataFrame([Row(json=jstr1),Row(json=jstr2)]) \
    .withColumn('json', F.from_json(F.col('json'), schema))

output = df.withColumn("id", F.map_keys("json").getItem(0)) \
            .withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)

Current output:

+-------------------+----+
|json               |id  |
+-------------------+----+
|[[1, 2], [3, 4]]   |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+

Required output:

+---------------------+------+
|         json        |  id  |
+---------------------+------+
|[[[1, 2], [3, 4]]]   | id_1 |
|[[[5, 6], [7, 8]]]   | id_2 |
|[[[9,10], [11,12]]]  | id_3 |
|[[[12,14], [15,16]]] | id_4 |
|[[[17,18], [19,10]]] | id_5 |
+---------------------+------+

# NOTE: There is a large number of Ids in each JSON string
# so hard coded getItem(0), getItem(1) ... is not valid solution
                      ...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+ 

1 Answer

Exploding the map column will do the job. It yields one row per map key, regardless of how many keys each JSON string holds:

import pyspark.sql.functions as F

df.select(F.explode('json').alias('id', 'json')).show()

+----+--------------------+
|  id|                json|
+----+--------------------+
|id_1|    [[1, 2], [3, 4]]|
|id_2|    [[5, 6], [7, 8]]|
|id_3| [[9, 10], [11, 12]]|
|id_4|[[12, 14], [15, 16]]|
|id_5|[[17, 18], [19, 10]]|
+----+--------------------+
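
If it helps to see what the explode is doing without a Spark session, here is a rough plain-Python sketch of the same reshaping. The helper name `explode_map` is made up for illustration and is not part of PySpark; it only mimics how exploding a map column emits one row per key/value pair.

```python
import json

# Same sample data as in the question, written as single-line JSON strings.
jstr1 = ('{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], '
         '"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}')
jstr2 = ('{"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}], '
         '"id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}], '
         '"id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}')


def explode_map(json_strings):
    """Hypothetical helper: yield one (id, value) pair per map entry.

    The number of keys per string does not matter, which is why no
    hard-coded getItem(0), getItem(1), ... is needed.
    """
    for s in json_strings:
        for key, value in json.loads(s).items():
            yield key, value


rows = list(explode_map([jstr1, jstr2]))
for key, value in rows:
    # Print each struct as [a, b] to resemble the Spark output.
    print(key, [[d["a"], d["b"]] for d in value])
```

Each input string contributes as many rows as it has keys, so two keys in jstr1 and three in jstr2 give five rows in total.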

To get the other output you asked for in your previous question, explode once more. This time, explode the array column that came from the map's values, then expand the struct fields into columns.

df.select(
    F.explode('json').alias('id', 'json')
).select(
    'id', F.explode('json').alias('json')
).select(
    'id', 'json.*'
).show()

+----+---+---+
|  id|  a|  b|
+----+---+---+
|id_1|  1|  2|
|id_1|  3|  4|
|id_2|  5|  6|
|id_2|  7|  8|
|id_3|  9| 10|
|id_3| 11| 12|
|id_4| 12| 14|
|id_4| 15| 16|
|id_5| 17| 18|
|id_5| 19| 10|
+----+---+---+
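
As a rough plain-Python sketch of the two explodes above (again, `explode_twice` is a made-up name for illustration, not a PySpark function): the outer loop plays the role of exploding the map, the inner loop the role of exploding the array, and pulling out `a` and `b` corresponds to selecting 'json.*'.

```python
import json

# First JSON string from the question, as a single-line literal.
jstr1 = ('{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], '
         '"id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}')


def explode_twice(json_strings):
    """Hypothetical helper: yield one flat (id, a, b) tuple per struct."""
    for s in json_strings:
        # First explode: one entry per map key.
        for key, structs in json.loads(s).items():
            # Second explode: one entry per array element.
            for item in structs:
                # Equivalent of 'json.*': spread struct fields into columns.
                yield key, item["a"], item["b"]


flat_rows = list(explode_twice([jstr1]))
for row in flat_rows:
    print(row)
```

The id is repeated on every row that came from the same key, just as in the Spark output.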