
Let's say I have a dataframe which looks like this:

+---+-----+--------------------------------------------------------------+
|id |Name |Payment                                                       |
+---+-----+--------------------------------------------------------------+
|1  |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|
+---+-----+--------------------------------------------------------------+

And the schema is:

root
 |-- id: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Payment: string (nullable = true)

How can I explode the above JSON array into below:

+---+-----+---------------------------+
|id |Name |Payment                    |
+---+-----+---------------------------+
|1  |James|{"@id":1, "currency":"GBP"}|
|1  |James|{"@id":2, "currency":"USD"}|
+---+-----+---------------------------+

I've been trying to use explode, as below, but it isn't working. It gives an error saying that string types cannot be exploded and that it expects a map or array. That makes sense given the schema shows Payment is a string rather than an array/map, but I'm not sure how to convert it into the appropriate format.

val newDF = dataframe.withColumn("nestedPayment", explode(dataframe.col("Payment")))

Any help is greatly appreciated!


4 Answers


You'll have to parse the JSON string into an array of JSONs, and then use explode on the result (explode expects an array).

To do that (assuming Spark 2.0.*):

  • If you know all Payment values contain JSON representing an array of the same size (e.g. 2 in this case), you can hard-code extraction of the first and second elements, wrap them in an array and explode:

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for the $"..." column syntax

    val newDF = dataframe.withColumn("Payment", explode(array(
      get_json_object($"Payment", "$[0]"),
      get_json_object($"Payment", "$[1]")
    )))
    
  • If you can't guarantee all records have a JSON array with exactly two elements, but you can guarantee a maximum length for these arrays, you can use this trick to parse elements up to the maximum size and then filter out the resulting nulls (a self-contained sketch follows this list):

    val maxJsonParts = 3 // whatever that number is...
    val jsonElements = (0 until maxJsonParts)
                         .map(i => get_json_object($"Payment", s"$$[$i]"))
    
    val newDF = dataframe
      .withColumn("Payment", explode(array(jsonElements: _*)))
      .where(!isnull($"Payment")) 
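
For completeness, here's a self-contained sketch of that second approach. It assumes a SparkSession named spark is in scope, and the commented output is approximate (get_json_object returns each element as a minified JSON string):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val dataframe = Seq(
      (1, "James", """[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]""")
    ).toDF("id", "Name", "Payment")

    val maxJsonParts = 3 // assumed upper bound on the array length
    val jsonElements = (0 until maxJsonParts)
      .map(i => get_json_object($"Payment", s"$$[$i]"))

    dataframe
      .withColumn("Payment", explode(array(jsonElements: _*)))
      .where(!isnull($"Payment"))
      .show(false)
    // +---+-----+--------------------------+
    // |id |Name |Payment                   |
    // +---+-----+--------------------------+
    // |1  |James|{"@id":1,"currency":"GBP"}|
    // |1  |James|{"@id":2,"currency":"USD"}|
    // +---+-----+--------------------------+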
    

4 Comments

Is there a way to do this with a while loop? It seems like it would be more efficient.
The supposed performance improvement achieved by a while loop would be so small it's probably unmeasurable. This being a Spark application, one can assume that runtime is dominated by the actual DataFrame operations and not the driver-side code that builds them. Such "premature optimizations" only make code harder to read.
Hello, if I don't know the max length of my array, how can I do something like val jsonElements = (0 until arrayLength).map(i => get_json_object($"Payment", s"$$[$i]"))?
@TzachZohar How can we calculate the size of the JSON array using get_json_object()? I tried get_json_object(col("col_name"), "$.length()"), but it didn't work and gives null.
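
One possible answer to the last two comments, not from the original thread: on Spark 2.4+ (where from_json accepts an array-of-strings schema), the array length can be computed per row with size, and the maximum taken on the driver. A hedged sketch, with paymentCount and maxJsonParts as illustrative names:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Parse the JSON string into array<string> and measure it per row (Spark 2.4+).
val withCount = dataframe.withColumn("paymentCount",
  size(from_json($"Payment", ArrayType(StringType))))

// Take the largest per-row length as the bound for the trick above.
val maxJsonParts = withCount.agg(max($"paymentCount")).head.getInt(0)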
Alternatively, parse the string with from_json using an array-of-strings schema and explode the result:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val newDF = dataframe.withColumn("Payment",
  explode(
    from_json(
      get_json_object($"Payment", "$."),
      ArrayType(StringType))))
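
A version note, not from the original answer: from_json only accepts ArrayType(StringType) as its schema from Spark 2.4 onward; on earlier versions it fails with the AnalysisException quoted in the next answer. On 2.4+ the get_json_object wrapper can likely be dropped entirely:

// Spark 2.4+ sketch: parse the JSON string straight into array<string>, then explode.
val newDF = dataframe.withColumn("Payment",
  explode(from_json($"Payment", ArrayType(StringType))))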

1 Comment

Instead of just downvoting, please leave a comment so I know what's wrong with my answer. This is my first post and it's quite discouraging when all you want to do is help. Thank you.

My solution is to wrap the JSON array string in an enclosing JSON object, so that from_json can be used with a struct type containing an array of strings:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val dataframe = spark.sparkContext.parallelize(Seq(("1", "James", "[ {\"@id\": 1, \"currency\":\"GBP\"},{\"@id\": 2, \"currency\": \"USD\"} ]"))).toDF("id", "Name", "Payment")
val result = dataframe.withColumn("wrapped_json", concat_ws("", lit("{\"array\":"), col("Payment"), lit("}")))
    .withColumn("array_json", from_json(col("wrapped_json"), StructType(Seq(StructField("array", ArrayType(StringType))))))
    .withColumn("result", explode(col("array_json.array")))

Result:

+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
|id |Name |Payment                                                       |wrapped_json                                                            |array_json                                                |result                    |
+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
|1  |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|{"array":[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]}|[[{"@id":1,"currency":"GBP"}, {"@id":2,"currency":"USD"}]]|{"@id":1,"currency":"GBP"}|
|1  |James|[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|{"array":[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]}|[[{"@id":1,"currency":"GBP"}, {"@id":2,"currency":"USD"}]]|{"@id":2,"currency":"USD"}|
+---+-----+--------------------------------------------------------------+------------------------------------------------------------------------+----------------------------------------------------------+--------------------------+
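
To finish with just the shape asked for in the question, the intermediate columns can be dropped with a select (a small follow-up, not in the original answer):

val newDF = result.select(col("id"), col("Name"), col("result").as("Payment"))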

I am using Spark 2.3.2, and Kudakwashe Nyatsanza's solution did not work for me. It throws org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(value)' due to data type mismatch: Input schema array<string> must be a struct or an array of structs.



You can define the schema of the Payment JSON array using ArrayType.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val paymentSchema = ArrayType(StructType(Array(
  StructField("@id", DataTypes.IntegerType),
  StructField("currency", DataTypes.StringType)
)))

Then exploding after using from_json with this schema will return the desired result.

val newDF = dataframe.withColumn("Payment", explode(from_json($"Payment", paymentSchema)))
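
With this schema, Payment becomes a struct column rather than a JSON string, so its fields can be accessed directly. A small usage sketch (the exact rendering of structs in show varies across Spark versions):

newDF.select($"id", $"Name", $"Payment.currency").show()
// +---+-----+--------+
// |id |Name |currency|
// +---+-----+--------+
// |  1|James|     GBP|
// |  1|James|     USD|
// +---+-----+--------+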

