
I have JSON in a column of a Spark DataFrame, as follows:

ID|           Text|           JSON
------------------------------------------------------------------------------
1|             xyz|          [{"Hour": 1, "Total": 10, "Fail": 1}, {"Hour": 2, "Total": 40, "Fail": 4}, {"Hour": 3, "Total": 20, "Fail": 2}]

I'm using the following schema:

val schema = StructType(Array(StructField("Hour", IntegerType),
   StructField("Total", IntegerType), StructField("Fail", IntegerType)))

I'm using the following code to parse the DataFrame and output the JSON as multiple columns:

val newDF = DF.withColumn("JSON", from_json(col("JSON"), schema)).select(col("JSON.*"))
newDF.show()

The above code parses only a single record from the JSON array, but I want it to parse all the records.

The output is as follows:

Hour|       Total|       Fail|
-------------------------------
   1|          10|          1|
-------------------------------

But, I want the output to be as follows:

Hour|       Total|       Fail|
-------------------------------
   1|          10|          1|
   2|          40|          4|
   3|          20|          2|
-------------------------------

Can someone please let me know what I'm missing?

Thanks in advance.

  • Is the original column JSON an array or just a plain string? – Commented May 24, 2018 at 8:48

2 Answers


If I interpret your sample data correctly, your JSON column is a sequence of JSON elements with your posted schema. You'll need to explode the column before applying from_json as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq(
  (1, "xyz", Seq("""{"Hour": 1, "Total": 10, "Fail": 1}""",
                 """{"Hour": 2, "Total": 40, "Fail": 4}""",
                 """{"Hour": 3, "Total": 20, "Fail": 2}""")
  )).toDF("ID", "Text", "JSON")

val jsonSchema = StructType(Array(
  StructField("Hour", IntegerType),
  StructField("Total", IntegerType),
  StructField("Fail", IntegerType)
))

df.
  withColumn("JSON", explode(col("JSON"))).                // one row per JSON string
  withColumn("JSON", from_json(col("JSON"), jsonSchema)).  // parse each string into a struct
  select("JSON.*").                                        // flatten the struct into columns
  show
// +----+-----+----+
// |Hour|Total|Fail|
// +----+-----+----+
// |   1|   10|   1|
// |   2|   40|   4|
// |   3|   20|   2|
// +----+-----+----+
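
If you also want to keep the original ID and Text columns next to the parsed fields, a small variation of the same pipeline (just a sketch, reusing df and jsonSchema from above) would be:

// Sketch: carry ID and Text through alongside the exploded struct fields
df.
  withColumn("JSON", explode(col("JSON"))).
  withColumn("JSON", from_json(col("JSON"), jsonSchema)).
  select("ID", "Text", "JSON.*").
  show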


Alternatively, you can let Spark infer the schema of the JSON array elements and then explode the parsed result:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "xyz",
    """[{"Hour": 1, "Total": 10, "Fail": 1},{"Hour": 2, "Total": 40, "Fail": 4},{"Hour": 3, "Total": 20, "Fail": 2}]"""
  )).toDF("ID", "Text", "JSON1")

df.printSchema()

// Infer the schema of the JSON elements directly from the strings in JSON1
val schema_return = spark.read.json(df.select("JSON1").as[String]).schema

// Parse the array, explode it into one row per element, and flatten the struct
val parsed = df
  .withColumn("JSON", from_json(col("JSON1"), ArrayType(schema_return)))
  .select(explode(col("JSON")).as("test"))
  .select("test.*")

parsed.show(truncate = false)
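
If the element schema is already known (as in the question), a possible variation is to pass an explicit ArrayType instead of inferring it with spark.read.json, which avoids the extra job that schema inference triggers. A sketch, reusing the df with the JSON1 column from above (elementSchema is just a name chosen here):

// Sketch: explicit schema for the array elements instead of inference
val elementSchema = StructType(Array(
  StructField("Hour", IntegerType),
  StructField("Total", IntegerType),
  StructField("Fail", IntegerType)
))

df.select(explode(from_json(col("JSON1"), ArrayType(elementSchema))).as("test"))
  .select("test.*")
  .show(truncate = false)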


