
I have JSON in a column of a Spark DataFrame, as follows:

ID|           Text|           JSON
------------------------------------------------------------------------------
1|             xyz|          [{"Hour": 1, "Total": 10, "Fail": 1}, {"Hour": 2, "Total": 40, "Fail": 4}, {"Hour": 3, "Total": 20, "Fail": 2}]

I'm using the following schema:

val schema = StructType(Array(StructField("Hour", IntegerType),
   StructField("Total", IntegerType), StructField("Fail", IntegerType)))

I'm using the following code to parse the DataFrame and output the JSON as multiple columns:

val newDF = DF.withColumn("JSON", from_json(col("JSON"), schema)).select(col("JSON.*"))
newDF.show()

The above code parses only a single record from the JSON array, but I want it to parse all the records.

The output is as follows:

Hour|       Total|       Fail|
-------------------------------
   1|          10|          1|
-------------------------------

But, I want the output to be as follows:

Hour|       Total|       Fail|
-------------------------------
   1|          10|          1|
   2|          40|          4|
   3|          20|          2|
-------------------------------

Can someone please let me know what I'm missing?

Thanks in advance.

  • Is the original column JSON an array or just a plain string? – Commented May 24, 2018 at 8:48

2 Answers


If I interpret your sample data correctly, your JSON column is a sequence of JSON elements with your posted schema. You'll need to explode the column before applying from_json as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq(
  (1, "xyz", Seq("""{"Hour": 1, "Total": 10, "Fail": 1}""",
                 """{"Hour": 2, "Total": 40, "Fail": 4}""",
                 """{"Hour": 3, "Total": 20, "Fail": 2}""")
  )).toDF("ID", "Text", "JSON")

val jsonSchema = StructType(Array(
  StructField("Hour", IntegerType),
  StructField("Total", IntegerType),
  StructField("Fail", IntegerType)
))

df.
  withColumn("JSON", explode(col("JSON"))).                // one row per JSON string
  withColumn("JSON", from_json(col("JSON"), jsonSchema)).  // parse each string into a struct
  select("JSON.*").                                        // flatten the struct into columns
  show
// +----+-----+----+
// |Hour|Total|Fail|
// +----+-----+----+
// |   1|   10|   1|
// |   2|   40|   4|
// |   3|   20|   2|
// +----+-----+----+
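
If you also want to keep the original ID and Text columns next to the parsed fields, a small variation of the same pipeline (just a sketch, reusing df and jsonSchema from above) would be:

// Sketch: carry ID and Text through alongside the exploded struct fields
df.
  withColumn("JSON", explode(col("JSON"))).
  withColumn("JSON", from_json(col("JSON"), jsonSchema)).
  select("ID", "Text", "JSON.*").
  show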


Alternatively, you can let Spark infer the schema of the JSON array elements and then explode the parsed result:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "xyz",
    """[{"Hour": 1, "Total": 10, "Fail": 1},{"Hour": 2, "Total": 40, "Fail": 4},{"Hour": 3, "Total": 20, "Fail": 2}]"""
  )).toDF("ID", "Text", "JSON1")

df.printSchema()

// Infer the schema of the JSON elements directly from the strings in JSON1
val schema_return = spark.read.json(df.select("JSON1").as[String]).schema

// Parse the array, explode it into one row per element, and flatten the struct
val parsed = df
  .withColumn("JSON", from_json(col("JSON1"), ArrayType(schema_return)))
  .select(explode(col("JSON")).as("test"))
  .select("test.*")

parsed.show(truncate = false)
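
If the element schema is already known (as in the question), a possible variation is to pass an explicit ArrayType instead of inferring it with spark.read.json, which avoids the extra job that schema inference triggers. A sketch, reusing the df with the JSON1 column from above (elementSchema is just a name chosen here):

// Sketch: explicit schema for the array elements instead of inference
val elementSchema = StructType(Array(
  StructField("Hour", IntegerType),
  StructField("Total", IntegerType),
  StructField("Fail", IntegerType)
))

df.select(explode(from_json(col("JSON1"), ArrayType(elementSchema))).as("test"))
  .select("test.*")
  .show(truncate = false)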


