I want to use a Spark job to reformat a JSON structure into one that contains an array of objects. My input file contains the lines:

{ "keyvals" : [[1,"a"], [2, "b"]] }, 
{ "keyvals" : [[3,"c"], [4, "d"]] }

and I want my process to output

{ "keyvals": [{"id": 1, "value": "a"}, {"id": 2, "value": "c"}] },
{ "keyvals": [{"id": 3, "value": "c"}, {"id": 4, "value": "d"}] }

What's the best way to do that?

To reproduce the example input, you can run the following in the Scala spark-shell:

val jsonStrings = Seq("""{ "keyvals" : [[1,"a"], [2, "b"]] }""", """{ "keyvals" : [[3,"c"], [4, "d"]] }""")
val inputRDD = sc.parallelize(jsonStrings)
val df = spark.read.json(inputRDD)
// reformat goes here ?
df.write.json("myfile.json")

thanks

  • Did you try anything? to_json maybe? Please produce a minimal reproducible example. Commented Jun 5, 2018 at 13:25
  • How would to_json transform [[1,"a"], [2, "b"]] => [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]? There needs to be a transformation of the data structure. Commented Jun 5, 2018 at 13:27

1 Answer

If you check the schema, you'll see that this structure is actually mapped to array<array<string>>:

df.printSchema
// root
//  |-- keyvals: array (nullable = true)
//  |    |-- element: array (containsNull = true)
//  |    |    |-- element: string (containsNull = true)

Unless the number of elements is fixed, you'll need a udf:

import org.apache.spark.sql.functions._

case class Record(id: Long, value: String)

// convert each inner [id, value] pair into a Record struct
val parse = udf((xs: Seq[Seq[String]]) => xs.map {
  case Seq(id, value) => Record(id.toLong, value)
})

val result = df.select(parse($"keyvals").alias("keyvals"))

and the result can be converted with toJSON:

result.toJSON.toDF("keyvals").show(false)
// +-------------------------------------------------------+
// |keyvals                                                |
// +-------------------------------------------------------+
// |{"keyvals":[{"id":1,"value":"a"},{"id":2,"value":"b"}]}|
// |{"keyvals":[{"id":3,"value":"c"},{"id":4,"value":"d"}]}|
// +-------------------------------------------------------+

or written using the JSON writer (result.write.json).
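
For example (a minimal sketch, reusing the myfile.json path from the question; note that Spark writes a directory of part files rather than a single file):

// writes line-delimited JSON; "myfile.json" ends up as a directory of part files
result.write.json("myfile.json")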

It is also possible to use a strongly typed Dataset:

// in the spark-shell the required encoders are already in scope;
// in a standalone application add: import spark.implicits._
df.as[Seq[Seq[String]]].map { xs =>
  xs.map { case Seq(id, value) => Record(id.toLong, value) }
}.toDF("keyvals")