
I have a schema that looks like this:

root
 |-- FirstName: string (nullable = true)
 |-- Age: integer (nullable = true)

I want to change this schema and write this data to a file so it prints out like this:

["Alice", 22],
["Bob", 21],
["Charlie", 23]

As you can see, each line is still valid JSON.

It seems like a data frame always has to have a list of columns. If I do something like:

df.write.json("/path")

Then I always get JSON objects like this:

{"FirstName":"Alice","Age":22}
{"FirstName":"Bob","Age":21}

I think the way to do this is to convert the data frame to a typed Dataset and then manually construct each string the way I want it, but that doesn't seem very functional.
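
For example, a manual version of that idea might look roughly like this (just a sketch, assuming spark.implicits._ is in scope; the output path is a placeholder):

    import spark.implicits._  // provides the String encoder that map needs

    // Build each line by hand, keeping Age unquoted so the output is ["Alice",22]
    val manual = df.map { row =>
      val firstName = row.getAs[String]("FirstName")
      val age = row.getAs[Int]("Age")
      s"""["$firstName",$age]"""
    }

    manual.write.text("./manualOutput")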

Here's what I tried:

val df2 = df.withColumn("NewColumn", array(col("FirstName"), col("Age")))
        .select("NewColumn")

df2.write.json("./output.json")

Unfortunately, this gave me output like this:

 {"NewColumn":["Alice",22]}
 {"NewColumn":["Bob",21]}

I then tried outputting as text like this:

    val df2 = df.withColumn("NewColumn", concat(
      lit("[\""),
      col("FirstName"),
      lit("\",\""),
      col("Age"),
      lit("\"]")))
      .select(col("NewColumn"))

    df2.write.text("./myFile.txt")

This time, it looks like this:

["Alice","22"]
["Bob","21"]

This is better, but surely I don't have to concatenate literal characters together to get output in this format?

1 Answer

Instead of using concat, convert your data to an array and then either cast it to a string or convert it to JSON using the to_json function.

Try the code below.

scala> val df = Seq(("Alice",22),("Bob",21),("Charlie",23)).toDF("firstName","age")
scala> df.show(false)
+---------+---+
|firstName|age|
+---------+---+
|Alice    |22 |
|Bob      |21 |
|Charlie  |23 |
+---------+---+

Casting to String

scala> df
.withColumn("data", array($"firstName", $"age"))
.select($"data".cast("string"))
.repartition(1)
.write
.mode("overwrite")
.text("/tmp/data")
scala> import sys.process._
import sys.process._

scala> "ls -ltr /tmp/data".!
total 4
-rw-r--r-- 1 root root 36 Oct 17 07:02 part-00000-0c92e1d1-0e34-4664-baed-1c4d3d6db10b-c000.txt
-rw-r--r-- 1 root root  0 Oct 17 07:02 _SUCCESS

scala> "cat /tmp/data/part-00000-0c92e1d1-0e34-4664-baed-1c4d3d6db10b-c000.txt".!
[Alice, 22]
[Bob, 21]
[Charlie, 23]

Converting to a JSON string using the to_json function.

df
.withColumn("data", array($"firstName", $"age"))
.select(to_json($"data").as("data"))
.repartition(1)
.write
.mode("overwrite")
.text("/tmp/data")
scala> "ls -ltr /tmp/data".!
total 4
-rw-r--r-- 1 root root 45 Oct 17 07:08 part-00000-c2156948-65cc-4d18-8ae1-4d548f06dd3a-c000.txt
-rw-r--r-- 1 root root  0 Oct 17 07:08 _SUCCESS
scala> "cat /tmp/data/part-00000-c2156948-65cc-4d18-8ae1-4d548f06dd3a-c000.txt".!
["Alice","22"]
["Bob","21"]
["Charlie","23"]