
I have a schema that looks like this:

root
 |-- FirstName: string (nullable = true)
 |-- Age: integer (nullable = true)

I want to change this schema and write this data to a file so it prints out like this:

["Alice", 22],
["Bob", 21],
["Charlie", 23]

As you can see, each line is still valid JSON.

It seems like a data frame always has to have a list of columns. If I do something like:

df.write.json("/path")

Then I always get JSON objects like this:

{"FirstName":"Alice","Age":22}
{"FirstName":"Bob","Age":21}

I think the way to do this is to convert the data frame to a typed Dataset and then manually construct each string the way I want it, but that doesn't seem very functional.
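
For example, a manual version of that idea might look roughly like this (just a sketch, assuming spark.implicits._ is in scope; the output path is a placeholder):

    import spark.implicits._  // provides the String encoder that map needs

    // Build each line by hand, keeping Age unquoted so the output is ["Alice",22]
    val manual = df.map { row =>
      val firstName = row.getAs[String]("FirstName")
      val age = row.getAs[Int]("Age")
      s"""["$firstName",$age]"""
    }

    manual.write.text("./manualOutput")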

Here's what I tried:

val df2 = df.withColumn("NewColumn", array(col("FirstName"), col("Age")))
        .select("NewColumn")

df2.write.json("./output.json")

Unfortunately, this gave me output like this:

 {"NewColumn":["Alice",22]}
 {"NewColumn":["Bob",21]}

I then tried outputting as text like this:

    val df2 = df.withColumn("NewColumn", concat(
      lit("[\""),
      col("FirstName"),
      lit("\",\""),
      col("Age"),
      lit("\"]")))
      .select(col("NewColumn"))

    df2.write.text("./myFile.txt")

This time, it looks like this:

["Alice","22"]
["Bob","21"]

This is better, but surely I don't have to concatenate literal characters together to get output in this format?

1 Answer

Instead of using concat, convert your data to an array and then either cast it to a string or convert it to JSON using the to_json function.

Try the code below.

scala> val df = Seq(("Alice",22),("Bob",21),("Charlie",23)).toDF("firstName","age")
scala> df.show(false)
+---------+---+
|firstName|age|
+---------+---+
|Alice    |22 |
|Bob      |21 |
|Charlie  |23 |
+---------+---+

Casting to String

scala> df
.withColumn("data", array($"firstName", $"age"))
.select($"data".cast("string"))
.repartition(1)
.write
.mode("overwrite")
.text("/tmp/data")
scala> import sys.process._
import sys.process._

scala> "ls -ltr /tmp/data".!
total 4
-rw-r--r-- 1 root root 36 Oct 17 07:02 part-00000-0c92e1d1-0e34-4664-baed-1c4d3d6db10b-c000.txt
-rw-r--r-- 1 root root  0 Oct 17 07:02 _SUCCESS

scala> "cat /tmp/data/part-00000-0c92e1d1-0e34-4664-baed-1c4d3d6db10b-c000.txt".!
[Alice, 22]
[Bob, 21]
[Charlie, 23]

Converting to a JSON string using the to_json function.

df
.withColumn("data", array($"firstName", $"age"))
.select(to_json($"data").as("data"))
.repartition(1)
.write
.mode("overwrite")
.text("/tmp/data")
scala> "ls -ltr /tmp/data".!
total 4
-rw-r--r-- 1 root root 45 Oct 17 07:08 part-00000-c2156948-65cc-4d18-8ae1-4d548f06dd3a-c000.txt
-rw-r--r-- 1 root root  0 Oct 17 07:08 _SUCCESS
scala> "cat /tmp/data/part-00000-c2156948-65cc-4d18-8ae1-4d548f06dd3a-c000.txt".!
["Alice","22"]
["Bob","21"]
["Charlie","23"]