
How can I "merge" multiple DataFrame columns into one as a string array?

For example, I have this dataframe:

val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")

Which looks like this:

scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|
+---+----+------+-------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)

How can I transform it so it would look like this:

scala> df.show
+---+-----------------+
| Id|             List|
+---+-----------------+
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|
+---+-----------------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- List: array (nullable = true)
 |    |-- element: string (containsNull = true)

2 Answers


Use org.apache.spark.sql.functions.array:

import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")

result.show()
// +---+------------------+
// |Id |List              |
// +---+------------------+
// |1  |[Jack, 125, Text] |
// |2  |[Mary, 152, Text2]|
// +---+------------------+
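Note that `array` expects its inputs to share a common element type, which is why the example above only combines the three string columns. If you also wanted the integer `Id` inside the list, you could cast it first. A minimal sketch (the `withId` name is illustrative, not from the answer):

```scala
import org.apache.spark.sql.functions._

// Hypothetical variation: include the integer Id in the array by
// casting it to string, so all array elements have the same type.
val withId = df.select(
  array($"Id".cast("string"), $"Name", $"Number", $"Comment") as "List"
)
```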

4 Comments

Thanks. This is the correct answer. But going forward with this, I ran into the next problem. It is not related specifically, so I created a new question. Check it out, maybe you can help me again: stackoverflow.com/questions/41245227/…
nice answer, this helps me A LOT!
I'm doing something like this but how to ignore null values while constructing the array?
@marcia12: I was looking for a similar solution. Found link and used that as the basis to filter out nulls from the array: def nonNullArray = udf((arry: Seq[String]) => if (arry.size > 0) arry.filterNot(_ == null) else null)
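For the null question raised above, a minimal sketch of a null-dropping UDF (names are illustrative; assumes a Spark version where `udf` on a Scala function is available):

```scala
import org.apache.spark.sql.functions._

// Illustrative UDF: removes null elements from a string-array column,
// passing a null array through unchanged.
val dropNulls = udf { arr: Seq[String] =>
  if (arr == null) null else arr.filterNot(_ == null)
}

// Assumed usage with the question's columns:
// df.select($"Id", dropNulls(array($"Name", $"Number", $"Comment")) as "List")
```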

It can also be done with withColumn:

import org.apache.spark.sql.{functions => F}

df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
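To fully reproduce the two-column layout from the question, write the array into a new `List` column and then drop the source columns. A short sketch:

```scala
import org.apache.spark.sql.{functions => F}

// Add the combined array as a new column, then drop the originals
// so only Id and List remain, matching the desired output.
val result = df
  .withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
  .drop("Name", "Number", "Comment")
```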

Comments
