
How can I "merge" multiple DataFrame columns into one as a string array?

For example, I have this dataframe:

val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")

Which looks like this:

scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|
+---+----+------+-------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)

How can I transform it so it would look like this:

scala> df.show
+---+-----------------+
| Id|             List|
+---+-----------------+
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|
+---+-----------------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- List: array (nullable = true)
 |    |-- element: string (containsNull = true)

2 Answers


Use org.apache.spark.sql.functions.array:

import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")

result.show()
// +---+------------------+
// |Id |List              |
// +---+------------------+
// |1  |[Jack, 125, Text] |
// |2  |[Mary, 152, Text2]|
// +---+------------------+
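Note that `array` expects its inputs to share a common element type, which is why the example above only combines the three string columns. If you also wanted the integer `Id` inside the list, you could cast it first. A minimal sketch (the `withId` name is illustrative, not from the answer):

```scala
import org.apache.spark.sql.functions._

// Hypothetical variation: include the integer Id in the array by
// casting it to string, so all array elements have the same type.
val withId = df.select(
  array($"Id".cast("string"), $"Name", $"Number", $"Comment") as "List"
)
```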

4 Comments

Thanks. This is the correct answer. But going forward with this, I ran into the next problem. It is not related specifically, so I created a new question. Check it out, maybe you can help me again: stackoverflow.com/questions/41245227/…
nice answer, this helps me A LOT!
I'm doing something like this but how to ignore null values while constructing the array?
@marcia12: I was looking for a similar solution. Found link and used that as the basis to filter out nulls from the array: def nonNullArray = udf((arry: Seq[String]) => if (arry.size > 0) arry.filterNot(_ == null) else null)
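For the null question raised above, a minimal sketch of a null-dropping UDF (names are illustrative; assumes a Spark version where `udf` on a Scala function is available):

```scala
import org.apache.spark.sql.functions._

// Illustrative UDF: removes null elements from a string-array column,
// passing a null array through unchanged.
val dropNulls = udf { arr: Seq[String] =>
  if (arr == null) null else arr.filterNot(_ == null)
}

// Assumed usage with the question's columns:
// df.select($"Id", dropNulls(array($"Name", $"Number", $"Comment")) as "List")
```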

It can also be done with withColumn:

import org.apache.spark.sql.{functions => F}

df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
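To fully reproduce the two-column layout from the question, write the array into a new `List` column and then drop the source columns. A short sketch:

```scala
import org.apache.spark.sql.{functions => F}

// Add the combined array as a new column, then drop the originals
// so only Id and List remain, matching the desired output.
val result = df
  .withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
  .drop("Name", "Number", "Comment")
```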

Comments
