0

I created a DataFrame as follows:

import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  (1, List(1,2,3)),
  (1, List(5,7,9)),
  (2, List(4,5,6)),
  (2, List(7,8,9)),
  (2, List(10,11,12)) 
).toDF("id", "list")

val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))
df1.show(false)

Then I tried to convert the WrappedArray row value to string as follows:

import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))

val d = df1.withColumn("col1", arrayToString($"col1"))
d: org.apache.spark.sql.DataFrame = [id: int, col1: string]

scala> d.show(false)
+---+----------------------------+
|id |col1                        |
+---+----------------------------+
|1  |1, 2, 3, 5, 7, 9            |
|2  |4, 5, 6, 7, 8, 9, 10, 11, 12|
+---+----------------------------+

What I really want is to generate an output like the following:

+---+----------------------------+
|id |col1                        |
+---+----------------------------+
|1  |1$2$3, 5$7$ 9               |
|2  |4$5$6, 7$8$9, 10$11$12      |
+---+----------------------------+

How can I achieve this?

1 Answer 1

4

You don't need a udf function, a simple concat_ws should do the trick for you as

import org.apache.spark.sql.functions._
val df1 = df.withColumn("list", concat_ws("$", col("list")))
            .groupBy("id")
            .agg(concat_ws(", ", collect_set($"list")).as("col1"))

df1.show(false)

which should give you

+---+----------------------+
|id |col1                  |
+---+----------------------+
|1  |1$2$3, 5$7$9          |
|2  |7$8$9, 4$5$6, 10$11$12|
+---+----------------------+

As usual, udf function should be avoided if inbuilt functions are available since udf function would require serialization and deserialization of column data to primitive types for calculation and from primitives to columns respectively

even more concise you can avoid the withColumn step as

val df1 = df.groupBy("id")
            .agg(concat_ws(", ", collect_set(concat_ws("$", col("list")))).as("col1"))

I hope the answer is helpful

Sign up to request clarification or add additional context in comments.

1 Comment

This is a better approach, I should have considered everything and not only the udf :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.