I have a table with a column containing an array, like this:

Student_ID | Subject_List        | New_Subject
1          | [Mat, Phy, Eng]     | Chem

I want to append the new subject to the subject list and get the new list.

Creating the DataFrame:

val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")

I have tried this with a UDF as follows:

def append_list = (arr: Seq[String], s: String) => {
  arr :+ s
}
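The UDF body is plain Scala: `:+` appends an element to a `Seq` and returns a new sequence. A minimal sketch of the same operation outside Spark (with hypothetical values) is:

```scala
// :+ returns a new Seq with the element appended; the original is unchanged.
val subjects = Seq("Mat", "Phy", "Eng")
val newList  = subjects :+ "Chem"
println(newList) // List(Mat, Phy, Eng, Chem)
```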

val append_list_UDF = udf(append_list)

val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))

With UDF, I get the required output

Student_ID | Subject_List        | New_Subject | New_List
1          | [Mat, Phy, Eng]     | Chem        | [Mat, Phy, Eng, Chem]

Can we do it without a UDF? Thanks.

2 Answers


In Spark 2.4 or later, a combination of array and concat should do the trick:

import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column

def append(arr: Column, col: Column) = concat(arr, array(col))

df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+                   
|Student_ID|   Subject_List|New_Subject|            New_List|
+----------+---------------+-----------+--------------------+
|         1|[Mat, Phy, Eng]|       Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+

but I wouldn't expect serious performance gains here.


5 Comments

But you used a UDF, right? My question was whether we can do it without a UDF.
No UDFs are defined here. append is just an alias; you can do without it: df.withColumn("New_List", concat($"Subject_List", array($"New_Subject")))
@sachav I tried your method but it gives this error - Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'concat(Subject_List, array(New_Subject))' due to data type mismatch: input to function concat should have StringType or BinaryType, but it's [array<string>, array<string>];
@GouherDanish That means you're using an outdated Spark version (2.3 or earlier). In that case, a UDF is the only option.
Yes, you are right, I am using an earlier Spark version. Thanks for your solution; I will keep this in mind and accept it.
import org.apache.spark.sql.functions.{explode, collect_list}

val df = Seq(
  (1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")
).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

// Explode the array to one row per subject, union in the new subject,
// then collect back per student. Note that collect_list after a groupBy
// does not guarantee the original element order.
val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)

