I have a table with a column containing an array, like this:

Student_ID | Subject_List        | New_Subject
1          | [Mat, Phy, Eng]     | Chem

I want to append the new subject to the subject list and get the new list.

Creating the DataFrame:

val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")

I have tried this with a UDF as follows:

def append_list = (arr: Seq[String], s: String) => {
  arr :+ s
}
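The UDF body is plain Scala: `:+` appends an element to a `Seq` and returns a new sequence. A minimal sketch of the same operation outside Spark (with hypothetical values) is:

```scala
// :+ returns a new Seq with the element appended; the original is unchanged.
val subjects = Seq("Mat", "Phy", "Eng")
val newList  = subjects :+ "Chem"
println(newList) // List(Mat, Phy, Eng, Chem)
```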

val append_list_UDF = udf(append_list)

val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))

With UDF, I get the required output

Student_ID | Subject_List        | New_Subject | New_List
1          | [Mat, Phy, Eng]     | Chem        | [Mat, Phy, Eng, Chem]

Can we do it without a UDF? Thanks.

2 Answers


In Spark 2.4 or later, a combination of array and concat should do the trick:

import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column

def append(arr: Column, col: Column) = concat(arr, array(col))

df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+                   
|Student_ID|   Subject_List|New_Subject|            New_List|
+----------+---------------+-----------+--------------------+
|         1|[Mat, Phy, Eng]|       Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+

but I wouldn't expect serious performance gains here.


5 Comments

But you used a UDF, right? My question was whether we can do it without a UDF.
No UDFs are defined here. append is just an alias; you can do without it: df.withColumn("New_List", concat($"Subject_List", array($"New_Subject")))
@sachav I tried your method but it gives this error - Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'concat(Subject_List, array(New_Subject))' due to data type mismatch: input to function concat should have StringType or BinaryType, but it's [array<string>, array<string>];
@GouherDanish That means you're using an outdated Spark version (2.3 or earlier). In that case, a UDF is the only option.
Yes, you are right, I am using an earlier Spark version. Thanks for your solution; I will keep this in mind and accept it.
import org.apache.spark.sql.functions.{explode, collect_list}

val df = Seq(
  (1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")
).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

// Explode the array to one row per subject, union in the new subject,
// then collect back per student. Note that collect_list after a groupBy
// does not guarantee the original element order.
val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")
final_df.show(false)

