I was looking at this excellent question and its answer in order to improve my Scala skills: Extract a column value and assign it to another column as an array in spark dataframe
I created my modified version of the code as follows, which works, but I am left with a few questions:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")

// Collect the distinct values of column b back to the driver as a List[Int].
val uniqueVal = df.select("b").distinct().map(x => x.getAs[Int](0)).collect.toList

// A function that takes an Int but always returns uniqueVal, wrapped as a UDF.
def myfun: Int => List[Int] = _ => uniqueVal
def myfun_udf = udf(myfun)

df.withColumn("X", myfun_udf(col("b"))).show
+---+---+---+---------+
| ID| a| b| X|
+---+---+---+---------+
| r1| 1| 1|[1, 4, 2]|
| r2| 6| 4|[1, 4, 2]|
| r3| 4| 1|[1, 4, 2]|
| r4| 1| 2|[1, 4, 2]|
+---+---+---+---------+
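One thing I checked myself (not part of the original answer): myfun never looks at its argument, because uniqueVal was already collected on the driver, so the UDF is effectively a constant function:

myfun(1)    // List(1, 4, 2) (order may vary after distinct)
myfun(999)  // same List(1, 4, 2): the Int argument is discarded by `_ => uniqueVal`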
It works, but:
- I note that column b is referenced twice: once to build uniqueVal and again as the argument to the UDF.
- I can also pass column a in the withColumn statement and get exactly the same result, e.g. the following, so what is the point of the column argument?
df.withColumn("X", myfun_udf( col("a") )).show
- If I pass in col("ID") instead, then X comes out as null (see the cast experiment after this list).
- So, I am wondering why the second column is an input at all?
- And how could this be made to work generically for all columns? (I attempt a sketch at the end.)
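On the null result for col("ID"): my understanding (an assumption on my part, not something the linked answer states) is that Spark casts the input column to IntegerType to match the UDF's Int => List[Int] signature; the strings "r1" .. "r4" cast to null, and for a null primitive input Spark returns null without calling the UDF. A quick check of the cast alone:

df.select(col("ID").cast("int")).show
// every row is null, because "r1", "r2", ... do not parse as integers,
// which matches the null X column I observed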
So, this was code that I looked at elsewhere, but I am clearly missing something about how the UDF input is actually used.
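For the generic case, the closest I have got is to drop the UDF (and the dummy input column) entirely and embed the collected list with typedLit. The helper below is my own sketch, not from the linked answer; the name withDistinctValues is mine, and it assumes you can name the column's Scala type at the call site:

import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.typedLit

// Collect the distinct values of colName on the driver and attach them
// to every row as a constant array column; no UDF argument is needed.
def withDistinctValues[T: TypeTag](df: DataFrame, colName: String, newCol: String): DataFrame = {
  val vals = df.select(colName).distinct().collect().map(_.getAs[T](0)).toList
  df.withColumn(newCol, typedLit(vals))
}

withDistinctValues[Int](df, "b", "X").show       // same X column as the UDF version
withDistinctValues[String](df, "ID", "ids").show // and it works for the string column too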