Suppose I have a Spark DataFrame generated as:
val df = Seq(
(Array(1, 2, 3), Array("a", "b", "c")),
(Array(1, 2, 3), Array("a", "b", "c"))
).toDF("Col1", "Col2")
It's possible to extract the element at a given index (here index 1, zero-based) in "Col1" with something like:
val extractFirstInt = udf { (x: Seq[Int], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirstInt($"Col1", lit(1)))
And similarly for the second column "Col2" with e.g.
val extractFirstString = udf { (x: Seq[String], i: Int) => x(i) }
df.withColumn("Col2_1", extractFirstString($"Col2", lit(1)))
But the code duplication is a little ugly -- I need a separate UDF for each underlying element type.
Is there a way to write a generic UDF that automatically infers the element type of the underlying array in the DataFrame column? E.g. I'd like to be able to write something like this (pseudocode, with generic T):
val extractFirst = udf { (x: Seq[T], i: Int) => x(i) }
df.withColumn("Col1_1", extractFirst($"Col1", lit(1)))
Where somehow the type T would just be automagically inferred by Spark / the Scala compiler (perhaps using reflection if appropriate).
Bonus points if you're aware of a solution that works both with array-columns and Spark's own DenseVector / SparseVector types. The main thing I'd like to avoid (if at all possible) is the requirement of defining a separate UDF for each underlying array-element type I want to handle.
getItem and apply ... ArrayType object and, if so, use .getItem() and a UDF (for vectors) otherwise.
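The idea of checking the column type and choosing between getItem and a UDF could be sketched roughly as below. This is only a minimal sketch under some assumptions: the names ExtractAt, extractAt and vectorAt are hypothetical helpers, the vector column "Col3" is added purely for illustration, and the vectors are assumed to be the org.apache.spark.ml.linalg types (the older mllib vectors would need their own UDT check). Note that getItem needs no UDF at all for array columns, whatever the element type.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types.ArrayType

object ExtractAt {
  // UDF fallback for ML vectors (dense or sparse); the result is always a Double.
  private val vectorAt = udf { (v: Vector, i: Int) => v(i) }

  // Hypothetical helper: choose the extraction strategy from the column's declared type.
  def extractAt(df: DataFrame, name: String, i: Int): Column =
    df.schema(name).dataType match {
      // Built-in accessor; works for arrays of any element type.
      case _: ArrayType => col(name).getItem(i)
      // ML vector columns are stored as a user-defined type, so fall back to the UDF.
      case t if t == VectorType => vectorAt(col(name), lit(i))
      case other =>
        throw new IllegalArgumentException(s"Unsupported type for $name: $other")
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-at")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the question, plus an illustrative vector column "Col3".
    val df = Seq(
      (Array(1, 2, 3), Array("a", "b", "c"), Vectors.dense(1.0, 2.0, 3.0)),
      (Array(1, 2, 3), Array("a", "b", "c"), Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
    ).toDF("Col1", "Col2", "Col3")

    // Add one extracted column per source column, all through the same helper.
    val result = Seq("Col1", "Col2", "Col3").foldLeft(df) { (acc, name) =>
      acc.withColumn(s"${name}_1", extractAt(acc, name, 1))
    }
    result.show(false)

    spark.stop()
  }
}
```

The dispatch happens once on the driver, when the query is built, so no reflection is needed inside the UDF. Only the vector branch pays the cost of a UDF; the array branch stays a native Spark expression.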