I am trying to split a string (more precisely, the strings in one column of a DataFrame) and return the pieces to the DataFrame as a list. I am on Scala 2.11. I would prefer a Scala or PySpark solution that uses UDFs, because a lot of other processing happens inside the UDF.
Let us say that I have a dataframe:
val df = List(("123", "a*b*c*d*e*f*x*y*z"), ("124", "g*h*i*j*k*l*m*n*o")).toDF("A", "B")
The result I want (produced by a UDF) is:
A    B
123  ((a, b, c),
      (d, e, f),
      (x, y, z))
124  ((g, h, i),
      (j, k, l),
      (m, n, o))
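To make the grouping concrete, here is a minimal plain-Scala sketch of the split I have in mind, outside Spark. The name splitInThrees is only a placeholder for illustration, and the sketch assumes the tokens should be grouped in consecutive runs of three. It does not address the schema problem described below.

```scala
object SplitSketch {
  // Hypothetical helper (not part of my real UDF): split on a literal '*'
  // and group the resulting tokens three at a time.
  def splitInThrees(s: String): List[List[String]] =
    s.split("\\*").grouped(3).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    println(splitInThrees("a*b*c*d*e*f*x*y*z"))
    // prints: List(List(a, b, c), List(d, e, f), List(x, y, z))
  }
}
```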
I wrote a UDF to split the string and return lists, but I do not know how to define or pass a schema so that the results come back into the DataFrame as three columns.
def testUdf = udf( (s: String) => {
  val a = s.split("\\*").take(3).toList
  val b = s.split("\\*").drop(3).take(3).toList
  val c = s.split("\\*").drop(6).take(3).toList
  val abc = (a, b, c).zipped.toList.asInstanceOf[List[String]]
  // println (abc) // This does not work
} )
val df2 = df.select($"A", testUdf($"B").as("B")) // does not work because of type mismatch.
I tried defining a schema like this, but I do not know how to pass it to the UDF above:
val schema = StructType(List(
  StructField("C1", StringType),
  StructField("C2", StringType),
  StructField("C3", StringType)
))
After this, I hope to follow the procedure outlined in "Explode multiple columns in Spark SQL table" to explode the DataFrame.
Help would be greatly appreciated.