
I want to transform a column so that the new column contains only one chunk (partition) of the original column. I defined the following UDF:

def extract (index : Integer) = udf((v: Seq[Double]) => v.grouped(16).toSeq(index))

I later use it in a loop like this:

myDF = myDF.withColumn("measurement_"+i,extract(i)($"vector"))

The original vector column was created with:

var vectors :Seq[Seq[Double]] = myVectors
vectors.toDF("vector")

But in the end I get the following error:

Failed to execute user defined function(anonfun$user$sparkapp$MyClass$$extract$2$1: (array<double>) => array<double>)

Have I defined the udf incorrectly?

1 Answer


I can reproduce the error when I try to extract a group that doesn't exist, i.e. when the index is greater than or equal to the number of groups:

val myDF = Seq(Seq(1.0, 2.0, 3.0, 4.0), Seq(4.0, 3.0, 2.0, 1.0)).toDF("vector")
// myDF: org.apache.spark.sql.DataFrame = [vector: array<double>]

def extract (index : Integer) = udf((v: Seq[Double]) => v.grouped(2).toSeq(index))
// extract: (index: Integer)org.apache.spark.sql.expressions.UserDefinedFunction

val i = 2

myDF.withColumn("measurement_"+i,extract(i)($"vector")).show

Gives this error:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$extract$1: (array<double>) => array<double>)
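
The Spark exception wraps an ordinary collection error raised inside the UDF. The following is a minimal sketch in plain Scala (no Spark needed) of what happens for a single row; the object and value names are illustrative and not part of the original code:

```scala
import scala.util.Try

object GroupedIndexDemo {
  // One row of the example DataFrame above.
  val row: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0)

  // grouped(2) splits four elements into two chunks,
  // so the only valid indices are 0 and 1.
  val chunks: Seq[Seq[Double]] = row.grouped(2).toSeq

  // Indexing past the last chunk throws; Try captures the exception
  // so it can be inspected instead of aborting the program.
  val outOfRange: Try[Seq[Double]] = Try(chunks(2))

  def main(args: Array[String]): Unit = {
    println(s"number of chunks: ${chunks.length}")
    println(s"index 2 failed: ${outOfRange.isFailure}")
  }
}
```

With grouped(16), as in the question, the same failure would occur for any index at or beyond the number of 16-element chunks in the vector.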

Most likely you have the same problem with toSeq(index). Try toSeq.lift(index) instead, which returns None if the index is out of bounds:

def extract (index : Integer) = udf((v: Seq[Double]) => v.grouped(2).toSeq.lift(index))
// extract: (index: Integer)org.apache.spark.sql.expressions.UserDefinedFunction

Normal index:

val i = 1    
myDF.withColumn("measurement_"+i,extract(i)($"vector")).show
+--------------------+-------------+
|              vector|measurement_1|
+--------------------+-------------+
|[1.0, 2.0, 3.0, 4.0]|   [3.0, 4.0]|
|[4.0, 3.0, 2.0, 1.0]|   [2.0, 1.0]|
+--------------------+-------------+

Index out of bounds:

val i = 2
myDF.withColumn("measurement_"+i,extract(i)($"vector")).show
+--------------------+-------------+
|              vector|measurement_2|
+--------------------+-------------+
|[1.0, 2.0, 3.0, 4.0]|         null|
|[4.0, 3.0, 2.0, 1.0]|         null|
+--------------------+-------------+
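
The behaviour of lift can also be checked outside Spark. This is a rough sketch in which chunkAt is a hypothetical helper that mirrors the body of the UDF's lambda; it is not a Spark API:

```scala
object LiftDemo {
  // Hypothetical helper: same logic as the lambda passed to udf,
  // with the chunk size made a parameter for illustration.
  def chunkAt(v: Seq[Double], size: Int, index: Int): Option[Seq[Double]] =
    v.grouped(size).toSeq.lift(index)

  def main(args: Array[String]): Unit = {
    val row = Seq(1.0, 2.0, 3.0, 4.0)
    println(chunkAt(row, 2, 1)) // defined: the second chunk exists
    println(chunkAt(row, 2, 2)) // not defined: there is no third chunk
  }
}
```

As the tables above show, when the UDF returns an Option, Spark stores a None result as null rather than raising an exception.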

1 Comment

Thanks a lot, the debugging of this error has cost me a lot of time. +1 for your detailed answer!
