I have a DataFrame with two ArrayType columns, and I want to find the difference between them. column1 will always have values, while column2 may be an empty array. I created the following UDF, but it is not working.

df.show() gives the following records:

SampleData:

["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]

Code:

sc.udf.register("diff", (value: Column,value1: Column)=>{ 
                        value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])          
                    })  

Expected output:

["Test2","Test3"]

Spark version: 1.4.1. Any help will be appreciated.

  • What was the result? Commented Dec 15, 2016 at 9:12
  • It gives all the values of value. Commented Dec 15, 2016 at 9:13
  • Can you paste sample data, please? Ideally it should work. Commented Dec 15, 2016 at 9:15
  • I hope you have used collection.SeqLike.diff Commented Dec 15, 2016 at 9:18
  • Please share example data and expected output. Commented Dec 15, 2016 at 9:19

2 Answers

1

You need to change your udf to:

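import org.apache.spark.sql.functions.udf

// The function receives the array values as Seq[String], not as Column objects
val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }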

Then this works:

import org.apache.spark.sql.functions.col
df.withColumn("diff",
  diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
|                col1|             col2|              diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+

Data

val df = sc.parallelize(Seq((List("Test", "Test1","Test3", "Test2"), 
                             List("Test", "Test1")))).toDF("col1", "col2")
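
As a side note, here is a minimal sketch of a null-safe variant of the same difference, written as a plain Scala function so it can be tried without Spark. It assumes a column might hold null rather than an empty array, which the question does not state. `ArrayDiff` and `safeDiff` are hypothetical names used only for illustration.

```scala
// Hypothetical helper, not part of the original answer.
object ArrayDiff {
  // Option(null) is None, so a null input is treated as an empty sequence.
  def safeDiff(a: Seq[String], b: Seq[String]): Seq[String] = {
    val left  = Option(a).getOrElse(Seq.empty[String])
    val right = Option(b).getOrElse(Seq.empty[String])
    left.diff(right)
  }
}
```

With Spark on the classpath, it could presumably be wrapped the same way as above, e.g. `udf(ArrayDiff.safeDiff _)`.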

1

column1 will always have values while column2 may have empty array.

Your comment: "it gives all values of value" – undefined_variable

Example 1:

Let's look at a small example:

   val A = Seq(1,1)

 A: Seq[Int] = List(1, 1)

 val B = Seq.empty

 B: Seq[Nothing] = List()
    
A diff B

 res0: Seq[Int] = List(1, 1)

If you use collection.SeqLike.diff, you get back all of A, as the example shows. In Scala this is a perfectly valid case: as you said, the first column always has values, and removing the elements of an empty sequence from it changes nothing.

The reverse case looks like this:

 B diff A

 res1: Seq[Nothing] = List()

If you do the same thing in a Spark UDF, you will get the same results.

EDIT: (the case where neither array is empty, as in your modified example)

Example 2:

 val p = Seq("Test", "Test1","Test3", "Test2")

 p: Seq[String] = List(Test, Test1, Test3, Test2)

 val q = Seq("Test", "Test1")

 q: Seq[String] = List(Test, Test1)

 p diff q

 res2: Seq[String] = List(Test3, Test2)

This is your expected output, and it matches what you gave in your example.

Reverse case: I think this is what you are actually getting, which is not what you expect.

q diff p

 res3: Seq[String] = List()
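
One more detail that may matter if the arrays contain duplicates: `diff` is a multiset difference, so it removes one occurrence per matching element instead of every occurrence. The sketch below contrasts it with a set-style removal using `filterNot`; `DiffSemantics` and the value names are hypothetical and only for illustration.

```scala
// Hypothetical illustration, not part of the original answer.
object DiffSemantics {
  val a = Seq(1, 1, 2)
  val b = Seq(1)

  // Multiset difference: only one of the two 1s is removed.
  val multisetDiff: Seq[Int] = a.diff(b)                         // List(1, 2)

  // Set-style removal: every element that occurs in b is dropped.
  val setStyle: Seq[Int] = a.filterNot(x => b.contains(x))       // List(2)
}
```

Which of the two you want depends on whether repeated values in column1 should survive the comparison.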
