I have a DataFrame with two columns, words and numbers: words holds an array of Strings and numbers holds an array of Integers.

For example:

words: ["hello","there","Everyone"] and numbers: [0,4,5]

I would like to get the words whose corresponding integer in numbers is not 0. In the example above, only "there" and "Everyone" should be returned.

I am still a beginner in Scala and Spark. I tried filter, but how can I get inside the array, and how can I return the words?

I tried something like df.filter(col("numbers") != 0)

1 Answer

You could simply define the following UDF:

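import org.apache.spark.sql.functions.udf

// Pair each word with its number, keep the pairs whose number is non-zero, return the words.
val myUDF = udf { (a: Seq[String], b: Seq[Int]) =>
  a.zip(b).filter(_._2 != 0).map(_._1)
}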

It zips both arrays together and filters the resulting pairs on the integer values.
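To see what the body of the UDF does without Spark, here is a minimal plain-Scala sketch of the same zip/filter/map chain. The names ZipFilterSketch and keepNonZero are made up for illustration; the lambdas are written out in full so that the `_` shorthand used in the UDF is easier to read.

```scala
object ZipFilterSketch {
  // Hypothetical helper: same logic as the UDF body, on ordinary Scala collections.
  def keepNonZero(words: Seq[String], numbers: Seq[Int]): Seq[String] =
    words
      .zip(numbers)                 // pairs elements by position: ("hello", 0), ("there", 4), ("Everyone", 5)
      .filter(pair => pair._2 != 0) // long form of .filter(_._2 != 0): keep pairs whose number is non-zero
      .map(pair => pair._1)         // long form of .map(_._1): keep only the word of each pair

  def main(args: Array[String]): Unit = {
    // prints List(there, Everyone)
    println(keepNonZero(Seq("hello", "there", "Everyone"), Seq(0, 4, 5)))
  }
}
```

Note that zip stops at the end of the shorter collection, so if the two arrays differ in length the extra elements are silently dropped.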

df.select(myUDF($"words", $"numbers").as("words")).show

This returns the matching words in an array:

+-----------------+
|            words|
+-----------------+
|[there, Everyone]|
+-----------------+

If you want each word on a separate row, you can use explode:

df.select(explode(myUDF($"words", $"numbers")).as("words")).show

Which results in

+--------+
|   words|
+--------+
|   there|
|Everyone|
+--------+

2 Comments

Can you explain what the line "a.zip(b).filter(_._2 != 0).map(_._1)" really does? How does the "_" work, and what is zip to start with?
Here is a good reference to start with: twitter.github.io/scala_school/collections.html
