How to extract specific values from an array of strings in a dataframe column?

Question

I have a dataframe with two columns, words and numbers: words holds an array of Strings and numbers holds an array of Integers.

For example:

words: ["hello","there","Everyone"] and numbers: [0,4,5]

I would like to get the words whose corresponding integer in numbers is not 0. In the example above, only "there" and "Everyone" should be returned.

I am still a beginner in Scala and Spark. I tried filter, but how can I look inside the arrays, and how can I return the words? My attempt looked like this:

df.filter(col("numbers") != 0)

cheseaux · Accepted Answer · 2018-12-20 22:14:37Z

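You can define the following UDF:

import org.apache.spark.sql.functions.udf

// Pair each word with its number, keep pairs whose number is non-zero, return the words
val myUDF = udf { (a: Seq[String], b: Seq[Int]) =>
  a.zip(b).filter(_._2 != 0).map(_._1)
}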

It zips the two arrays together into (word, number) pairs and filters the pairs on the integer values.
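To see what the body of the UDF does without Spark, here is a minimal sketch on plain Scala collections. The object name `ZipFilterSketch` and the helper `keepNonZero` are made up for illustration; the pattern-matching form is just a more explicit spelling of the `_._2` / `_._1` shorthand, where `_` stands for the current (word, number) pair.

```scala
object ZipFilterSketch {
  // Keep each word whose paired number is non-zero.
  def keepNonZero(words: Seq[String], numbers: Seq[Int]): Seq[String] =
    words
      .zip(numbers)                      // pairs: ("hello", 0), ("there", 4), ("Everyone", 5)
      .filter { case (_, n) => n != 0 }  // same as .filter(_._2 != 0)
      .map { case (w, _) => w }          // same as .map(_._1)

  def main(args: Array[String]): Unit = {
    val words = Seq("hello", "there", "Everyone")
    val numbers = Seq(0, 4, 5)
    println(keepNonZero(words, numbers)) // List(there, Everyone)
  }
}
```

Note that `zip` stops at the end of the shorter sequence, so if the two arrays differ in length the extra elements are silently ignored.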

// The $"..." column syntax needs import spark.implicits._ (spark being your SparkSession)
df.select(myUDF($"words", $"numbers").as("words")).show

This returns the matching words as an array:

+-----------------+
|            words|
+-----------------+
|[there, Everyone]|
+-----------------+

If you want each word on a separate row, you can use explode:

df.select(explode(myUDF($"words", $"numbers")).as("words")).show

which results in:

+--------+
|   words|
+--------+
|   there|
|Everyone|
+--------+
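As a rough analogy only, explode over the filtered array behaves like a flatMap over the rows in plain Scala: each surviving word becomes its own output element. The sketch below is not Spark code; `Record`, `ExplodeSketch` and `explodeNonZero` are hypothetical names standing in for a dataframe row and the select above.

```scala
object ExplodeSketch {
  // Hypothetical stand-in for one dataframe row with the two array columns.
  final case class Record(words: Seq[String], numbers: Seq[Int])

  // Filter each row's words by the paired number, then flatten to one word per element.
  def explodeNonZero(rows: Seq[Record]): Seq[String] =
    rows.flatMap { r =>
      r.words.zip(r.numbers).collect { case (w, n) if n != 0 => w }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Record(Seq("hello", "there", "Everyone"), Seq(0, 4, 5)))
    explodeNonZero(rows).foreach(println) // prints "there" then "Everyone"
  }
}
```

A row whose numbers are all 0 contributes nothing here, which matches how explode produces no output rows for an empty array.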


2 Comments

thirdage Over a year ago

Can you explain what the line "a.zip(b).filter(_._2 != 0).map(_._1)" really does? How does the "_" work, and what is zip to start with?

cheseaux Over a year ago

Here is a good reference to start with : twitter.github.io/scala_school/collections.html