I'm a beginner with Scala.

I've got a dataframe with two columns: the first is a date, the second an array of words.

created_at:string
words:array
    element:string

I wish to keep only the words beginning with a '#'.

I would prefer to apply the filter before exploding the array, as most words do not start with a '#'.

I didn't find a way to modify an array column and apply something like a filter(_.startsWith("#")).

Is it possible? And how?

Thanks!

Pierre


2 Answers


You can create a simple UDF to keep only the words starting with '#' in your array column:

import org.apache.spark.sql.functions.udf

val df = Seq(
  ("2018-05-01", Seq("a", "#b", "c")),
  ("2018-05-02", Seq("#d", "#e", "f"))
).toDF("created_at", "words")

val filterArray = udf { (s: Seq[String]) =>
  s.filter(_.startsWith("#"))
}

df.select($"created_at", filterArray($"words")).show
// +----------+----------+
// |created_at|UDF(words)|
// +----------+----------+
// |2018-05-01|      [#b]|
// |2018-05-02|  [#d, #e]|
// +----------+----------+
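A side note (not from the original answer): if you are on Spark 2.4 or later, the built-in `filter` higher-order function can do this without a UDF, which avoids the serialization overhead of user-defined functions. A minimal sketch, assuming the same `df` as above:

```scala
import org.apache.spark.sql.functions.expr

// Spark 2.4+: the SQL higher-order function `filter` evaluates the
// predicate natively, keeping only the words that start with '#'
df.select(
  $"created_at",
  expr("filter(words, w -> w LIKE '#%')").as("words")
).show
```

In Spark 3.0+ there is also a Scala-side overload, `org.apache.spark.sql.functions.filter(column, function)`, so the lambda can be written directly in Scala.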

1 Comment

Exactly what I wanted. Thanks!

Try this one:

import org.apache.spark.sql.functions._

df.select(explode(col("words")).as("word"), col("created_at"))
  .where("word LIKE '#%'")
  .groupBy(col("created_at"))
  .agg(collect_set(col("word")).as("words"))
  .show

1 Comment

It should work, but it explodes the array first, which I wanted to avoid. Still a good code example, thanks.
