I'm a beginner with Scala.

I've got a dataframe with two columns: the first is a date, the second an array of words.

created_at:string
words:array
    element:string

I wish to keep only the words beginning with a '#'.

I would prefer to apply the filter before exploding the array, as most words do not start with a '#'.

I didn't find a way to modify an array column and apply something like a filter(_.startsWith("#")).

Is it possible? And how?

Thanks!

Pierre


2 Answers


You can create a simple UDF to keep only the words starting with '#' in your array column:

import org.apache.spark.sql.functions.udf

val df = Seq(
  ("2018-05-01", Seq("a", "#b", "c")),
  ("2018-05-02", Seq("#d", "#e", "f"))
).toDF("created_at", "words")

val filterArray = udf { (s: Seq[String]) =>
  s.filter(_.startsWith("#"))
}

df.select($"created_at", filterArray($"words")).show
// +----------+----------+
// |created_at|UDF(words)|
// +----------+----------+
// |2018-05-01|      [#b]|
// |2018-05-02|  [#d, #e]|
// +----------+----------+
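A side note (not from the original answer): if you are on Spark 2.4 or later, the built-in `filter` higher-order function can do this without a UDF, which avoids the serialization overhead of user-defined functions. A minimal sketch, assuming the same `df` as above:

```scala
import org.apache.spark.sql.functions.expr

// Spark 2.4+: the SQL higher-order function `filter` evaluates the
// predicate natively, keeping only the words that start with '#'
df.select(
  $"created_at",
  expr("filter(words, w -> w LIKE '#%')").as("words")
).show
```

In Spark 3.0+ there is also a Scala-side overload, `org.apache.spark.sql.functions.filter(column, function)`, so the lambda can be written directly in Scala.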

1 Comment

Exactly what I wanted. Thanks!

Try this one:

import org.apache.spark.sql.functions._

df.select(explode(col("words")).as("word"), col("created_at"))
  .where("word LIKE '#%'")
  .groupBy(col("created_at"))
  .agg(collect_set(col("word")).as("words"))
  .show

1 Comment

It should work, but it explodes the array first, which I wanted to avoid. Still a good code example, thanks.
