1

I want to filter a dataframe which has a column with categories (List[String]). I want to ignore all the rows that have a non valid category. They are not valid when they are not in model.getCategories

def checkIncomingData(model: Model, incomingData: DataFrame) : DataFrame = {
  val list = model.getCategories.toList
  sc.broadcast(list)
  incomingData.filter(incomingData("categories").isin(list))
}

Unfortunately my approach does not work because categories is a list, not a single element. Any idea who to make it work?

2 Answers 2

3

The first problem I see is that you didn't assign the broadcast to a variable.

val broadcastList = sc.broadcast(list)

Besides you have to reference it using broadcastList.value. For instance:

incomingData.filter($"categories".isin(broadcastList.value: _*))

NOTE @LostInOverflow made an important contribution, he clarified my answer and said that the method isin is actually evaluated in the driver, so broadcasting the list doesn't help at all, and more important the list shall be expanded in order to be evaluated.

Sign up to request clarification or add additional context in comments.

6 Comments

broadcast has no effect here.
Excuse me what? I tried to keep my answer in the dame context of the question
Think about the order of evaluation. Argument of isin is evaluated eagerly on the driver. It is not different than lncomingData.filter($"categories".isin(list))
Are you totally sure? Because I think this might be parallelized, if each node has the list (broadcasted), it could help to boost the performance. Maybe I'm wrong
I am sure. I upvoted because it is useful note but it doesn't resolve the problem. Please check my answer.
|
1

Just expand list:

incomingData.filter(incomingData("categories").isin(list: _*))

Note: broadcasting won't help you here. This is evaluated on driver.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.