15
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)

In the above code snippet, the tokenize function may return empty strings. How do I skip adding them to the result of the map in that case, or remove the empty entries after mapping?

4 Answers

29

tokenizedLines.filter(_.nonEmpty)


5 Comments

Does this filter the result or remove the entries? I am keen on removing them.
The filter returns a new collection with no empty strings.
When I print the tokenized lines after filtering, there are still empty strings in the array buffer. Am I missing something?
As the Scaladoc says, RDDs are immutable, so you cannot modify them in place. You should also avoid mutable data structures in Scala for as long as possible. So you may write val tokenizedLines = lines.map(Tokenizer.tokenize).filter(_.nonEmpty).
You cannot remove anything from an RDD; it is immutable. The solution is correct: you can collect the filtered data and use it wherever you need the mapped result without empty strings.
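One possible reason for the empty strings that remain "in the array buffer" is that the tokenizer returns a collection of tokens per line rather than a single string. That is an assumption, not something the question confirms. In that case `filter(_.nonEmpty)` on the outer collection only drops empty token lists, and the empty strings inside each list need an inner filter. A minimal sketch, where `tokenize` is a hypothetical stand-in for `Tokenizer.tokenize` and a plain `List` stands in for the RDD:

```scala
object FilterSketch {
  // Hypothetical stand-in for Tokenizer.tokenize: splits on single spaces,
  // so consecutive spaces produce empty tokens.
  def tokenize(line: String): Seq[String] = line.split(" ", -1).toSeq

  // A plain List stands in for the RDD; map and filter are used the same way.
  val lines = List("a  b", "", "c")

  val tokenized = lines.map(tokenize)

  // Drop empty tokens inside each line, then drop lines left with no tokens.
  val cleaned = tokenized.map(_.filter(_.nonEmpty)).filter(_.nonEmpty)

  // filter built a new collection; `tokenized` itself is unchanged.
}
```

Whichever shape the tokenizer returns, the filtered value has to be bound to a name and used; printing the original `tokenizedLines` will still show the unfiltered data.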
11

The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String; it is added through an implicit conversion. With value classes in use, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.

Instead, one could use this, which is assured to be faster:

tokenizedLines.filter(!_.isEmpty)  // RDD offers filter but no filterNot
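To check that the two predicates select the same elements, here is a small sketch with made-up sample tokens, using a `List` in place of the RDD. As far as I know, `RDD` exposes `filter` but not `filterNot`, so the negated form `filter(!_.isEmpty)` is the one that carries over to Spark.

```scala
object PredicateSketch {
  // Made-up sample tokens; a List stands in for the RDD.
  val tokens = List("spark", "", "rdd", "")

  // nonEmpty is provided by StringOps through an implicit conversion.
  val viaNonEmpty = tokens.filter(_.nonEmpty)

  // isEmpty is defined directly on java.lang.String, so no conversion is needed.
  val viaIsEmpty = tokens.filter(!_.isEmpty)
}
```

Both values hold the same elements, so the choice between them is about cost and readability rather than behaviour.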


1

You could use flatMap with Option.

Something like this:

lines.flatMap{
     case "" => None 
     case s => Some(s)
}

1 Comment

Or lines.flatMap{case "" => Nil case s => Seq(s)}, perhaps?
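The two flatMap variants above (returning `Option` or returning `Nil`/`Seq`) can be compared side by side. This is only a sketch on a plain `List` with made-up tokens, not on an RDD, and it assumes each element is a single string:

```scala
object FlatMapSketch {
  // Made-up sample tokens; a List stands in for the RDD.
  val tokens = List("a", "", "b")

  // Variant from the answer: None drops the element, Some keeps it.
  val viaOption = tokens.flatMap {
    case "" => None
    case s  => Some(s)
  }

  // Variant from the comment: an empty sequence drops the element.
  val viaSeq = tokens.flatMap {
    case "" => Nil
    case s  => Seq(s)
  }
}
```

Both produce the same result as a plain `filter`, so flatMap mainly pays off when dropping and transforming need to happen in one pass.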
0
val tokenizedLines = lines.map(Tokenizer.tokenize).filter(_.nonEmpty)

