1

I have an RDD called name.

scala> name
res6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:37

I can inspect it using name.foreach(println):

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

I wish to create a new RDD that strips the leading "name" characters from each record and returns the remaining digits as a Long.

Desired outcome:

5000005125651330
5000005125651331
5000005125651332
5000005125651333

I have tried the following:

val name_clean = name.filter(_ != "name")

However, this returns:

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333
2
  • "However this returns": well, of course, since no line is equal to "name". Something like name.map(_.drop(4).toLong) should do it (that just drops the first four characters unconditionally; it doesn't check that they're n a m e). Commented Aug 16, 2016 at 10:04
  • Thanks Paul. I didn't realise that. Worked! Feel free to post as an answer Commented Aug 16, 2016 at 10:07

2 Answers

4

Each entry in the RDD is a string made of "name" plus some digits, so no entry is ever equal to "name". The predicate _ != "name" is therefore true for every entry, and filter keeps them all.

What you need is map, which applies a function to every entry of the RDD and returns a new RDD of the results. Here, the new value should be the string without its first 4 characters, converted to Long.

Putting that all together, we get:

name.map(_.drop(4).toLong)
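
To see why filter left the data unchanged while map transforms it, here is a minimal sketch that can be pasted into the REPL. It assumes a local Seq standing in for the RDD, so that it runs without a Spark context; filter and map on an RDD accept the same functions. The sample values are taken from the question.

```scala
// Assumption: a local Seq stands in for the RDD so no SparkContext is needed.
val sample = Seq("name5000005125651330", "name5000005125651331")

// filter keeps every element for which the predicate is true.
// No element is exactly "name", so nothing is removed.
val filtered = sample.filter(_ != "name")

// map builds a new element from each one: drop 4 characters, then parse as Long.
val cleaned: Seq[Long] = sample.map(_.drop(4).toLong)

println(filtered) // List(name5000005125651330, name5000005125651331)
println(cleaned)  // List(5000005125651330, 5000005125651331)
```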

If you aren't sure that the first four characters will always be "name", you might want to check that first. What you need then depends on what you want to do with rows that don't start with "name", but something like this would simply skip them:

name.filter(_.startsWith("name")).map(_.drop(4).toLong)
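
As a rough illustration of how the guarded version treats rows without the prefix, the sketch below uses a local Seq in place of the RDD (an assumption, so that it runs without Spark) and adds a made-up row, "other42", to the question's data. Note that a row such as "nameabc" would pass the filter and still make toLong throw.

```scala
// Assumption: local Seq instead of an RDD; "other42" is a hypothetical bad row.
val rows = Seq("name5000005125651330", "other42", "name5000005125651331")

// Rows that do not start with "name" are dropped before parsing.
val ids: Seq[Long] = rows.filter(_.startsWith("name")).map(_.drop(4).toLong)

println(ids) // List(5000005125651330, 5000005125651331)
```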

3

The method stripPrefix removes a given prefix from a string (and does nothing if the string does not start with that prefix).

So you can achieve what you need with:

val name_clean = name.map(_.stripPrefix("name").toLong)
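
The behaviour of stripPrefix on different kinds of input is easy to check on plain strings. A small sketch follows; the inputs without the "name" prefix are hypothetical examples, not data from the question.

```scala
import scala.util.Try

// Prefix present: it is removed and the rest parses.
val withPrefix = "name5000005125651330".stripPrefix("name").toLong

// No prefix, digits only (hypothetical input): stripPrefix is a no-op
// and the value parses without any error.
val digitsOnly = "5000005125651330".stripPrefix("name").toLong

// Different prefix (hypothetical input): the string is left unchanged,
// so toLong throws NumberFormatException, captured here in a Try.
val wrongPrefix = Try("id5000005125651330".stripPrefix("name").toLong)

println(withPrefix)            // 5000005125651330
println(digitsOnly)            // 5000005125651330
println(wrongPrefix.isFailure) // true
```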

3 Comments

Only the OP knows for sure, but it seems unlikely that if the first four characters are not "name", then they will be digits. So if the file only contains lines starting "name", this works (but you might as well just drop four characters). If some lines do not start "name", this will probably throw an error.
True. But depending on the context, one might prefer a runtime error to silently ignoring bad entries. If silent ignoring is wanted, then we can insert .filter(_.startsWith("name")) just like in your answer.
Sorry. No. Your code will only sometimes throw an exception, depending on whether the erroneous line contains only digits or not. Code that may or may not throw on bad input is not good.
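
One way to make the handling of bad input predictable, sketched here as an assumption rather than something proposed in the thread, is to accept a row only if it is "name" followed by digits and to return an Option. The names NameId and parseId are hypothetical; on an RDD the same function could be passed to flatMap to skip bad rows, or to map to keep the None values visible.

```scala
import scala.util.Try

// Matches the whole string: the literal "name" followed by one or more digits.
val NameId = "name(\\d+)".r

// Some(id) for a well-formed row, None otherwise
// (including digit runs too large to fit in a Long).
def parseId(row: String): Option[Long] = row match {
  case NameId(digits) => Try(digits.toLong).toOption
  case _              => None
}

// Hypothetical mix of good and bad rows.
val mixed = Seq("name5000005125651330", "5000005125651331", "nameabc", "id42")
val parsed = mixed.flatMap(row => parseId(row))

println(parsed) // List(5000005125651330)
```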
