Scala RDD String manipulation

Question

I have an RDD called name.

scala> name
res6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:37

I can inspect it using name.foreach(println)

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

I wish to create a new RDD that strips the leading name characters from each record and returns the remaining digits as a Long.

Desired outcome:

5000005125651330
5000005125651331
5000005125651332
5000005125651333

I have tried the following:

val name_clean = name.filter(_ != "name")

However this returns:

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

"However this returns": well, of course, since no line is equal to "name". Something like name.map(_.drop(4).toLong) should do it (that just drops the first four characters unconditionally; it doesn't check that they are n, a, m, e).
– The Archetypal Paul, Commented Aug 16, 2016 at 10:04
Thanks Paul. I didn't realise that. Worked! Feel free to post as an answer.
– LearningSlowly, Commented Aug 16, 2016 at 10:07

The Archetypal Paul · Accepted Answer · 2016-08-16 10:28:14Z

4

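Each entry in the RDD is a whole string, "name" followed by some digits, so it is never equal to "name". The predicate _ != "name" is therefore true for every entry and the filter keeps them all. Note also that filter only selects entries; it never changes their contents.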
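To make that concrete, here is a minimal sketch that uses a plain Scala Seq in place of the RDD, an assumption made only so that it runs without Spark (RDD.filter accepts the same predicate). The name records is hypothetical.

```scala
// Plain Seq standing in for the RDD[String]; `records` is a hypothetical name.
val records = Seq("name5000005125651330", "name5000005125651331")

// The predicate compares the WHOLE string to "name", so it is true for every record.
val kept = records.filter(_ != "name")

// filter only selects elements, so the result is identical to the input.
println(kept == records) // prints true
```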

What you need is map, which applies a function to every entry and returns a new RDD of the results. Here the function should take the string, drop its first 4 characters, and convert what is left to a Long.

Putting that all together, we get:

name.map(_.drop(4).toLong)
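As a rough illustration of what that one-liner does to each record, the sketch below runs the same function over a plain Seq standing in for the RDD (an assumption so that it runs in an ordinary Scala REPL; records and ids are hypothetical names).

```scala
// Plain Seq standing in for the RDD[String].
val records = Seq("name5000005125651330", "name5000005125651331")

// drop(4) removes the first four characters; toLong parses the rest.
// These values are too large for Int, so toLong rather than toInt is required.
val ids: Seq[Long] = records.map(_.drop(4).toLong)

ids.foreach(println)
```

On a real RDD the call has the same shape and yields an RDD[Long].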

If you aren't sure the first four characters will always be "name", you might want to check that first. What you need then depends on how you want to handle rows that don't start with "name", but something like the following simply skips them:

name.filter(_.startsWith("name")).map(_.drop(4).toLong)
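A small sketch of how this variant treats a row with a different prefix, again on a plain Seq rather than an RDD (an assumption so it runs without Spark). The sample row "id42" and the names records, ids and rejected are made up for illustration. A row such as "nameabc" would pass the filter and still fail in toLong, so this only guards against a wrong prefix.

```scala
// Plain Seq standing in for the RDD[String]; "id42" is a made-up bad row.
val records = Seq("name5000005125651330", "id42", "name5000005125651331")

// Keep only rows that start with "name", then strip the prefix and parse.
val ids: Seq[Long] = records.filter(_.startsWith("name")).map(_.drop(4).toLong)

// The rows that were skipped, in case they should be logged or inspected.
val rejected: Seq[String] = records.filterNot(_.startsWith("name"))

println(ids)
println(rejected)
```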

answered Aug 16, 2016 at 10:28


Dominique Unruh · Accepted Answer · 2016-08-16 10:52:55Z

3

The method stripPrefix removes a given prefix from a string (and does nothing if the string does not start with that prefix).

So you can achieve what you need with:

val name_clean = name.map(_.stripPrefix("name").toLong)
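For reference, a short sketch of how stripPrefix behaves on a single string, both with and without the prefix present (the val names are hypothetical):

```scala
// Prefix present: it is removed.
val stripped = "name5000005125651330".stripPrefix("name")

// Prefix absent: the string is returned unchanged.
val untouched = "5000005125651330".stripPrefix("name")

// Either result parses to the same Long.
val id: Long = stripped.toLong

println(stripped)
println(untouched)
```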

answered Aug 16, 2016 at 10:52


3 Comments

The Archetypal Paul Over a year ago

Only the OP knows for sure, but it seems unlikely that lines whose first four characters are not "name" will consist purely of digits. So if the file only contains lines starting with "name", this works (but you might as well just drop four characters). If some lines do not start with "name", this will probably throw an error.

Dominique Unruh Over a year ago

True. But depending on the context, one might prefer a runtime error to silently ignoring bad entries. If silent ignoring is wanted, then we can insert .filter(_.startsWith("name")) just like in your answer.

The Archetypal Paul Over a year ago

Sorry, no. Your code may or may not throw an exception, depending on whether the erroneous line consists only of digits. Code that may or may not throw on bad input is not good.
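The point debated in these comments can be checked with a small sketch. It wraps the stripPrefix approach in scala.util.Try so the outcome for each input is visible without crashing; the helper parse and the sample inputs are hypothetical.

```scala
import scala.util.Try

// Hypothetical helper: the stripPrefix approach, with any exception captured.
def parse(s: String): Try[Long] = Try(s.stripPrefix("name").toLong)

// Expected input: prefix removed, digits parsed.
val good = parse("name5000005125651330")

// Missing prefix but all digits: accepted without any error.
val digitsOnly = parse("5000005125651330")

// Different prefix: toLong throws NumberFormatException, captured as a Failure.
val wrongPrefix = parse("code5000005125651330")

println(good.isSuccess)
println(digitsOnly.isSuccess)
println(wrongPrefix.isFailure)
```

So a malformed row raises an error only when what remains after stripPrefix is not a valid number. Whether that is acceptable depends on how strictly bad input should be treated.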
