I want to filter out alphanumeric and numeric words from my file. I'm working on Spark-Shell. These are the contents of my file sparktest.txt:

This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?

Defining the file for collection:

scala> val myLines = sc.textFile("sparktest.txt")

Saving the line into an Array with words of length greater than 2:

scala> val myWords = myLines.flatMap(x => x.split("\\W+")).filter(x => x.length >2)

Defining a regular expression to use. I only want strings that match "[A-Za-z]+":

scala> val regexpr = "[A-Za-z]+".r

Attempting to filter out the alphanumeric and numeric strings:

scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
       val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
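The error arises because a `scala.util.matching.Regex` is not a function, so it cannot be applied as `regexpr(x)`. A plain-Scala sketch (no Spark needed) of ways to test a string against the pattern:

```scala
// A Regex object is not a function, so regexpr(x) does not compile.
// Three working alternatives:
val regexpr = "[A-Za-z]+".r

// 1. Go through the underlying java.util.regex.Pattern
val m1 = regexpr.pattern.matcher("HDFS").matches

// 2. Call matches directly on the String
val m2 = "l1ke".matches("[A-Za-z]+")

// 3. Pattern-match with the Regex extractor (anchored to the whole string)
val m3 = "file" match {
  case regexpr() => true
  case _         => false
}
```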

This is where I'm stuck. I want the result to look like this:

Array[String] = Array(This, file, not, Would, you, this, file, HDFS)

2 Answers

You can actually do this in one transformation and filter the split arrays within your flatMap:

val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))

When I run this in spark-shell, I see:

scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21

scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23

scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
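The same split-and-filter logic can be checked without Spark on a plain Scala collection, since `flatMap` on an `RDD[String]` behaves like the collection version here:

```scala
// Sketch: apply the one-transformation approach to a plain List
// instead of an RDD; the result should match the Spark output.
val lines = List("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?")
val words = lines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
// words: List(This, file, not, Would, you, this, file, HDFS)
```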

3 Comments

Awesome! Thanks. The one-liner worked too! I like the use of &&. I like coding line by line, but I'm quickly realizing the attraction of combining commands into one line. Thank you! Why did you use sc.parallelize instead of sc.textFile?
The parallelize method distributes a collection across an RDD, so I just distributed a collection containing your example string instead of loading it from a file. The parallelize and textFile methods simply have different uses. If the answer worked (Alexandr's as well), feel free to upvote it so that people know it's right.
Thanks again @Rohan Aletty. Can I contact you via private message about the future of Big Data and Hadoop?
You can use filter(x => regexpr.pattern.matcher(x).matches) or filter(_.matches("[A-Za-z]+"))
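Both predicates agree; a quick plain-Scala sketch (no Spark needed) comparing them on a few sample words:

```scala
// Compare the two filter predicates from the answer on plain strings.
val regexpr = "[A-Za-z]+".r
val samples = List("This", "54783", "l1ke", "HDFS")

val viaMatcher = samples.filter(x => regexpr.pattern.matcher(x).matches)
val viaMatches = samples.filter(_.matches("[A-Za-z]+"))
// both keep only the purely alphabetic words: List(This, HDFS)
```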

1 Comment

Thanks! both worked. I like filter(_.matches("[A-Za-z]+")) better!
