I want to filter out alphanumeric and numeric words from my file. I'm working on Spark-Shell. These are the contents of my file sparktest.txt:

This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?

Defining the file for collection:

scala> val myLines = sc.textFile("sparktest.txt")

Saving the line into an Array with words of length greater than 2:

scala> val myWords = myLines.flatMap(x => x.split("\\W+")).filter(x => x.length >2)

Defining a regular expression to use. I only want strings that match "[A-Za-z]+":

scala> val regexpr = "[A-Za-z]+".r

Attempting to filter out the alphanumeric and numeric strings:

scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
       val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
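The error arises because a `scala.util.matching.Regex` is not a function, so it cannot be applied as `regexpr(x)`. A plain-Scala sketch (no Spark needed) of ways to test a string against the pattern:

```scala
// A Regex object is not a function, so regexpr(x) does not compile.
// Three working alternatives:
val regexpr = "[A-Za-z]+".r

// 1. Go through the underlying java.util.regex.Pattern
val m1 = regexpr.pattern.matcher("HDFS").matches

// 2. Call matches directly on the String
val m2 = "l1ke".matches("[A-Za-z]+")

// 3. Pattern-match with the Regex extractor (anchored to the whole string)
val m3 = "file" match {
  case regexpr() => true
  case _         => false
}
```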

This is where I'm stuck. I want the result to look like this:

Array[String] = Array(This, file, not, Would, you, this, file, HDFS)

2 Answers

You can actually do this in one transformation and filter the split arrays within your flatMap:

val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))

When I run this in spark-shell, I see:

scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21

scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23

scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
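The same split-and-filter logic can be checked without Spark on a plain Scala collection, since `flatMap` on an `RDD[String]` behaves like the collection version here:

```scala
// Sketch: apply the one-transformation approach to a plain List
// instead of an RDD; the result should match the Spark output.
val lines = List("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?")
val words = lines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
// words: List(This, file, not, Would, you, this, file, HDFS)
```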

3 Comments

Awesome! Thanks. The one-liner worked too! I like the use of &&. I like coding line by line, but I'm quickly realizing the attraction of combining commands into one line. Thank you! Why did you use sc.parallelize instead of sc.textFile?
The parallelize method distributes a collection across an RDD, so I just distributed a collection containing your example string instead of loading it from a file. The parallelize and textFile methods simply have different uses. If the answer worked (Alexandr's as well), feel free to upvote it so that people know it's right.
Thanks again @Rohan Aletty. Can I contact you via private message about the future of Big Data and Hadoop?
You can use filter(x => regexpr.pattern.matcher(x).matches) or filter(_.matches("[A-Za-z]+"))
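Both predicates agree; a quick plain-Scala sketch (no Spark needed) comparing them on a few sample words:

```scala
// Compare the two filter predicates from the answer on plain strings.
val regexpr = "[A-Za-z]+".r
val samples = List("This", "54783", "l1ke", "HDFS")

val viaMatcher = samples.filter(x => regexpr.pattern.matcher(x).matches)
val viaMatches = samples.filter(_.matches("[A-Za-z]+"))
// both keep only the purely alphabetic words: List(This, HDFS)
```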

1 Comment

Thanks! both worked. I like filter(_.matches("[A-Za-z]+")) better!
