3

I am writing a small program in Spark using Scala, and came across a problem. I have a List/RDD of single word strings and a List/RDD of sentences which might or might not contain words from the list of single words. i.e.

val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")

and I want to select the sentences that contains one or more of the words from singles such that the result should be something like:

output[(this, Array(this Date, this is a String)),(is, Array(is there something, this is a string))]

I thought about two approaches, one by splitting the sentence and filtering using .contains. The other is to split and format sentence into a RDD and use the .join for RDD intersection. I am looking at around 50 single words and 5 million sentences, which method would be faster? Are there any other solutions? Could you also help me with the coding, I seem to get no results with my code (although it compiles and run without error)

1
  • Given that each word will get an avg of 100K sentences, grouping might not be a real option. (word, sentence) would be a better end format Commented Apr 21, 2015 at 16:09

2 Answers 2

5

You can create a set of required keys, look up the keys in sentences and group by keys.

val singles = Array("this", "is")

val sentences = Array("this Date", 
                      "is there something", 
                      "where are something", 
                      "this is a string")

val rdd = sc.parallelize(sentences) // create RDD

val keys = singles.toSet            // words required as keys.

val result = rdd.flatMap{ sen => 
                    val words = sen.split(" ").toSet; 
                    val common = keys & words;       // intersect
                    common.map(x => (x, sen))        // map as key -> sen
                }
                .groupByKey.mapValues(_.toArray)     // group values for a key
                .collect                             // get rdd contents as array

// result:
// Array((this, Array(this Date, this is a string)),
//       (is,   Array(is there something, this is a string)))
Sign up to request clarification or add additional context in comments.

1 Comment

How should I run the substring matching? Imagine instead of the list of words, there is the list of phrases: val singles = Array("this book", "is great") There I cannot sprit the sentence! Any suggestion?
1

I've just tried to solve your problem and I've ended up with this code:

def check(s:String, l: Array[String]): Boolean = {
  var temp:Int = 0
  for (element <- l) {
    if (element.equals(s)) {temp = temp +1}
  }
  var result = false
  if (temp > 0) {result = true}
  result
}
val singles = sc.parallelize(Array("this", "is"))
val sentence = sc.parallelize(Array("this Date", "is there something", "where are something", "this is a string"))
val result = singles.cartesian(sentence)
                    .filter(x => check(x._1,x._2.split(" ")) == true )
                    .groupByKey()
                    .map(x => (x._1,x._2.mkString(", ") ))  // pay attention here(*)
result.foreach(println)

The last map line (*) is there just beacause without it I get something with CompactBuffer, like this:

(is,CompactBuffer(is there something, this is a string))     
(this,CompactBuffer(this Date, this is a string))

With that map line (with a mkString command) I get a more readable output like this:

(is,is there something, this is a string)
(this,this Date, this is a string)

Hope it could help in some way.

FF

3 Comments

The cartesian for 5 million sentences is going to be a tough cookie, but good answer never the less.
you're probably right... but you could give it a try and see how it works... I admit that it's just a "quick" answer and I could find a way to improve it.
It's actually not bad at all, considering it is using Spark RDD, it is slightly slower than my version which runs only at the Master, but I think for more data, your is much better than mine; besides, cartesian might be the most efficient way to search if you think about it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.