Scala/Spark efficient partial string match

Question

I am writing a small program in Spark using Scala, and came across a problem. I have a List/RDD of single word strings and a List/RDD of sentences which might or might not contain words from the list of single words. i.e.

val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")

and I want to select the sentences that contains one or more of the words from singles such that the result should be something like:

output[(this, Array(this Date, this is a String)),(is, Array(is there something, this is a string))]

I thought about two approaches, one by splitting the sentence and filtering using .contains. The other is to split and format sentence into a RDD and use the .join for RDD intersection. I am looking at around 50 single words and 5 million sentences, which method would be faster? Are there any other solutions? Could you also help me with the coding, I seem to get no results with my code (although it compiles and run without error)

Given that each word will get an avg of 100K sentences, grouping might not be a real option. (word, sentence) would be a better end format — maasg
– maasg, Commented Apr 21, 2015 at 16:09

Shyamendra Solanki · Accepted Answer · 2015-04-21 15:35:02Z

5

You can create a set of required keys, look up the keys in sentences and group by keys.

val singles = Array("this", "is")

val sentences = Array("this Date", 
                      "is there something", 
                      "where are something", 
                      "this is a string")

val rdd = sc.parallelize(sentences) // create RDD

val keys = singles.toSet            // words required as keys.

val result = rdd.flatMap{ sen => 
                    val words = sen.split(" ").toSet; 
                    val common = keys & words;       // intersect
                    common.map(x => (x, sen))        // map as key -> sen
                }
                .groupByKey.mapValues(_.toArray)     // group values for a key
                .collect                             // get rdd contents as array

// result:
// Array((this, Array(this Date, this is a string)),
//       (is,   Array(is there something, this is a string)))

answered Apr 21, 2015 at 15:35

Shyamendra Solanki

8,8512 gold badges33 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Reihan_amn Over a year ago

How should I run the substring matching? Imagine instead of the list of words, there is the list of phrases: val singles = Array("this book", "is great") There I cannot sprit the sentence! Any suggestion?

Fabio Fantoni · Accepted Answer · 2015-04-21 12:58:54Z

1

I've just tried to solve your problem and I've ended up with this code:

def check(s:String, l: Array[String]): Boolean = {
  var temp:Int = 0
  for (element <- l) {
    if (element.equals(s)) {temp = temp +1}
  }
  var result = false
  if (temp > 0) {result = true}
  result
}
val singles = sc.parallelize(Array("this", "is"))
val sentence = sc.parallelize(Array("this Date", "is there something", "where are something", "this is a string"))
val result = singles.cartesian(sentence)
                    .filter(x => check(x._1,x._2.split(" ")) == true )
                    .groupByKey()
                    .map(x => (x._1,x._2.mkString(", ") ))  // pay attention here(*)
result.foreach(println)

The last map line (*) is there just beacause without it I get something with CompactBuffer, like this:

(is,CompactBuffer(is there something, this is a string))     
(this,CompactBuffer(this Date, this is a string))

With that map line (with a mkString command) I get a more readable output like this:

(is,is there something, this is a string)
(this,this Date, this is a string)

Hope it could help in some way.

FF

answered Apr 21, 2015 at 12:58

Fabio Fantoni

3,1573 gold badges26 silver badges34 bronze badges

3 Comments

GameOfThrows Over a year ago

The cartesian for 5 million sentences is going to be a tough cookie, but good answer never the less.

Fabio Fantoni Over a year ago

you're probably right... but you could give it a try and see how it works... I admit that it's just a "quick" answer and I could find a way to improve it.

GameOfThrows Over a year ago

It's actually not bad at all, considering it is using Spark RDD, it is slightly slower than my version which runs only at the Master, but I think for more data, your is much better than mine; besides, cartesian might be the most efficient way to search if you think about it.

Collectives™ on Stack Overflow

Scala/Spark efficient partial string match

2 Answers 2

1 Comment

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related