
My file is:

sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes

There are 7 rows and 5 columns (0, 1, 2, 3, 4).

I want the output as:

Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))

The output must be of type Map[Int, Set[String]].

2 Answers


EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark

Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.

Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit trickier - the naive approach of initialising a set and then adding every value from every row is inherently serial/local, since there is only one set (per column) that every row's value must be added to.

However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
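As a quick sanity check on that reasoning, here's a plain-Scala sketch (hypothetical two-column rows, no Spark involved) showing that merging the column sets of two halves gives the same answer as computing them over all rows at once:

```scala
// Tiny hypothetical dataset: each row is an array of column values
val rows = List(
  Array("sunny", "hot"), Array("rainy", "mild"),
  Array("overcast", "hot"), Array("rainy", "cool"))

// Compute the per-column value sets for any subset of the rows
def columnSets(part: List[Array[String]]): Array[Set[String]] =
  part.map(_.map(Set(_))).reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }

// Column sets of the two halves, merged column-by-column...
val (left, right) = rows.splitAt(2)
val merged = (columnSets(left) zip columnSets(right)).map { case (l, r) => l ++ r }

// ...equal the column sets of the whole dataset:
// merged(0) == columnSets(rows)(0) == Set("sunny", "rainy", "overcast")
```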

So, the algorithm:

  • Split each row into an array of strings, then turn that into an array of single-element sets, one per column. This can all be done with one map, and distributed.
  • Reduce this with an operation that merges the sets for each column in turn. This can also be distributed.
  • Turn the single row that results into a Map.

It's no coincidence that we do a map, then a reduce, which should remind you of something :)

Here's a one-liner that produces the single row:

val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes") 

val row = data.map(_.split("\\W+").map(s=>Set(s)))
              .reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}

Converting it to a Map as the question asks:

val theMap = row.zipWithIndex.map(_.swap).toMap
  • Zip the list with the index, since that's what we need as the key of the map.
  • The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
  • Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
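To make those steps concrete, here's a hypothetical two-column `row` with the intermediate value at each stage (the names are just for illustration):

```scala
// A small stand-in for the Array[Set[String]] produced by the map/reduce above
val smallRow = Array(Set("sunny", "rainy"), Set("hot", "cool"))

val indexed = smallRow.zipWithIndex  // Array((Set(sunny, rainy),0), (Set(hot, cool),1))
val swapped = indexed.map(_.swap)    // Array((0,Set(sunny, rainy)), (1,Set(hot, cool)))
val asMap   = swapped.toMap          // Map(0 -> Set(sunny, rainy), 1 -> Set(hot, cool))
```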

These don't need to change at all to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)

val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
             .reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}

(This can be converted into a Map as before)

An earlier one-liner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row):

data.map(_.split("\\W+")).transpose.map(_.toSet)

(Omitting the conversion to Map for clarity)

  • Split each string into words.
  • Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
  • Convert each of those to a set.
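For illustration, the intermediate values of this local pipeline on a hypothetical two-column subset:

```scala
val data = List("sunny,hot", "rainy,cool", "sunny,cool")

val words      = data.map(_.split("\\W+"))  // List(Array(sunny, hot), Array(rainy, cool), ...)
val columns    = words.transpose            // List(List(sunny, rainy, sunny), List(hot, cool, cool))
val columnSets = columns.map(_.toSet)       // List(Set(sunny, rainy), Set(hot, cool))
```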

10 Comments

Nice one. Just wanted to say that, in this specific case, split(',') would work the same.
Indeed it would. But the split is probably the least interesting bit of this :)
Yup, just a minor note in fact.
@Paul One problem: here val rdd is a List(...). I want val rdd as RDD[String], not List[String]. If we change the List to an RDD, we can't apply transpose, swap, etc. Any solution?
Sorry, I'm not experienced with RDD. If it doesn't support transpose then another solution may be best. (swap, however, is about the elements in the RDD, not the RDD itself, so it should be usable since RDD does support map.) Or write your own transpose - it's five map operations, one for each index, which should parallelize nicely - and in that case you could add things to the set in each map as you go, saving some time.
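A sketch of that last suggestion in plain Scala (a List stands in for the RDD here, since map and reduce exist on both; the column count is assumed known up front):

```scala
// Stand-in for the RDD[String]; each pass below is an independent map + reduce
// over it, one per column index, so each would distribute.
val data = List(
  "sunny,hot,high,FALSE,no",
  "overcast,cool,normal,TRUE,yes")

val nCols = 5
val theMap: Map[Int, Set[String]] =
  (0 until nCols).map { i =>
    // Build the set for column i directly as we go
    i -> data.map(line => Set(line.split(",")(i))).reduce(_ ++ _)
  }.toMap
// theMap(0) == Set("sunny", "overcast")
```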

Maybe this does the trick:

    val a = Array(
      "sunny,hot,high,FALSE,no",
      "sunny,hot,high,TRUE,no",
      "overcast,hot,high,FALSE,yes",
      "rainy,mild,high,FALSE,yes",
      "rainy,cool,normal,FALSE,yes",
      "rainy,cool,normal,TRUE,no",
      "overcast,cool,normal,TRUE,yes")

    val b = new Array[Map[Int, Set[String]]](5)

    for (i <- 0 to 4)
      b(i) = Map(i -> (for (s <- a) yield s.split(",")(i)).toSet)

    println(b.mkString("\n"))
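Since the question asks for a single Map[Int, Set[String]] rather than an array of one-entry maps, the same for-comprehension can build the map directly. A sketch (with the data inlined so the snippet stands alone):

```scala
// The same data as in the answer above
val a = Array(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes")

// One (index -> column set) pair per column, collected into a single map
val theMap: Map[Int, Set[String]] =
  (for (i <- 0 to 4)
    yield i -> (for (s <- a) yield s.split(",")(i)).toSet).toMap
```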

