How to create a map from an RDD[String] using Scala?

Question

My file is,

sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes

There are 7 rows and 5 columns (indexed 0, 1, 2, 3, 4).

I want the output as,

Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))

The output must be of type Map[Int, Set[String]].

The Archetypal Paul · Accepted Answer · 2014-11-13 19:10:55Z


EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark

Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.

Splitting each string into words can be done per partition. Getting the set of values used in each column is trickier: the naive approach of initialising a set and then adding every value from every row is inherently serial and local, since there is only one set per column and every row's value has to be added to it.

However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
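
The merge idea can be illustrated with plain Scala collections before any Spark is involved. This is only a sketch: the names firstPart, secondPart and merge are illustrative, and the partial sets were worked out by hand from the first three and last four rows of the question's file.

```scala
// Column sets computed separately for two parts of the rows
// (rows 1-3 and rows 4-7 of the sample file, worked out by hand).
val firstPart: Vector[Set[String]] = Vector(
  Set("sunny", "overcast"), Set("hot"), Set("high"),
  Set("FALSE", "TRUE"), Set("no", "yes"))

val secondPart: Vector[Set[String]] = Vector(
  Set("rainy", "overcast"), Set("mild", "cool"), Set("high", "normal"),
  Set("FALSE", "TRUE"), Set("yes", "no"))

// Merge column by column: the set for all rows is the union of the partial sets.
def merge(a: Vector[Set[String]], b: Vector[Set[String]]): Vector[Set[String]] =
  a.zip(b).map { case (l, r) => l ++ r }

val merged = merge(firstPart, secondPart)
merged.foreach(println)
```

Because set union is associative and commutative, the parts can be merged in any order and grouping, which is what makes the operation safe to use in a distributed reduce.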

So, the algorithm:

Split each row into an array of strings, then change this into an array of single-element sets, one per column. This can all be done with one map, and distributed.
Reduce this using an operation that merges the sets for each column in turn. This can also be distributed.
Turn the single row that results into a Map.

It's no coincidence that we do a map, then a reduce, which should remind you of something :)

Here's a one-liner that produces the single row:

val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes") 

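// Split each row on non-word characters (here, the commas) and wrap each value in a one-element set,
// then merge rows pairwise by taking the union of the sets in each column.
val row = data.map(_.split("\\W+").map(s => Set(s)))
              .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }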

Converting it to a Map as the question asks:

val theMap = row.zipWithIndex.map(_.swap).toMap

Zip the list with the index, since that's what we need as the key of the map.
The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
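
To make the three steps concrete, here is a minimal sketch on a shortened stand-in for the reduced row. The names sampleRow, indexed, swapped and asMap are illustrative only, and the two-column data is made up for brevity.

```scala
// A shortened stand-in for the reduced row: one set per column.
val sampleRow: Array[Set[String]] = Array(Set("sunny", "rainy"), Set("hot", "cool"))

// Step 1: pair each set with its position, giving (value, index) tuples.
val indexed: Array[(Set[String], Int)] = sampleRow.zipWithIndex

// Step 2: swap each tuple so the index comes first, giving (index, value) tuples.
val swapped: Array[(Int, Set[String])] = indexed.map(_.swap)

// Step 3: a collection of (key, value) pairs converts directly to a Map.
val asMap: Map[Int, Set[String]] = swapped.toMap

println(asMap)
```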

These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:

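import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)

// Same map and reduce as before, now running on an RDD[String]
val row = rdd.map(_.split("\\W+").map(s => Set(s)))
             .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }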

(This can be converted into a Map as before)

An earlier one-liner works neatly (transpose is exactly what's needed here) but is very difficult to distribute, since transpose inherently needs to visit every row:

data.map(_.split("\\W+")).transpose.map(_.toSet)

(Omitting the conversion to Map for clarity)

Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
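
The same three steps, spelled out with intermediate values on a small made-up sample, might look like this. The names sample, words, columns and sets are illustrative, and the split results are converted to List so that transpose works on a plain list of lists.

```scala
// Three short rows, made up for illustration.
val sample = List("sunny,hot,no", "rainy,cool,yes", "sunny,mild,yes")

// Step 1: split each string into words.
val words: List[List[String]] = sample.map(_.split(",").toList)

// Step 2: transpose, so each inner list holds the values of one column.
val columns: List[List[String]] = words.transpose

// Step 3: convert each column to a set, dropping duplicates.
val sets: List[Set[String]] = columns.map(_.toSet)

sets.foreach(println)
```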

edited Nov 13, 2014 at 19:10

answered Nov 11, 2014 at 12:00


10 Comments

Gabriele Petronella Over a year ago

nice one. Just wanted to say that, in the specific case, split(',') would work the same

The Archetypal Paul Over a year ago

Indeed it would. But the split is probably the least interesting bit of this :)

Gabriele Petronella Over a year ago

Yup, just a minor note in fact.

rosy Over a year ago

@Paul But one problem: here "val rdd" is a List(...). I want "val rdd" to be an RDD[String], not a List[String]. If we change the List to an RDD, we can't apply transpose, swap, etc. Any solution?

The Archetypal Paul Over a year ago

Sorry, I'm not experienced with RDD. If it doesn't support transpose then another solution may be best (swap however is about the elements in the RDD, and not the RDD itself, so that should be usable since RDD does support map?). Or write your own transpose (it's five map operations, one for each index, which should parallelize nicely), but in that case you could add things to the set in each map as you go, saving some time.


ale64bit · Accepted Answer · 2014-11-11 11:22:25Z


Maybe this does the trick:

    val a = Array(
      "sunny,hot,high,FALSE,no",
      "sunny,hot,high,TRUE,no",
      "overcast,hot,high,FALSE,yes",
      "rainy,mild,high,FALSE,yes",
      "rainy,cool,normal,FALSE,yes",
      "rainy,cool,normal,TRUE,no",
      "overcast,cool,normal,TRUE,yes")

    // One single-entry map per column, keyed by the column index (Int, as the question asks)
    val b = new Array[Map[Int, Set[String]]](5)

    for (i <- 0 to 4)
      b(i) = Map(i -> (Set() ++ (for (s <- a) yield s.split(",")(i))))

    println(b.mkString("\n"))
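
The loop above builds one single-entry map per column. If a single combined Map[Int, Set[String]] is wanted, as the question asks, one possible sketch is below. The names lines, columnCount and combined are illustrative, and the code assumes every row has the same number of comma-separated fields.

```scala
// Same sample rows as in the question.
val lines = Array(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes")

// Number of columns, taken from the first row (assumes all rows have the same width).
val columnCount = lines.head.split(",").length

// For each column index, collect the distinct values into a set, then build one Map.
val combined: Map[Int, Set[String]] =
  (0 until columnCount).map { i =>
    val values: Set[String] = lines.map(_.split(",")(i)).toSet
    i -> values
  }.toMap

combined.toSeq.sortBy(_._1).foreach(println)
```

Like the loop it is based on, this runs on a local Array and makes one pass over the data per column, so it is a local illustration rather than a distributed solution.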
