I have this problem: I have an RDD[(String, Array[String])], and I would like to extract from it an RDD in which the values are grouped by key:

e.g:

val x: RDD[(String, Array[String])] =
  RDD[("a", Array("ra", "re", "ri")),
      ("a", Array("ta", "te", "ti")),
      ("b", Array("pa", "pe", "pi"))]

I would like to get:

val result: RDD[(String, RDD[Array[String]])] =
  RDD[("a", RDD[Array("ra", "re", "ri"), Array("ta", "te", "ti")]),
      ("b", RDD[Array("pa", "pe", "pi"), ...]),
      ...]

2 Answers

A simple reduceByKey should solve your issue:

x.reduceByKey((prev, next) => prev ++ next)
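For reference, a minimal local-mode sketch of this answer (the SparkSession setup and the object/method names here are my own additions, not part of the answer):

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    // Local session just for trying the snippet out
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    val sc = spark.sparkContext

    val x = sc.parallelize(Seq(
      ("a", Array("ra", "re", "ri")),
      ("a", Array("ta", "te", "ti")),
      ("b", Array("pa", "pe", "pi"))
    ))

    // reduceByKey with ++ concatenates the arrays that share a key
    val merged = x.reduceByKey((prev, next) => prev ++ next)

    merged.collect.foreach { case (k, v) => println(k + ": " + v.mkString(" ")) }

    spark.stop()
  }
}
```

Note that this produces an `RDD[(String, Array[String])]` with one flat array per key; if you want to keep each original record as its own sub-array, `groupByKey` (as in the other answer) is the closer fit.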

As far as I know, Spark doesn't support nested RDDs (see this StackOverflow discussion).

In case nested arrays are good for what you need, a simple groupByKey will do:

val x = sc.parallelize(Seq(
  ("a", Array( "ra", "re", "ri" )),
  ("a", Array( "ta", "te", "ti" )),
  ("b", Array( "pa", "pe", "pi" ))
))

val x2 = x.groupByKey

x2.collect.foreach(println)
(a,CompactBuffer([Ljava.lang.String;@75043e31, [Ljava.lang.String;@18656538))
(b,CompactBuffer([Ljava.lang.String;@2cf30d2e))

x2.collect.foreach{ case (a, b) => println(a + ": " + b.map(_.mkString(" "))) }
a: List(ra re ri, ta te ti)
b: List(pa pe pi)
