I have this problem: I have an RDD[(String, Array[String])], and I would like to extract from it an RDD in which the values are grouped by key:

e.g:

val x: RDD[(String, Array[String])] =
  RDD[("a", Array("ra", "re", "ri")),
      ("a", Array("ta", "te", "ti")),
      ("b", Array("pa", "pe", "pi"))]

I would like to get:

val result: RDD[(String, RDD[Array[String]])] =
  RDD[("a", RDD[Array("ra", "re", "ri"), Array("ta", "te", "ti")]),
      ("b", RDD[Array("pa", "pe", "pi"), ...]),
      ...]

2 Answers

A simple reduceByKey should solve your issue:

x.reduceByKey((prev, next) => prev ++ next)
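For reference, a minimal local-mode sketch of this answer (the SparkSession setup and the object/method names here are my own additions, not part of the answer):

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    // Local session just for trying the snippet out
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    val sc = spark.sparkContext

    val x = sc.parallelize(Seq(
      ("a", Array("ra", "re", "ri")),
      ("a", Array("ta", "te", "ti")),
      ("b", Array("pa", "pe", "pi"))
    ))

    // reduceByKey with ++ concatenates the arrays that share a key
    val merged = x.reduceByKey((prev, next) => prev ++ next)

    merged.collect.foreach { case (k, v) => println(k + ": " + v.mkString(" ")) }

    spark.stop()
  }
}
```

Note that this produces an `RDD[(String, Array[String])]` with one flat array per key; if you want to keep each original record as its own sub-array, `groupByKey` (as in the other answer) is the closer fit.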

As far as I know, Spark doesn't support nested RDDs (see this StackOverflow discussion).

In case nested arrays are good for what you need, a simple groupByKey will do:

val x = sc.parallelize(Seq(
  ("a", Array( "ra", "re", "ri" )),
  ("a", Array( "ta", "te", "ti" )),
  ("b", Array( "pa", "pe", "pi" ))
))

val x2 = x.groupByKey

x2.collect.foreach(println)
(a,CompactBuffer([Ljava.lang.String;@75043e31, [Ljava.lang.String;@18656538))
(b,CompactBuffer([Ljava.lang.String;@2cf30d2e))

x2.collect.foreach{ case (a, b) => println(a + ": " + b.map(_.mkString(" "))) }
a: List(ra re ri, ta te ti)
b: List(pa pe pi)
