How to split RDD of (String, Array[String]) into RDD of (String, String) for each item in array?

Question

I have a pair RDD of type RDD[(String, Array[String])]. I want to flatten the values so that I get an RDD[(String, String)] in which each element of an Array[String] in the first RDD becomes its own element in the second RDD.

For instance, my first RDD has the following elements:

("a", Array("x", "y"))
("b", Array("y", "z"))

The result I want is this:

("a", "x")
("a", "y")
("b", "y")
("b", "z")

How can I do this? flatMapValues(f: Array[String] => TraversableOnce[String]) seems to be the right choice here, but what do I need to pass as the argument f?

@kaktusito Right, thanks. I've updated the question because I was actually looking for the argument to pass into flatMapValues(). You've made that clear.
– Carsten, Commented Sep 3, 2015 at 18:40
@Carsten I would use identity instead of x => x. The Scala compiler is probably clever enough to realize that x => x is the identity function, but if it isn't, you create a new object.
– 2rs2ts, Commented Sep 3, 2015 at 18:41
Is there any difference in using this instead: rdd.flatMap{ case (a,b) => b.map(a->_) }? Does flatMapValues do anything different?
– tuxdna, Commented Sep 4, 2015 at 7:47
@tuxdna There's a performance reason, I believe. flatMap is not guaranteed to keep the partitioner of the original RDD (since there's no way to check that the keys will remain the same), while flatMapValues will. This is important for operations that require shuffling, such as joins.
– ale64bit, Commented Sep 4, 2015 at 11:09
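
To illustrate the partitioner point from the comment above, here is a minimal sketch that compares the two approaches on a hash-partitioned RDD. The object name PartitionerSketch, the helper partitionerKept, the local master setting, and the partition count of 4 are all assumptions made for the example. It also uses _.toSeq rather than identity so that it does not depend on an implicit array conversion.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionerSketch {
  // Hypothetical helper: reports whether each approach still carries a partitioner.
  // Returns (flatMapValues keeps one?, flatMap keeps one?).
  def partitionerKept(rdd: RDD[(String, Array[String])]): (Boolean, Boolean) = {
    val viaFlatMapValues = rdd.flatMapValues(_.toSeq)
    val viaFlatMap = rdd.flatMap { case (k, vs) => vs.toSeq.map(v => (k, v)) }
    (viaFlatMapValues.partitioner.isDefined, viaFlatMap.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch")
    val sc = new SparkContext(conf)
    try {
      val partitioned = sc
        .parallelize(Seq(("a", Array("x", "y")), ("b", Array("y", "z"))))
        .partitionBy(new HashPartitioner(4)) // gives the input RDD a known partitioner
      println(partitionerKept(partitioned))
    } finally {
      sc.stop()
    }
  }
}
```

If the comment is right, the first flag should be true and the second false, which means a later join on the flatMap result may need an extra shuffle.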

3 revs, 2 users 92% · Accepted Answer · 2015-09-04 10:14:35Z

4

To achieve the desired result, do:

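import org.apache.spark.rdd.RDD
val rdd1: RDD[(Any, Array[Any])] = ...
// identity returns each array unchanged; flatMapValues emits one (key, element) pair per array element
val rddFlat: RDD[(Any, Any)] = rdd1.flatMapValues(identity[Array[Any]])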

The result looks like the one asked for in the question.
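
For anyone who wants to try this end to end, here is a minimal, self-contained sketch using the data from the question. The object name FlattenValuesSketch, the helper flattenValues, and the local master setting are assumptions for illustration. It passes _.toSeq instead of identity, which is a small variation on the answer that avoids relying on an implicit conversion from Array.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlattenValuesSketch {
  // Hypothetical helper: one (key, element) pair per element of each value array.
  def flattenValues(rdd: RDD[(String, Array[String])]): RDD[(String, String)] =
    rdd.flatMapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("flatten-values-sketch")
    val sc = new SparkContext(conf)
    try {
      // Same input as in the question.
      val rdd1 = sc.parallelize(Seq(("a", Array("x", "y")), ("b", Array("y", "z"))))
      // Should print the four pairs listed in the question.
      flattenValues(rdd1).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the transformation in its own function makes it easy to check with a small local SparkContext before running it on real data.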

edited Sep 4, 2015 at 10:14

community wiki

Carsten


1 Comment

Jacek Laskowski Over a year ago

protip: It should be a Wiki answer instead since you simply gathered the comments.
