
I have an RDD which contains the lines of a file. I want each partition to contain not the individual lines, but a single string concatenating those lines. For example:

Partition 1        Partition 2
  line 1            line n/2+1
  line 2            line n/2+2
    .                  .
    .                  .
    .                  .
  line n/2          line n

Figure 1 above shows my RDD, which is produced by the sc.textFile() method. I want to go from Figure 1 to the layout below (Figure 2):

        Partition 1                        Partition 2
concatenatedLinesFrom1toN/2        concatenatedLinesFromN/2+1toN

Is there any way to map the partitions so I can convert the RDD from figure 1 to the one in Figure 2?

2 Answers


If you need roughly uniform object size per partition (in-memory size / number of characters):

rdd.glom.map(_.mkString)

If you want a relatively uniform number of lines per partition rather than uniform size:

import org.apache.spark.RangePartitioner

val indexed = rdd.zipWithIndex.map(_.swap)
indexed.partitionBy(new RangePartitioner(2, indexed))
  .values
  .glom
  .map(_.mkString)

where rdd is an RDD[String] returned from textFile or a similar method.
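To see what glom followed by mkString does without a running Spark cluster, here is a plain-Scala sketch that models an RDD's partitions as a nested sequence (the data values are made up for illustration):

```scala
// Model an RDD's partitions as a Seq of Seq[String].
// glom turns each partition into an array of its elements;
// mapping mkString over that yields one string per partition.
val partitions: Seq[Seq[String]] = Seq(
  Seq("line 1", "line 2"), // partition 1
  Seq("line 3", "line 4")  // partition 2
)

// Local equivalent of rdd.glom.map(_.mkString):
val concatenated: Seq[String] = partitions.map(_.mkString)
// concatenated == Seq("line 1line 2", "line 3line 4")
```

Note that mkString with no separator joins the lines back-to-back; pass mkString("\n") if you want the newlines preserved.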



You can use rdd.mapPartitions to achieve this: the function you pass receives each partition as an Iterator[String] and must return a new iterator. EDIT: rdd.mapPartitions(iter => Iterator(iter.mkString)).collect
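To make the iterator-to-iterator shape of mapPartitions concrete without Spark, here is a minimal local sketch; mapPartitionsLocal is a hypothetical helper that mimics the semantics on an in-memory list of partitions:

```scala
// Hypothetical local stand-in for RDD.mapPartitions: apply f to each
// partition's iterator and materialize the resulting iterator.
def mapPartitionsLocal[A, B](parts: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[Seq[B]] =
  parts.map(p => f(p.iterator).toSeq)

val parts = Seq(Seq("a", "b"), Seq("c", "d"))

// Same function as in the answer: collapse each partition to one string.
val out = mapPartitionsLocal(parts)(iter => Iterator(iter.mkString))
// out == Seq(Seq("ab"), Seq("cd"))
```

The key point is that f runs once per partition, not once per element, which is what lets it see all of a partition's lines at once.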

1 Comment

Perhaps elaborate a bit more on how this can help the OP. Simply saying "use this method" is usually not so helpful.
