
I have an RDD which contains the lines of a file. I want each partition to contain not the individual lines, but a single string concatenating those lines. For example:

Partition 1        Partition 2
  line 1            line n/2+1
  line 2            line n/2+2
    .                  .
    .                  .
    .                  .
  line n/2          line n

Figure 1 above shows my RDD, which is produced by the sc.textFile() method. I want to go from Figure 1 to the layout below (Figure 2):

        Partition 1                        Partition 2
concatenatedLinesFrom1toN/2        concatenatedLinesFromN/2+1toN

Is there any way to map the partitions so I can convert the RDD from figure 1 to the one in Figure 2?

2 Answers


If you need roughly uniform object size per partition (in-memory size / number of characters):

rdd.glom.map(_.mkString)

If you want a relatively uniform number of lines per partition rather than uniform size:

import org.apache.spark.RangePartitioner

val indexed = rdd.zipWithIndex.map(_.swap)
indexed.partitionBy(new RangePartitioner(2, indexed))
  .values
  .glom
  .map(_.mkString)

where rdd is an RDD[String] returned from textFile or a similar method.
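To see what glom followed by mkString does without a running Spark cluster, here is a plain-Scala sketch that models an RDD's partitions as a nested sequence (the data values are made up for illustration):

```scala
// Model an RDD's partitions as a Seq of Seq[String].
// glom turns each partition into an array of its elements;
// mapping mkString over that yields one string per partition.
val partitions: Seq[Seq[String]] = Seq(
  Seq("line 1", "line 2"), // partition 1
  Seq("line 3", "line 4")  // partition 2
)

// Local equivalent of rdd.glom.map(_.mkString):
val concatenated: Seq[String] = partitions.map(_.mkString)
// concatenated == Seq("line 1line 2", "line 3line 4")
```

Note that mkString with no separator joins the lines back-to-back; pass mkString("\n") if you want the newlines preserved.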



You can use rdd.mapPartitions to achieve this: the function you pass receives each partition as an Iterator[String] and must return a new iterator. EDIT: rdd.mapPartitions(iter => Iterator(iter.mkString)).collect
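To make the iterator-to-iterator shape of mapPartitions concrete without Spark, here is a minimal local sketch; mapPartitionsLocal is a hypothetical helper that mimics the semantics on an in-memory list of partitions:

```scala
// Hypothetical local stand-in for RDD.mapPartitions: apply f to each
// partition's iterator and materialize the resulting iterator.
def mapPartitionsLocal[A, B](parts: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[Seq[B]] =
  parts.map(p => f(p.iterator).toSeq)

val parts = Seq(Seq("a", "b"), Seq("c", "d"))

// Same function as in the answer: collapse each partition to one string.
val out = mapPartitionsLocal(parts)(iter => Iterator(iter.mkString))
// out == Seq(Seq("ab"), Seq("cd"))
```

The key point is that f runs once per partition, not once per element, which is what lets it see all of a partition's lines at once.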

1 Comment

Perhaps elaborate a bit more on how this can help the OP. Simply saying "use this method" is usually not so helpful.
