I have a large sequence file with around 60 million entries (almost 4.5 GB). I want to split it, for example into three parts of 20 million entries each. So far my code looks like this:

// Read from the sequence file
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)! Can anyone suggest a better/valid approach?

  • What you did is the way to go, IMHO. If you want the files to have the same size, use repartition instead of coalesce. Commented May 3, 2017 at 14:18
  • But repartitioning gives an error: 17/05/03 23:10:46 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1) com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException Serialization trace: classes (sun.misc.Launcher$AppClassLoader) classLoader (org.apache.hadoop.mapred.JobConf) conf (org.apache.mahout.math.VectorWritable). Detailed trace: pastebin.com/eDWvV6Fx @TalJoffe Commented May 3, 2017 at 17:12
  • I think the problem lies in the shuffle, because if I use coalesce(3, true) the same error is thrown! Commented May 3, 2017 at 18:15
  • 1
    it is possible if the object in your RDD are not serializable... you can try making them serializable or another option would be to convert the RDD to Dataframe and then do repartitioning Commented May 4, 2017 at 6:31
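
A rough, untested sketch of the workaround suggested in these comments, applied to the question's code (same sc, inputPath and outputPath): pull the data out of the Writable wrappers before the shuffle, so Kryo never has to serialize a VectorWritable (whose conf field drags the JobConf and class loader from the trace above into serialization), repartition, then wrap the values again before saving. The clone() call and the assumption that scala.Tuple2 and org.apache.mahout.math.Vector are imported are mine, not from the original posts:

// Hypothetical workaround: unwrap the Writables, shuffle, then wrap again
JavaPairRDD<Integer, Vector> plain = seqVectors.mapToPair(t -> new Tuple2<>(t._1().get(), t._2().get().clone()));
JavaPairRDD<Integer, Vector> shuffled = plain.repartition(3); // three roughly equal partitions
JavaPairRDD<IntWritable, VectorWritable> restored = shuffled.mapToPair(t -> new Tuple2<>(new IntWritable(t._1()), new VectorWritable(t._2())));
restored.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);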

2 Answers


Perhaps not the exact answer you are looking for, but it might be worth trying the other sequenceFile overload for reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions.

Your code should then look like this:

// Read from the sequence file, asking for at least 3 partitions
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

Another thing that may cause problems is that some SequenceFiles are not splittable.


Maybe I'm not understanding your question correctly, but why not just read your file line by line (= entry by entry?) and build your three files that way? It would be something like this:

int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
// Deal the input lines out round-robin across the three writers
for (String line : Files.readAllLines(Paths.get(fileName))) {
  files.get(i % 3).println(line);
  i++;
}
files.forEach(PrintWriter::close);

In this case, the lines are dealt out round-robin: the first line goes to the first file, the second to the second, the third to the third, the fourth back to the first, and so on.

Another solution, if the file is not a text file, would be to do a binary read using Files.readAllBytes(Paths.get(inputFileName)) and write to your output files with Files.write(Paths.get(output1), byteToWrite).
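
As a rough sketch of that byte-level idea (the output names and inputFileName are placeholders), the whole file is read into memory and cut into three slices. Note that this only suits files that fit in the heap, and that cutting a SequenceFile at arbitrary byte offsets will not produce three readable SequenceFiles, so it only applies to formats that tolerate such a split:

byte[] all = Files.readAllBytes(Paths.get(inputFileName));
int third = all.length / 3;
// Placeholder output names; the last slice takes any remainder
Files.write(Paths.get("part1.bin"), Arrays.copyOfRange(all, 0, third));
Files.write(Paths.get("part2.bin"), Arrays.copyOfRange(all, third, 2 * third));
Files.write(Paths.get("part3.bin"), Arrays.copyOfRange(all, 2 * third, all.length));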

However, I do not have an answer as to why the output takes so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default, and your input file might be encoded in ASCII.

1 Comment

It is not a text file, it is a sequence file. With a text file I could easily do this, and I could probably take a similar entry-by-entry approach with a sequence file, but I am looking for the best approach from a Spark RDD perspective.
