I have a large sequence file with around 60 million entries (almost 4.5 GB). I want to split it, for example into three parts of 20 million entries each. So far my code looks like this:

// Read from the sequence file
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)! Can anyone suggest a better/valid approach?

  • What you did is the way to go, IMHO. If you want the files to have the same size, use repartition instead of coalesce. Commented May 3, 2017 at 14:18
  • But repartitioning gives an error: 17/05/03 23:10:46 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1) com.esotericsoftware.kryo.KryoException: java.util.ConcurrentModificationException Serialization trace: classes (sun.misc.Launcher$AppClassLoader) classLoader (org.apache.hadoop.mapred.JobConf) conf (org.apache.mahout.math.VectorWritable). Detailed trace: pastebin.com/eDWvV6Fx @TalJoffe Commented May 3, 2017 at 17:12
  • I think the problem lies in the shuffle, because if I use coalesce(3, true) the same error is thrown! Commented May 3, 2017 at 18:15
  • 1
    it is possible if the object in your RDD are not serializable... you can try making them serializable or another option would be to convert the RDD to Dataframe and then do repartitioning Commented May 4, 2017 at 6:31
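
A rough, untested sketch of the workaround suggested in these comments, applied to the question's code (same sc, inputPath and outputPath): pull the data out of the Writable wrappers before the shuffle, so Kryo never has to serialize a VectorWritable (whose conf field drags the JobConf and class loader from the trace above into serialization), repartition, then wrap the values again before saving. The clone() call and the assumption that scala.Tuple2 and org.apache.mahout.math.Vector are imported are mine, not from the original posts:

// Hypothetical workaround: unwrap the Writables, shuffle, then wrap again
JavaPairRDD<Integer, Vector> plain = seqVectors.mapToPair(t -> new Tuple2<>(t._1().get(), t._2().get().clone()));
JavaPairRDD<Integer, Vector> shuffled = plain.repartition(3); // three roughly equal partitions
JavaPairRDD<IntWritable, VectorWritable> restored = shuffled.mapToPair(t -> new Tuple2<>(new IntWritable(t._1()), new VectorWritable(t._2())));
restored.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);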

2 Answers


Perhaps not the exact answer you are looking for, but it might be worth trying the other sequenceFile overload for reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions.

Your code should then look like this:

// Read from the sequence file, asking for at least 3 partitions
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

Another thing that may cause problems is that some SequenceFiles are not splittable.


Maybe I'm not understanding your question correctly, but why not just read your file line by line (= entry by entry?) and build your three files that way? It would be something like this:

int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
// Deal the input lines out round-robin across the three writers
for (String line : Files.readAllLines(Paths.get(fileName))) {
  files.get(i % 3).println(line);
  i++;
}
files.forEach(PrintWriter::close);

In this case, the lines are dealt out round-robin: the first line goes to the first file, the second to the second, the third to the third, the fourth back to the first, and so on.

Another solution, if the file is not a text file, would be to do a binary read using Files.readAllBytes(Paths.get(inputFileName)) and write to your output files with Files.write(Paths.get(output1), byteToWrite).
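
As a rough sketch of that byte-level idea (the output names and inputFileName are placeholders), the whole file is read into memory and cut into three slices. Note that this only suits files that fit in the heap, and that cutting a SequenceFile at arbitrary byte offsets will not produce three readable SequenceFiles, so it only applies to formats that tolerate such a split:

byte[] all = Files.readAllBytes(Paths.get(inputFileName));
int third = all.length / 3;
// Placeholder output names; the last slice takes any remainder
Files.write(Paths.get("part1.bin"), Arrays.copyOfRange(all, 0, third));
Files.write(Paths.get("part2.bin"), Arrays.copyOfRange(all, third, 2 * third));
Files.write(Paths.get("part3.bin"), Arrays.copyOfRange(all, 2 * third, all.length));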

However, I do not have an answer as to why the output takes so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default, and your input file might be encoded in ASCII.

1 Comment

It is not a text file, it is a sequence file. With a text file I could easily do this, and I could probably take a similar entry-by-entry approach with a sequence file, but I am looking for the best approach from a Spark RDD perspective.
