How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?
2 Answers
Common problems are a ClassCastException when casting BytesWritable to NullWritable, and confusion around BytesWritable.getBytes: it does not return just your bytes. getBytes returns the whole backing buffer, which may be padded with zeros on the end. You have to use copyBytes instead.
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.rdd.RDD

val rdd: RDD[Array[Byte]] = ???
// To write (codecOpt is an Option[Class[_ <: CompressionCodec]]; pass None for no compression)
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", codecOpt)
// To read; copyBytes trims the zero padding that getBytes would leave in place
val loaded: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
  .map(_._2.copyBytes())
5 Comments
Sam Stoelinga
This post is relatively old, so I just wanted to check whether the answer is still up to date. Is it still necessary to use copyBytes when reading?
samthebest
@SamStoelinga Yes, I think so; it's the Hadoop API, which is unlikely to change.
user1609012
A more efficient alternative is to call <BytesWritableInstance>.getBytes() and process only the first <BytesWritableInstance>.getLength() bytes, which avoids the copy. Of course, if you strictly need an RDD[Array[Byte]], this approach won't work, but you could consider an RDD[(Array[Byte], Int)].
Choix
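A minimal sketch of the difference, using a plain byte array as a stand-in for a BytesWritable's backing buffer (the buffer contents here are an assumption for illustration; no Spark or Hadoop needed to see the point):

```scala
import java.util.Arrays

// Stand-in for BytesWritable's backing buffer: 3 valid bytes plus zero padding.
// getBytes() would hand you this whole 8-byte buffer; getLength() would report 3.
val backing: Array[Byte] = "foo".getBytes ++ Array.fill[Byte](5)(0)
val length = 3

// Processing only the first getLength() bytes is what copyBytes() does for you:
val trimmed: Array[Byte] = Arrays.copyOfRange(backing, 0, length)

println(new String(trimmed)) // prints "foo"
println(backing.length)      // 8: the padded buffer, not just your data
```

The getBytes/getLength route skips the extra allocation that copyBytes (effectively a copyOfRange) performs, at the cost of carrying the length around yourself.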
Can anyone post an entire working code snippet including what packages to be imported? Thanks.
Chris Bedford
@Choix - I had the same issue. Posting snippet that solved my problem as a separate answer.
Here is a snippet with all required imports that you can run from spark-shell, as requested by @Choix
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable

val path = "/tmp/path"
val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
bytesRdd.saveAsSequenceFile(path)

val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
val recoveredAsString = recovered.map(new String(_))
recoveredAsString.collect()
// result is: Array[String] = Array(foo)