How to 'Pipe' Binary Data in Apache Spark

Question

I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.

This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.

Wait, do you have the RDD containing binary data, or do you need to run an external program to create it? Running the program once you have the binary data is a classic question.
– Francois G, Commented Jan 19, 2015 at 11:11
Yes, but have you managed to load it into an RDD, as the beginning of your question suggests (but contrary to what your answer suggests)?
– Francois G, Commented Jan 19, 2015 at 18:04
Correct, I am not able to read in the binary data and that is the source of the problem.
– Nick Allen, Commented Jan 20, 2015 at 14:06

Nick Allen · Accepted Answer · 2015-01-16 18:40:32Z

5

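The problem is not my use of 'pipe'; it is that 'textFile' cannot be used to read binary data. (Doh.) 'textFile' treats its input as newline-delimited text, so arbitrary bytes are split into lines and decoded as strings before 'pipe' ever sees them. There are a couple of options for moving forward.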

1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)
2. Use 'SparkContext.binaryFiles' to read each binary file in its entirety as a single record. This hurts performance because it prevents more than one mapper from working on a single file's data.
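
Option 2 might look roughly like the sketch below. It rests on a few assumptions: the converter script from the question reads raw bytes on stdin and writes text lines to stdout, each input file fits in executor memory, and 'run_converter' and 'convert_binary_files' are hypothetical helper names. It launches the program through Python's 'subprocess' module instead of 'RDD.pipe', because as far as I can tell PySpark's 'pipe' hands each record to the process as a line of text, which would not preserve raw bytes.

```python
import subprocess


def run_converter(record, command=("/usr/bin/binary-to-csv.sh",)):
    """Feed one whole file's bytes to an external program; return its output lines.

    `record` is a (path, bytes) pair, the shape that binaryFiles() yields.
    Assumption: the program reads binary on stdin and writes text on stdout.
    """
    _path, payload = record
    result = subprocess.run(
        list(command),
        input=payload,           # raw bytes go straight to the program's stdin
        stdout=subprocess.PIPE,
        check=True,              # raise if the program exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace").splitlines()


def convert_binary_files(sc, input_path, output_path,
                         command=("/usr/bin/binary-to-csv.sh",)):
    """Read whole binary files with binaryFiles() and save the converted text."""
    files = sc.binaryFiles(input_path)  # RDD of (path, bytes), one record per file
    lines = files.flatMap(lambda record: run_converter(record, command))
    lines.saveAsTextFile(output_path)
```

With a live SparkContext 'sc', this would be invoked as convert_binary_files(sc, "binary-data.dat", "text-data.csv"). Parallelism comes only from having many input files, since each file is handled as one record.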

For #1 in my specific case, I can find only one project, from RIPE-NCC, that does this. Unfortunately, it appears to support only a limited set of network protocols.
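
To see why a line-oriented text reader cannot carry binary data, here is a rough simulation in plain Python. It is not Spark's actual code path, only an approximation of the same idea: split on newline bytes, decode as UTF-8, then write the lines back out. The sample bytes are the start of a little-endian pcap file header plus a few arbitrary bytes chosen for illustration, and the two function names are hypothetical.

```python
# Start of a little-endian pcap file header (magic number, version 2.4),
# followed by a byte that happens to equal a newline and two arbitrary bytes.
raw = b"\xd4\xc3\xb2\xa1\x02\x00\x04\x00" + b"\x0a\xff\x00"


def read_as_text_lines(data):
    """Approximate a text reader: split on newline bytes, decode each piece as UTF-8."""
    return [chunk.decode("utf-8", errors="replace") for chunk in data.split(b"\n")]


def write_as_text_lines(lines):
    """Approximate handing those lines to an external program as text."""
    return "\n".join(lines).encode("utf-8")


lines = read_as_text_lines(raw)
mangled = write_as_text_lines(lines)

print(len(lines), "records from", len(raw), "bytes")
print("round trip intact:", mangled == raw)
```

The round trip does not reproduce the input: bytes that are not valid UTF-8 are replaced during decoding, and the newline byte inside the data is treated as a record boundary.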


1 Comment

nealmcb Over a year ago

Can you split the binary data out into multiple binary files? That is how binaryFiles() is designed. But I'm afraid that even then, there is a memory bottleneck as noted in stackoverflow.com/q/30704814/507544
