
I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.

This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.

  • Wait, do you have the RDD containing binary data, or do you need to run an external program to create it? Running the program once you have the binary data is a classic question. Commented Jan 19, 2015 at 11:11
  • I have the binary data and need to create text data. Commented Jan 19, 2015 at 17:34
  • Yes, but have you managed to load it into an RDD, as the beginning of your question suggests (but contrary to what your answer suggests)? Commented Jan 19, 2015 at 18:04
  • Correct, I am not able to read in the binary data and that is the source of the problem. Commented Jan 20, 2015 at 14:06

1 Answer

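The problem is not my use of 'pipe', but that 'textFile' cannot be used to read binary data: it treats the input as newline-delimited text, so arbitrary bytes are mangled before 'pipe' ever sees them. (Doh) There are a couple of options for moving forward.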

  1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)

  2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This hurts performance, because a file read as one record cannot be split across more than one mapper.

For option 1 in my specific case (pcap), I could find only one project, from RIPE-NCC, that does this. Unfortunately, it appears to support only a limited set of network protocols.
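For option 2, a minimal PySpark sketch might look like the following. It assumes the external converter reads raw bytes on stdin and writes text lines to stdout, which is the same assumption 'pipe' would have made. The helper names 'run_converter' and 'binary_to_text_rdd' are hypothetical, invented for this illustration. The external program is invoked with 'subprocess' rather than 'RDD.pipe', since each record here is a whole file's bytes rather than a line of text.

```python
import subprocess


def run_converter(content, cmd):
    """Feed raw bytes to an external program on stdin and return its stdout as text lines.

    Assumption: the program reads binary data from stdin and emits UTF-8 text.
    """
    result = subprocess.run(cmd, input=content, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8").splitlines()


def binary_to_text_rdd(sc, path, cmd):
    """Hypothetical helper: read each file whole and convert it with the external program."""
    # binaryFiles yields one (file path, file contents as bytes) pair per file,
    # so each file is converted by a single task.
    return sc.binaryFiles(path).flatMap(lambda kv: run_converter(kv[1], cmd))
```

With the paths from the question, usage would be roughly 'binary_to_text_rdd(sc, "binary-data.dat", ["/usr/bin/binary-to-csv.sh"]).saveAsTextFile("text-data.csv")'. Each file must fit in the memory of a single executor, so this only scales if the capture is spread over many reasonably sized files.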

1 Comment

Can you split the binary data out into multiple binary files? That is how binaryFiles() is designed. But I'm afraid that even then, there is a memory bottleneck as noted in stackoverflow.com/q/30704814/507544
