
I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:

import spark.implicits._  // for .toDF and the 'col / $"col" syntax

val df = Seq(
  Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
).toDF("binary_value")
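For reference, this produces a single BinaryType column; a quick sanity check (assuming a SparkSession named spark is in scope) should show roughly:

df.printSchema()
// root
//  |-- binary_value: binary (nullable = true)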

The 4 bytes in this example, read as a big-endian unsigned 32-bit integer, form the number 742 (0x000002e6 = 0x02 * 256 + 0xe6 = 512 + 230 = 742). Using a UDF, the value can be decoded like this:

import org.apache.spark.sql.functions.udf

// BigInt(Array[Byte]) reads the bytes as a big-endian two's-complement number
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)

df.withColumn("numerical_value", bytesToInt('binary_value))
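For what it's worth, the same decoding can be written with java.nio.ByteBuffer instead of BigInt; this is just a sketch (the name bytesToIntNio is made up here), and it is still a UDF, so it doesn't avoid the overhead:

import java.nio.ByteBuffer
import org.apache.spark.sql.functions.udf

// ByteBuffer defaults to big-endian; getInt reads exactly 4 bytes as a
// signed Int, which matches BigInt(x).toInt for 4-byte inputs.
val bytesToIntNio = udf((x: Array[Byte]) => ByteBuffer.wrap(x).getInt)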

Either approach works, but at the cost of a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.

Is there a way to interpret the binary column as another data type using Spark-native functions?


1 Answer


One way is to convert the binary value to a hex string (using hex) and then from base 16 to base 10 (using conv):

import org.apache.spark.sql.functions.{conv, hex}

df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
// +-------------+---------------+
// | binary_value|numerical_value|
// +-------------+---------------+
// |[00 00 02 E6]|            742|
// +-------------+---------------+
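Note that conv returns a string column, so if you need an actual numeric type you can cast the result; a minimal sketch:

// conv yields a StringType column; cast to get a numeric type. Using "long"
// here so 4-byte values with the high bit set (read as unsigned) still fit.
df.withColumn(
  "numerical_value",
  conv(hex($"binary_value"), 16, 10).cast("long")
)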

1 Comment

That's great! Feels a bit inefficient to go via text, but it's just what I was looking for. I tried to do some benchmarking but didn't find a significant difference. Thanks!
