
I have a column of type binary. The values are 4 bytes long, and I would like to interpret them as an Int. An example DataFrame looks like this:

import spark.implicits._  // for .toDF and the 'col / $"col" syntax

val df = Seq(
  Array(0x00.toByte, 0x00.toByte, 0x02.toByte, 0xe6.toByte)
).toDF("binary_value")
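For reference, this produces a single BinaryType column; a quick sanity check (assuming a SparkSession named spark is in scope) should show roughly:

df.printSchema()
// root
//  |-- binary_value: binary (nullable = true)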

The 4 bytes in this example, read as a big-endian unsigned 32-bit integer, form the number 742 (0x000002e6 = 0x02 * 256 + 0xe6 = 512 + 230 = 742). Using a UDF, the value can be decoded like this:

import org.apache.spark.sql.functions.udf

// BigInt(Array[Byte]) reads the bytes as a big-endian two's-complement number
val bytesToInt = udf((x: Array[Byte]) => BigInt(x).toInt)

df.withColumn("numerical_value", bytesToInt('binary_value))
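For what it's worth, the same decoding can be written with java.nio.ByteBuffer instead of BigInt; this is just a sketch (the name bytesToIntNio is made up here), and it is still a UDF, so it doesn't avoid the overhead:

import java.nio.ByteBuffer
import org.apache.spark.sql.functions.udf

// ByteBuffer defaults to big-endian; getInt reads exactly 4 bytes as a
// signed Int, which matches BigInt(x).toInt for 4-byte inputs.
val bytesToIntNio = udf((x: Array[Byte]) => ByteBuffer.wrap(x).getInt)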

Either approach works, but at the cost of a UDF and the corresponding serialization/deserialization overhead. I was hoping to do something like 'binary_value.cast("array<byte>") and take it from there, or even 'binary_value.cast("int"), but Spark doesn't allow it.

Is there a way to interpret the binary column as another data type using Spark-native functions?


1 Answer


One way is to convert the binary value to a hex string (using hex) and then from base 16 to base 10 (using conv):

import org.apache.spark.sql.functions.{conv, hex}

df.withColumn("numerical_value", conv(hex($"binary_value"), 16, 10)).show()
// +-------------+---------------+
// | binary_value|numerical_value|
// +-------------+---------------+
// |[00 00 02 E6]|            742|
// +-------------+---------------+
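Note that conv returns a string column, so if you need an actual numeric type you can cast the result; a minimal sketch:

// conv yields a StringType column; cast to get a numeric type. Using "long"
// here so 4-byte values with the high bit set (read as unsigned) still fit.
df.withColumn(
  "numerical_value",
  conv(hex($"binary_value"), 16, 10).cast("long")
)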

1 Comment

That's great! Feels a bit inefficient to go via text, but it's just what I was looking for. I tried to do some benchmarking but didn't find a significant difference. Thanks!
