I am trying to convert an array of strings to a byte array in Spark and then convert the byte array back to an array of strings.

However, I am not getting the string array back as I intended. Here is the code:

// UDFs for converting Array[String] to byte array and get back Array[String] from byte array
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.databind.ObjectMapper 

val mapper: ObjectMapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)

val convertToByteArray = udf((map: Seq[String]) => mapper.writeValueAsBytes(map))
val convertToString = udf((a: Array[Byte])=> new String(a))

val arrayDF = Seq(
  ("x100", Array("p1","p2","p3","p4"))
).toDF("id", "myarray")
arrayDF.printSchema()
root
 |-- id: string (nullable = true)
 |-- myarray: array (nullable = true)
 |    |-- element: string (containsNull = true)
arrayDF.show(false)
+----+----------------+
|id  |myarray         |
+----+----------------+
|x100|[p1, p2, p3, p4]|
+----+----------------+

val converted = arrayDF.withColumn("bytearray", convertToByteArray($"myarray")).select($"id",$"bytearray")
converted.printSchema()
root
 |-- id: string (nullable = true)
 |-- bytearray: binary (nullable = true)
converted.show(false)
+----+----------------------------------------------------------------+
|id  |bytearray                                                       |
+----+----------------------------------------------------------------+
|x100|[5B 22 70 31 22 2C 22 70 32 22 2C 22 70 33 22 2C 22 70 34 22 5D]|
+----+----------------------------------------------------------------+

val getBack = converted.withColumn("getstring", convertToString($"bytearray")) 
getBack.printSchema()
root
 |-- id: string (nullable = true)
 |-- bytearray: binary (nullable = true)
 |-- getstring: string (nullable = true)
getBack.show(false)
+----+----------------------------------------------------------------+---------------------+
|id  |bytearray                                                       |getstring            |
+----+----------------------------------------------------------------+---------------------+
|x100|[5B 22 70 31 22 2C 22 70 32 22 2C 22 70 33 22 2C 22 70 34 22 5D]|["p1","p2","p3","p4"]|
+----+----------------------------------------------------------------+---------------------+

However, I want my final result to be:

+----+----------------------------------------------------------------+---------------------+
|id  |bytearray                                                       |getstring            |
+----+----------------------------------------------------------------+---------------------+
|x100|[5B 22 70 31 22 2C 22 70 32 22 2C 22 70 33 22 2C 22 70 34 22 5D]|[p1,p2,p3,p4]        |
+----+----------------------------------------------------------------+---------------------+

Here is the dependency from the pom.xml that I use for creating the byte array:

<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.9.5</version>
</dependency>

1 Answer

You take a list of strings and serialize it as a single object, but when converting back you treat the bytes as if they were just a string. If you want a single string back, you also need to convert the list to a string before serializing:

val convertToByteArray = udf((map: Seq[String]) => mapper.writeValueAsBytes(map.mkString("[",",","]")))
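If the goal is an array column rather than a single string, another option is to leave the question's JSON serialization unchanged and parse the bytes back with Jackson on the read side. This is a minimal sketch, assuming a spark-shell session like the one in the question. The names `readMapper`, `fromJsonBytes`, and `convertToStringArray` are made up for illustration. It relies on `ObjectMapper.readValue(byte[], Class)` binding a JSON array of strings to a `String[]`.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.udf

// Hypothetical name: a plain mapper used only for reading. Binding a JSON
// array of strings to String[] does not need the Scala module.
val readMapper = new ObjectMapper()

// Parse JSON bytes such as ["p1","p2"] back into a sequence of strings.
// A null input is passed through so that null column values stay null.
def fromJsonBytes(bytes: Array[Byte]): Seq[String] =
  if (bytes == null) null
  else readMapper.readValue(bytes, classOf[Array[String]]).toSeq

// Hypothetical UDF name: turns the binary column into an array<string> column.
val convertToStringArray = udf(fromJsonBytes _)

// Possible usage with the `converted` DataFrame from the question:
// val getBack = converted.withColumn("getstring", convertToStringArray($"bytearray"))
```

With this, `getstring` should come back as an array column instead of the raw JSON text, because the read side now mirrors what `writeValueAsBytes` did on the write side.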

6 Comments

It still returns a string, not an array of strings. With your code, if I convert the byte array back to a string, I get "[p1,p2,p3,p4]".
That's because of what you're doing: you get a byte array and convert it to a string. In the initial conversion you're doing a poor man's serialization of a string to a byte array; it won't automagically serialize your Seq[String].
Thank you for the explanation. So what is the correct way to serialize the Seq[String] so that I can get it back from the byte array?
You need a serializer, e.g. Kryo (github.com/twitter/chill), Protobuf, etc.
Alternatively, you can use the conversion to string, then later split the string and convert it back to a sequence.
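The join-then-split idea from the last comment could look roughly like this. It is a sketch using only the standard library, and it assumes that no element contains the delimiter. `Delimiter`, `joinToBytes`, and `splitFromBytes` are hypothetical names, not part of Spark or Jackson.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

// Assumption: the delimiter never occurs inside an element.
val Delimiter = ","

// Join the elements into one string and encode it as UTF-8 bytes.
def joinToBytes(xs: Seq[String]): Array[Byte] =
  xs.mkString(Delimiter).getBytes(UTF_8)

// Decode the bytes and split on the delimiter to recover the elements.
// The -1 limit keeps trailing empty strings. An empty payload is read back
// as an empty sequence, so Seq("") and Seq() cannot be told apart.
def splitFromBytes(bytes: Array[Byte]): Seq[String] = {
  val s = new String(bytes, UTF_8)
  if (s.isEmpty) Seq.empty
  else s.split(Pattern.quote(Delimiter), -1).toSeq
}
```

Both functions can be wrapped with `udf(...)` in the same way as the UDFs in the question. If the elements may contain the delimiter, a real serialization format is the safer choice.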