
I'm attempting to write a Spark (Scala) DataFrame column as an array of bytes. I have a DataFrame with two columns: the first is a string, and the second is a map from strings to longs.

For example,

user_id | map
"ac2"   | Map("c2" -> 1, "b3" -> 5)

I want to write the map column as an array of bytes. So far I've attempted to use Jackson with the following UDF:

val writeJackson = udf { x: Map[String, Long] =>
    jacksonWriter.writeValueAsBytes(x)
}

val df2 = df.withColumn("jacksonMap", writeJackson($"map"))

but this fails because of

java.io.NotSerializableException: com.fasterxml.jackson.module.paranamer.shaded.CachingParanamer

Is there a way to get this to work with Jackson, and if not, is there a different library that will let me write this Spark column as a byte array?

1 Answer


I was able to convert the map column to a byte array and get the output with the following code (using Spark 1.6.2):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

object DF {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setAppName("DF").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Jackson mapper with the Scala module registered, so Scala Maps
    // serialize as JSON objects
    val mapper: ObjectMapper = new ObjectMapper
    mapper.registerModule(DefaultScalaModule)

    val df = Seq(
      ("ac2", Map("c2" -> 1, "b3" -> 5))
    ).toDF("id", "map")

    df.show(false)
    // output:
    // +---+---------------------+
    // |id |map                  |
    // +---+---------------------+
    // |ac2|Map(c2 -> 1, b3 -> 5)|
    // +---+---------------------+

    // UDF that serializes the map column to a JSON byte array
    val getByteArray = udf((map: Map[String, Int]) => mapper.writeValueAsBytes(map))

    df.withColumn("bytearray", getByteArray($"map")).show(false)
    // output:
    // +---+---------------------+----------------------------------------------+
    // |id |map                  |bytearray                                     |
    // +---+---------------------+----------------------------------------------+
    // |ac2|Map(c2 -> 1, b3 -> 5)|[7B 22 63 32 22 3A 31 2C 22 62 33 22 3A 35 7D]|
    // +---+---------------------+----------------------------------------------+
  }
}
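As an aside on the original error: the NotSerializableException came from a non-serializable Jackson helper being captured in the UDF closure. If Jackson keeps causing trouble, one way to sidestep it entirely is plain Java serialization from the JDK, which needs no extra objects in the closure. This is a sketch of that alternative, not part of the answer above; the Spark usage at the bottom is illustrative and untested against 1.6.2.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Sketch: encode a Map to bytes with plain Java serialization, so nothing
// Jackson-related is captured in the UDF closure.
object MapBytes {
  def toBytes(m: Map[String, Long]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(m)
    oos.close()
    bos.toByteArray
  }

  // Round-trip helper, mainly to verify the bytes decode back to the map
  def fromBytes(bytes: Array[Byte]): Map[String, Long] = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    val m = ois.readObject().asInstanceOf[Map[String, Long]]
    ois.close()
    m
  }
}

// In Spark this would be wired up as (assumed, untested):
// val writeBytes = udf((m: Map[String, Long]) => MapBytes.toBytes(m))
// df.withColumn("bytearray", writeBytes($"map"))
```

Note the bytes are JVM-serialized objects rather than JSON, so they are only readable back from the JVM; use the Jackson approach above if the payload needs to be JSON.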

Comments

How do you import ObjectMapper, and what is DefaultScalaModule?
Add this to pom.xml for ObjectMapper: <dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.9.5</version> </dependency>. DefaultScalaModule comes from the com.fasterxml.jackson.module:jackson-module-scala artifact.
Thank you for that. Can I use this UDF to convert a Spark ArrayType column to a byte array instead of a Map?
Yes, you can just change the parameter type in the UDF to the appropriate type.
I tried:

    val convertToByteArray = udf((map: Array[String]) => mapper.writeValueAsBytes(map))
    val arrayDF = Seq(("x100", Array("p1", "p2", "p3"))).toDF("id", "myarray")
    arrayDF.withColumn("bytearray", convertToByteArray($"myarray")).show(false)

but this throws an error: Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(anonfun$3: (array<string>) => binary)
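The error in the last comment is most likely a type mismatch: Spark hands an ArrayType column to a Scala UDF as a Seq (a WrappedArray under the hood), not as a plain Array, so a UDF parameter typed Array[String] fails at runtime. The pure-Scala sketch below shows the mismatch; the corrected UDF at the bottom is an assumption, not tested against the answer's Spark version.

```scala
// Spark passes ArrayType column values to Scala UDFs as a Seq (a
// WrappedArray), not a plain Array. Modeling that here with the
// standard implicit wrapping of an Array into a Seq:
val fromSpark: Seq[String] = Array("p1", "p2", "p3")

// Typing the parameter as Seq[String] accepts the wrapped value:
def describe(xs: Seq[String]): String = xs.mkString(",")

// So the failing UDF above would become (sketch, untested):
// val convertToByteArray = udf((arr: Seq[String]) => mapper.writeValueAsBytes(arr))
```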
