
I have a Spark DataFrame df with the following schema:

root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)

I would like to create a new DataFrame where each row is a Vector of Doubles, and I expect to get the following schema:

root
     |-- features: vector (nullable = true)
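
For reference, a toy DataFrame with the input schema above can be built like this (a minimal sketch, assuming spark is an active SparkSession; the sample values are made up):

import spark.implicits._

// Hypothetical sample data: each row holds an array<double> column named "features".
val df = Seq(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0)
).toDF("features")

df.printSchema() // matches the input schema shown above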

So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even a reasonable number of rows. Also, if there are too many rows, the application crashes with a heap space exception.

val clustSet = df.rdd.map { r =>
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}.map(Tuple1(_)).toDF()

I suspect that the arr.toArray call is not good Spark practice in this case. Any clarification would be very helpful.

Thank you!

1 Answer


It's because .rdd has to deserialize objects from Spark's internal in-memory format, and that is very time-consuming.

It's fine to use .toArray: you are operating at the row level, not collecting everything to the driver node.

You can do this very easily with a UDF:

import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the 'features column syntax below

// Build the ml Vector inside a UDF so the conversion stays in the DataFrame API
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF('features))

The code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD

However, the author of that question didn't ask about the performance differences.
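
For the df from the question, a minimal usage sketch (assuming the imports above are in scope; clustSet is just an illustrative name) would be:

import org.apache.spark.sql.functions.col

// Replace the array<double> column with an ml Vector column of the same name.
val clustSet = df.withColumn("features", convertUDF(col("features")))

clustSet.printSchema()
// root
//  |-- features: vector (nullable = true)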


3 Comments

Thank you very much, that helped a lot, and I marked it as the answer. I can run more rows now and the timing is satisfactory. However, I still get an exception when I try 200,000 rows: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1. Would you have an insight about this? Thanks again.
I set the following in my code: val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.kryoserializer.buffer.max.mb", "256") and it worked! Thank you.
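
For readers hitting the same Kryo buffer overflow, a minimal sketch of that configuration (this swaps in spark.kryoserializer.buffer.max, the non-deprecated form of the spark.kryoserializer.buffer.max.mb key used in the comment above):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Enable Kryo and raise the maximum serialization buffer; 256m is the value
// from the comment above, so tune it to the size of your largest serialized object.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "256m")

val spark = SparkSession.builder().config(conf).getOrCreate()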
