
I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist| 
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried several variants of the following udf, but it always returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a) })

val result = df.withColumn("dist", toInt4(df("dist"))).select("dist")
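
Presumably the conversion has to happen inside the udf body rather than just declaring Int as the return type, so something like the sketch below is what I'm aiming for (toIntArray is just a name I made up), but I can't work out which Vector type the udf should accept:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// Intended conversion: vector of doubles -> array of ints.
val toIntArray = udf { (v: Vector) => v.toArray.map(_.toInt) }
val result = df.withColumn("dist", toIntArray(df("dist")))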

3 Answers


I struggled for a while to get the answer from @ThomasLuechtefeld working, but kept running into this very frustrating error:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type.

It turned out I needed to import DenseVector from the ml package instead of the mllib package.

So this worked for me:

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._

val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }
myDataframe.withColumn("clusters_scaled", vectorToColumn(col("features_scaled"), lit(0)))

Yes, the only difference is that first line. This should absolutely be a comment, but I don't have the reputation. Sorry!


1 Comment

You could also consider val vectorToColumn = udf { (x: org.apache.spark.ml.linalg.Vector, index: Int) => x(index) }, as it will handle both DenseVector and SparseVector.
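
For illustration, here's a self-contained sketch of that generic-Vector variant; the two-row frame and the names df and first are just placeholders, chosen to show that the same udf covers both dense and sparse columns:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, lit, udf}
import spark.implicits._ // spark is the active SparkSession

// One udf for both vector kinds, via the common Vector trait.
val vectorToColumn = udf { (x: Vector, index: Int) => x(index) }

// Toy frame with one dense row and one sparse row.
val df = Seq(
  Tuple1(Vectors.dense(2.0, 3.0)),
  Tuple1(Vectors.sparse(2, Array(0), Array(4.0)))
).toDF("features_scaled")

df.withColumn("first", vectorToColumn(col("features_scaled"), lit(0))).show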

I think it's easiest to do it by going to the RDD API and then back.

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext.implicits._ // needed for toDF

// The original data.
val input: DataFrame =
  sc.parallelize(1 to 4)
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2)))
    .toDF("id", "dist")

// Turn it into an RDD for manipulation (.rdd also keeps this
// working on Spark 2.x, where DataFrame.map returns a Dataset).
val inputRDD: RDD[(Double, DenseVector)] =
  input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))

// Change the DenseVector into an integer array.
val outputRDD: RDD[(Double, Array[Int])] =
  inputRDD.mapValues(_.toArray.map(_.toInt))

// Go back to a DataFrame.
val output = outputRDD.toDF("id", "dist")
output.show

You get:

+---+----+
| id|dist|
+---+----+
|1.0| [2]|
|2.0| [4]|
|3.0| [6]|
|4.0| [8]|
+---+----+
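
One trade-off worth noting: the round trip through the RDD API drops the DataFrame schema (which is why the columns have to be renamed in toDF at the end) and moves the conversion outside Catalyst, whereas the udf-based answers keep everything inside the DataFrame plan.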



In Spark 2.0 you can do something like:

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf

val vectorHead = udf{ x:DenseVector => x(0) }
df.withColumn("firstValue", vectorHead(df("vectorColumn")))

1 Comment

@pwb2103 mentions that the first line should be import org.apache.spark.ml.linalg.DenseVector
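
Folding in that comment, here's a sketch that converts the whole vector to integers (the question's original goal) rather than just taking the head; vectorToInts and dist_ints are placeholder names:

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.udf

// Convert every element, not just element 0.
val vectorToInts = udf { x: DenseVector => x.toArray.map(_.toInt) }
df.withColumn("dist_ints", vectorToInts(df("dist")))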
