2

Relatively new to scala and the Spark API kit but I have a question trying to make use of the vector assembler

http://spark.apache.org/docs/latest/ml-features.html#vectorassembler

to then make use of matrix correlations

https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations

The dataframe column is of dtype linalg.Vector

val assembler = new VectorAssembler()

val trainwlabels3 = assembler.transform(trainwlabels2)

trainwlabels3.dtypes(0)

res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)

and yet calling this to an RDD for the statistics tool throws a mismatch error.

val data: RDD[Vector] = sc.parallelize(
  trainwlabels3("features")
) 

<console>:80: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Seq[org.apache.spark.mllib.linalg.Vector]

Thanks in advance for any help.

2 Answers 2

1

You should just select:

val features = trainwlabels3.select($"features")

Convert to RDD

 val featuresRDD = features.rdd

and map:

featuresRDD.map(_.getAs[Vector]("features"))
Sign up to request clarification or add additional context in comments.

Comments

0

This should work for you:

val rddForStatistics = new VectorAssembler()
   .transform(trainwlabels2)
   .select($"features")
   .as[Vector] //turns Dataset[Row] (a.k.a DataFrame) to DataSet[Vector]
   .rdd

However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package) because working with RDDs is all but deprecated in MLlib.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.