0

I have a Scala Spark DataFrame (Variable df):

id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...

I am trying to make use of RowMatrix to calculate pairwise cosine similarity efficiently.

final case class dataRow(id: String, values: Array[Double])

val rows = df.as[dataRow].map {
  row => {
        Vectors.dense(row.values)
    }
}.rdd

I am having the following compilation error

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ 

Eventually, I would be able to do this (RowMatrix requires an RDD[Vector])

val mat = new RowMatrix(rows)

I have already imported spark.implicits_, what am I doing wrong?

2
  • Does the id matter in any way? Note that RowMatrix will not keep the order between the rows, if that is important use IndexedRowMatrix instead. Commented Feb 12, 2019 at 9:04
  • @Shaido You're right, I do need the id so that I can identify which are the pairs. Commented Feb 12, 2019 at 10:05

2 Answers 2

1

There is simply no implicit Encoder for Vector types. So either push map after `rdd

val rows = df.as[dataRow].rdd.map(row => Vectors.dense(row.values))

or provide an Encoder

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

ds.as[dataRow].map(x => Vectors.dense(x.values))(ExpressionEncoder(): Encoder[Vector])
Sign up to request clarification or add additional context in comments.

Comments

0

Which Vectors object are you using?

Try importing the linalg context. There might be conflicts within the libraries.

Also move the case class domain object outside of your function scope and then remove final

import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

case class DataRow(id: String, values: Array[Double])


def func(spark: SparkSession, df: DataFrame): RowMatrix = {
   import spark.implicits._

   val rows = df.as[DataRow]
      .map(row => Vectors.dense(row.values))
      .rdd

   val mat = new RowMatrix(rows)
   mat
}

5 Comments

I'm using import org.apache.spark.ml.linalg.{Vector, Vectors}. Case class is outside the scope, I am still getting the same compilation error
Which version of spark are you using? Odd, because your code very much resmbles that of the cosine similarity: github.com/apache/spark/blob/master/examples/src/main/scala/org/…
thats exactly what I'm trying to do with this library, except trying to overcome the encoding error.
@Ivan where do you insert 'import spark.implicits._'? can you show us in your example
@Ivan I think you can try selecting df and cast "id" as a string, and values as an Array[Double]

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.