I created a BucketedRandomProjectionLSHModel to find the approximate nearest neighbours of every row in my dataset. The signature of the approxNearestNeighbors method is

def approxNearestNeighbors(
      dataset: Dataset[_],
      key: Vector,
      numNearestNeighbors: Int): Dataset[_] 

To run it on every row of the dataframe, my idea is to create some udf which calls this function and convert the resulting Dataset into a column of ArrayType[StructType].

Suppose my initial schema is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)

My target schema (after calling .withColumn("neighbours", udf...)) is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)
 |-- neighbours: array (nullable = true)
 |    |-- element: struct
 |    |    |-- genderIndex: double (nullable = false)
 |    |    |-- genderIndexVec: vector (nullable = true)
 |    |    |-- categoryIndex: double (nullable = false)
 |    |    |-- categoryIndexVec: vector (nullable = true)
 |    |    |-- features: vector (nullable = true)
 |    |    |-- featureStdDev: vector (nullable = true)

Please help with my UDF as I am not sure how to make it work.

val model = // BucketedRandomProjectionLSHModel definition
val inputDF = // Input definition
val nn = udf{ (featureVector: SparseVector, k: Int) =>
      model.approxNearestNeighbors(inputDF, featureVector, k)
      // What now...
    }
  • Possible duplicate of Using LSH in spark to run nearest neighbors query on every point in dataframe Commented Apr 11, 2019 at 1:56
  • @Shaido Approximate similarity join does something slightly different. A similarity join of the df against itself will return either an exact copy or a highly similar "neighbour", i.e. you cannot tell whether a feature vector has a similar neighbour within the df, whereas approximate nearest neighbours can guarantee finding k similar neighbours (or fewer than k, which means there are not enough). Commented Apr 11, 2019 at 9:45
  • Yes, for the similarity join you need to specify a threshold, so there is no guarantee of getting a specific number of neighbors (or any at all). Unfortunately, as far as I can see, approxNearestNeighbors currently only supports single-vector inputs, which is why you want to use a UDF. The problem with the UDF approach is that you can't refer to a dataframe inside a UDF, see e.g. stackoverflow.com/questions/47509249/… Commented Apr 11, 2019 at 9:59
  • @Shaido That's very useful. I shall see whether there are ways other than a UDF to achieve the result. Commented Apr 11, 2019 at 10:37
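
Building on the comments above, one possible non-UDF workaround is to run approxSimilarityJoin of the dataframe against itself with a generous threshold and then keep only the k closest matches per row with a window function. The following is a minimal sketch, not a definitive solution. It assumes a unique `id` column (not present in the schema from the question), and the helper name `neighboursPerRow`, the toy data, and the threshold, bucket length and hash table values are all hypothetical. Unlike approxNearestNeighbors, a row still gets fewer than k neighbours (or none) if not enough candidates fall within the threshold or share a hash bucket.

```scala
import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, row_number}

object KnnViaSimilarityJoin {

  // Hypothetical helper: attaches up to k approximate neighbours to every row.
  // Assumes df has a unique "id" column (an assumption, not part of the question).
  def neighboursPerRow(model: BucketedRandomProjectionLSHModel,
                       df: DataFrame,
                       k: Int,
                       threshold: Double): DataFrame = {
    // Self-join: each result row holds structs "datasetA" and "datasetB" plus the distance.
    val pairs = model
      .approxSimilarityJoin(df, df, threshold, "distance")
      .toDF()
      .filter(col("datasetA.id") =!= col("datasetB.id")) // drop each row's match with itself

    // Rank the candidates of every left-hand row by distance.
    val byDistance = Window.partitionBy(col("datasetA.id")).orderBy(col("distance"))

    val topK = pairs
      .withColumn("rank", row_number().over(byDistance))
      .filter(col("rank") <= k)
      .groupBy(col("datasetA.id").as("id"))
      // Note: collect_list does not guarantee the order of the collected structs.
      .agg(collect_list(col("datasetB")).as("neighbours"))

    // Left join so rows without any candidate keep a null "neighbours" column.
    df.join(topK, Seq("id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("knn-sketch").getOrCreate()
    import spark.implicits._

    // Toy data standing in for inputDF from the question.
    val inputDF = Seq(
      (0L, Vectors.dense(0.0, 0.0)),
      (1L, Vectors.dense(0.1, 0.0)),
      (2L, Vectors.dense(5.0, 5.0)),
      (3L, Vectors.dense(5.1, 5.0))
    ).toDF("id", "features")

    val model = new BucketedRandomProjectionLSH()
      .setInputCol("features")
      .setOutputCol("hashes")
      .setBucketLength(2.0)   // hypothetical value
      .setNumHashTables(3)    // hypothetical value
      .setSeed(42L)
      .fit(inputDF)

    neighboursPerRow(model, inputDF, k = 2, threshold = 10.0).show(truncate = false)
    spark.stop()
  }
}
```

Each element of `neighbours` is the full `datasetB` struct, so it carries the neighbour's original columns plus the hash column. If the distance is also needed, it would have to be collected alongside the struct.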
