Create a column of ArrayType[StructType] from Dataframe in a UDF

Asked 6 years, 7 months ago

Modified 6 years, 7 months ago

Viewed 142 times

I created a BucketedRandomProjectionLSHModel to find the approximate nearest neighbours of every row in my dataset. The signature of the approximate nearest neighbours function is

def approxNearestNeighbors(
      dataset: Dataset[_],
      key: Vector,
      numNearestNeighbors: Int): Dataset[_]

To run it on every row of the DataFrame, my idea is to create a UDF that calls this function and converts the resulting Dataset into a column of ArrayType[StructType].

Suppose my initial schema is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)

My target schema (after calling .withColumn("neighbours", udf...)) is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)
 |-- neighbours: array (nullable = true)
 |    |-- element: struct
 |    |    |-- genderIndex: double (nullable = false)
 |    |    |-- genderIndexVec: vector (nullable = true)
 |    |    |-- categoryIndex: double (nullable = false)
 |    |    |-- categoryIndexVec: vector (nullable = true)
 |    |    |-- features: vector (nullable = true)
 |    |    |-- featureStdDev: vector (nullable = true)
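
For reference, the target schema above can also be written out with Spark's type objects. This is only a sketch: the field names and nullability are copied from the printed schema, while the object name TargetSchema and the value names are made up for illustration.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

// Hypothetical holder object; not part of the question's code.
object TargetSchema {
  // The six columns of the initial schema, as printed in the question.
  val rowSchema: StructType = StructType(Seq(
    StructField("genderIndex", DoubleType, nullable = false),
    StructField("genderIndexVec", VectorType, nullable = true),
    StructField("categoryIndex", DoubleType, nullable = false),
    StructField("categoryIndexVec", VectorType, nullable = true),
    StructField("features", VectorType, nullable = true),
    StructField("featureStdDev", VectorType, nullable = true)
  ))

  // Same columns plus `neighbours`: an array whose elements are structs
  // shaped like one input row.
  val targetSchema: StructType =
    rowSchema.add("neighbours", ArrayType(rowSchema), nullable = true)

  def main(args: Array[String]): Unit =
    println(targetSchema.treeString)
}
```

Writing the schema this way makes it explicit that each array element is a full copy of a row, which is what any approach to filling the neighbours column has to produce.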

Please help with my UDF as I am not sure how to make it work.

val model = // BucketedRandomProjectionLSHModel definition
val inputDF = // Input definition
val nn = udf{ (featureVector: SparseVector, k: Int) =>
      model.approxNearestNeighbors(inputDF, featureVector, k)
      // What now...
    }

edited Apr 11, 2019 at 10:00

Shaido

asked Apr 10, 2019 at 16:40

Zed Ekkes

Possible duplicate of Using LSH in spark to run nearest neighbors query on every point in dataframe

– Shaido, Apr 11, 2019 at 1:56

@Shaido Approx Similarity Join does something slightly different. A similarity join of the df against itself will return either an exact copy or a highly similar "neighbour", i.e. you cannot tell whether the feature vector has a similar neighbour within the df, whereas approx nearest neighbours can guarantee to find k similar neighbours (or fewer than k, which means there are not enough).

– Zed Ekkes, Apr 11, 2019 at 9:45

Yes, for the similarity join you need to specify a threshold, so there is no guarantee of getting a specific number of neighbors (or any at all). Unfortunately, as far as I can see, approxNearestNeighbors currently only supports single-vector inputs, which is why you want to use a UDF. The problem with the UDF approach is that you can't refer to a DataFrame inside a UDF, see e.g. stackoverflow.com/questions/47509249/…

– Shaido, Apr 11, 2019 at 9:59
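
To make the threshold trade-off concrete, here is a minimal sketch of a similarity self-join followed by a top-k cut per row. It is not taken from the thread. It assumes the input has a unique `id` column of type long (the question's schema has none, so one would have to be added), and the names `withNeighbours`, `k` and `threshold`, as well as the toy data and LSH parameters, are hypothetical.

```scala
import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, row_number}

object NeighboursViaJoin {
  // Assumes `inputDF` has a unique `id` column (hypothetical) next to the
  // model's input column.
  def withNeighbours(model: BucketedRandomProjectionLSHModel, inputDF: DataFrame,
                     k: Int, threshold: Double): DataFrame = {
    // Candidate pairs closer than `threshold`, minus each row paired with itself.
    val pairs = model
      .approxSimilarityJoin(inputDF, inputDF, threshold, "distance")
      .toDF()
      .filter(col("datasetA.id") =!= col("datasetB.id"))

    // Rank the candidates of each query row by distance and keep the k closest.
    val byDistance = Window.partitionBy(col("datasetA.id")).orderBy(col("distance"))
    val neighbours = pairs
      .withColumn("rank", row_number().over(byDistance))
      .filter(col("rank") <= k)
      .groupBy(col("datasetA.id").as("id"))
      .agg(collect_list(col("datasetB")).as("neighbours"))

    // The left join keeps rows that have no candidate under the threshold;
    // their `neighbours` value is null.
    inputDF.join(neighbours, Seq("id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("neighbours-via-join").getOrCreate()
    import spark.implicits._

    // Toy data, for illustration only.
    val inputDF = Seq(
      (0L, Vectors.dense(1.0, 1.0)),
      (1L, Vectors.dense(1.0, -1.0)),
      (2L, Vectors.dense(-1.0, -1.0)),
      (3L, Vectors.dense(-1.0, 1.0))
    ).toDF("id", "features")

    val model = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(inputDF)

    withNeighbours(model, inputDF, k = 2, threshold = 3.0).show(truncate = false)
    spark.stop()
  }
}
```

As the comments above point out, a row gets fewer than k neighbours (possibly none) when the threshold is too tight, so this is not equivalent to approxNearestNeighbors. Each struct in the array carries every column of the matching row plus the model's hash column, and collect_list does not promise that the array is ordered by distance.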

@Shaido That's very useful. I shall see if there are ways other than a UDF to achieve the result.

– Zed Ekkes, Apr 11, 2019 at 10:37
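
One way to avoid a UDF while still calling approxNearestNeighbors is to loop over the keys on the driver, where using a DataFrame is allowed. The sketch below is an illustration under assumptions, not a method from the thread: it assumes a unique `id` column of type long and an input column named `features`, and the names `neighboursPerRow` and `queryId` are made up. It collects every key to the driver and runs one query per row, so it is only plausible for small inputs.

```scala
import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, lit, struct}

object NeighboursOnDriver {
  // Assumes `inputDF` has a unique long `id` column (hypothetical) and a
  // vector column named `features`.
  def neighboursPerRow(model: BucketedRandomProjectionLSHModel,
                       inputDF: DataFrame, k: Int): DataFrame = {
    // Bring all keys to the driver; this is what limits the approach to small data.
    val keys: Array[(Long, Vector)] = inputDF.select("id", "features").collect()
      .map(r => (r.getLong(0), r.getAs[Vector](1)))
    require(keys.nonEmpty, "inputDF must contain at least one row")

    // One approxNearestNeighbors query per key, issued from the driver.
    // The query row itself is in `inputDF`, so it can show up among its own
    // neighbours with distance 0.
    val perKey: Array[DataFrame] = keys.map { case (id, key) =>
      model.approxNearestNeighbors(inputDF, key, k, "distance")
        .select(
          lit(id).as("queryId"),
          struct(col("id"), col("features"), col("distance")).as("neighbour"))
    }

    // Stack the per-key results and fold them into one array per query row.
    val grouped = perKey.reduce(_ union _)
      .groupBy("queryId")
      .agg(collect_list("neighbour").as("neighbours"))
      .withColumnRenamed("queryId", "id")

    inputDF.join(grouped, Seq("id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("neighbours-on-driver").getOrCreate()
    import spark.implicits._

    // Toy data, for illustration only.
    val inputDF = Seq(
      (0L, Vectors.dense(1.0, 1.0)),
      (1L, Vectors.dense(1.0, -1.0)),
      (2L, Vectors.dense(-1.0, -1.0)),
      (3L, Vectors.dense(-1.0, 1.0))
    ).toDF("id", "features")

    val model = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(inputDF)

    neighboursPerRow(model, inputDF, k = 2).show(truncate = false)
    spark.stop()
  }
}
```

The result has the array-of-struct shape asked for in the question, although the struct here holds only `id`, `features` and `distance`; more columns can be added to the struct call. Because the union grows with the number of rows, the query plan becomes large quickly, which is another reason to treat this as a small-data workaround.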
