I need to write a for-loop with object mutation in Scala. In machine learning, when clustering (distributing samples into optimally separated groups), the clustering algorithm is run with different numbers of groups, and some error metric is calculated for each, in order to decide on the optimal number of groups. The optimal group count is where the graph of number of groups against the error metric makes an elbow. In the Spark ML library, a KMeans object is used to cluster, with the group count passed as a parameter. So I calculate the error metric for the elbow graph as follows:

var baseClusterer = new KMeans()
                   .setFeaturesCol("scaledFeatures")
                   .setPredictionCol("clusters")
                   .setSeed(0)


2 to 10 map { k =>
   baseClusterer = baseClusterer.setK(k)
   baseClusterer.fit(scaledDF).computeCost(scaledDF)
}

I have to declare the clusterer object as a var and mutate it every iteration. Is there a more Scala-like way to write this?

3 Answers

You can avoid the var by writing it this way:

2 to 10 map { k =>
     baseClusterer.setK(k).fit(scaledDF).computeCost(scaledDF)
}
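Spark's builder-style setters return the instance they mutate, so this works even if baseClusterer is declared as a val. A minimal runnable sketch of the same pattern, using a hypothetical MockClusterer in place of Spark's KMeans (the class and its cost formula are illustrative only, not Spark API):

```scala
// Hypothetical stand-in for Spark's KMeans: setK mutates the receiver
// and returns `this`, mirroring Spark's builder-style setters.
class MockClusterer {
  private var k: Int = 2
  def setK(value: Int): MockClusterer = { k = value; this }
  // Illustrative stand-in for fit(df).computeCost(df): cost shrinks as k grows.
  def cost: Double = 100.0 / k
}

val baseClusterer = new MockClusterer // a val suffices: it is never rebound

// One cost per candidate k, ready for the elbow plot.
val costs = (2 to 10).map(k => baseClusterer.setK(k).cost)
```

The map produces one cost per k without rebinding the clusterer reference, which is all the original var was doing.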

5 Comments

Thanks for the response. This is more scala-like. I would still have to declare baseClusterer as a var. I know that rule-of-thumb is to try to declare val whenever possible. And object mutation does not seem scala-like. It just bothered me that I could not write it without object mutation, that's why I asked the question.
You can declare baseClusterer as a val. But Tim's approach is even more Scala-like.
@BogdanVakulenko no, your answer is better than Tim's (actually, his does not work at all) ... I explained why in the comment to his answer.
@Dima, does that mean baseClusterer will be mutated while the result is still available? In that case foreach would be much clearer, as foreach is meant for side effects.
foreach is meant for intended side effects. Here the result you are after is the value returned by computeCost; the side effect is just an unfortunate artifact of Spark's library implementation.

Note: This version is modified from the original based on the comments

If you are going to repeat this operation on different data you might want to consider creating a list of clusterers and then using that:

val clusterers = (2 to 10).map(k =>
  new KMeans()
    .setFeaturesCol("scaledFeatures")
    .setPredictionCol("clusters")
    .setSeed(0)
    .setK(k)
)

val costs = clusterers.map(_.fit(scaledDF).computeCost(scaledDF))

But see the answer from @BogdanVakulenko for a good way to re-write the original version.

Also note that it is probably a good idea to use the same k multiple times with different setSeed values to avoid local minima.
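The hazard discussed in the comments can be reproduced with a small sketch: because a builder-style setK mutates the receiver and returns `this`, mapping it over the range yields nine references to one object rather than nine distinct clusterers (MutableClusterer below is a hypothetical stand-in, not Spark's actual class):

```scala
// Hypothetical builder mimicking Spark's KMeans.setK: mutate and return `this`.
class MutableClusterer {
  var k: Int = 2
  def setK(value: Int): MutableClusterer = { k = value; this }
}

val base = new MutableClusterer
val clusterers = (2 to 10).map(base.setK) // every element is the SAME instance

// All nine references now carry the last k that was set.
val ks = clusterers.map(_.k)
```

This is why the answer above constructs a fresh KMeans inside the map instead of reusing one base object.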

4 Comments

(2 to 10).map(baseClusterer.setK) - it would be even more Scala-like
I am not sure how KMeans is implemented, but setK sounds like it mutates the actual object rather than copying it. If that's the case, this approach will end up just doing the same thing 10 times. I think @BogdanVakulenko's answer is better, as it does not make assumptions about the implementation details of .setK
Just looked it up, it does indeed mutate the object, so this does not work. github.com/apache/spark/blob/…
@Dima Thanks for your comments, I have updated the answer to reflect your comments so it should now work correctly!

If I understand your logic correctly, maybe you could use foldLeft, where every iteration returns the modified/updated object, like this:

val finalClusterer = (2 to 10).foldLeft(baseClusterer) { (accum, k) =>
    val newClusterer = accum.copy(k = k)
    newClusterer.fit(scaledDF).computeCost(scaledDF)
    newClusterer
}

That way you would end up with a 'finalClusterer' in which you operated all the time having the base one as origin.

EDIT: My code treats the baseClusterer as a case class, hence the copy method. Since it seems to be a Java class and you don't have copy, maybe you could create an implicit class that acts as a wrapper and define such a method within it, like this:

implicit class ClustererWrapper(underlying: KMeans) {
    def copy(k: Int): KMeans = {
    ...
    }
}
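For what it's worth, here is the foldLeft shape in a self-contained form, with a hypothetical immutable case class standing in for the clusterer (Spark's KMeans is not a case class, so this only sketches the idea):

```scala
// Hypothetical immutable configuration; copy returns a fresh instance.
case class ImmutableClusterer(k: Int = 2)

val base = ImmutableClusterer()

// Each step produces a new object; the accumulator is never mutated.
val finalClusterer = (2 to 10).foldLeft(base) { (accum, k) =>
  accum.copy(k = k)
}
```

With a true immutable value, foldLeft threads each new copy through as the accumulator, so no var or mutation is needed.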

1 Comment

Thanks for the response. KMeans is a scala class in SparkML library.
