I need to write a for-loop with object mutation in Scala. In machine learning, when clustering (distributing samples into optimally separated groups), the clustering algorithm is run with different numbers of groups, and some error metric is calculated for each, in order to decide on the optimal number of groups. The optimal group count is where the graph of number of groups against the error metric makes an elbow. In the Spark ML library, a KMeans object is used to cluster, with the group count passed as a parameter. So I calculate the error metric for the elbow graph as follows:

var baseClusterer = new KMeans()
                   .setFeaturesCol("scaledFeatures")
                   .setPredictionCol("clusters")
                   .setSeed(0)


2 to 10 map { k =>
   baseClusterer = baseClusterer.setK(k)
   baseClusterer.fit(scaledDF).computeCost(scaledDF)
}

I have to declare the clusterer object as a var and mutate it every iteration. Is there a more Scala-like way to write this?

3 Answers

You can avoid the var by writing it this way:

2 to 10 map { k =>
     baseClusterer.setK(k).fit(scaledDF).computeCost(scaledDF)
}
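Spark's builder-style setters return the instance they mutate, so this works even if baseClusterer is declared as a val. A minimal runnable sketch of the same pattern, using a hypothetical MockClusterer in place of Spark's KMeans (the class and its cost formula are illustrative only, not Spark API):

```scala
// Hypothetical stand-in for Spark's KMeans: setK mutates the receiver
// and returns `this`, mirroring Spark's builder-style setters.
class MockClusterer {
  private var k: Int = 2
  def setK(value: Int): MockClusterer = { k = value; this }
  // Illustrative stand-in for fit(df).computeCost(df): cost shrinks as k grows.
  def cost: Double = 100.0 / k
}

val baseClusterer = new MockClusterer // a val suffices: it is never rebound

// One cost per candidate k, ready for the elbow plot.
val costs = (2 to 10).map(k => baseClusterer.setK(k).cost)
```

The map produces one cost per k without rebinding the clusterer reference, which is all the original var was doing.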

5 Comments

Thanks for the response. This is more scala-like. I would still have to declare baseClusterer as a var. I know that rule-of-thumb is to try to declare val whenever possible. And object mutation does not seem scala-like. It just bothered me that I could not write it without object mutation, that's why I asked the question.
You can declare baseClusterer as a val. But Tim's approach is even more Scala-like.
@BogdanVakulenko no, your answer is better than Tim's (actually, his does not work at all) ... I explained why in the comment to his answer.
@Dima, does that mean baseClusterer will be mutated while the result is still available? In that case foreach would be much clearer, as foreach is meant for side effects.
foreach is meant for intended side effects. Here the result you are after is the value returned by computeCost; the side effect is just an unfortunate artifact of Spark's library implementation.

Note: This version is modified from the original based on the comments

If you are going to repeat this operation on different data you might want to consider creating a list of clusterers and then using that:

val clusterers = (2 to 10).map(k =>
  new KMeans()
    .setFeaturesCol("scaledFeatures")
    .setPredictionCol("clusters")
    .setSeed(0)
    .setK(k)
)

val costs = clusterers.map(_.fit(scaledDF).computeCost(scaledDF))

But see the answer from @BogdanVakulenko for a good way to re-write the original version.

Also note that it is probably a good idea to use the same k multiple times with different setSeed values to avoid local minima.
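The hazard discussed in the comments can be reproduced with a small sketch: because a builder-style setK mutates the receiver and returns `this`, mapping it over the range yields nine references to one object rather than nine distinct clusterers (MutableClusterer below is a hypothetical stand-in, not Spark's actual class):

```scala
// Hypothetical builder mimicking Spark's KMeans.setK: mutate and return `this`.
class MutableClusterer {
  var k: Int = 2
  def setK(value: Int): MutableClusterer = { k = value; this }
}

val base = new MutableClusterer
val clusterers = (2 to 10).map(base.setK) // every element is the SAME instance

// All nine references now carry the last k that was set.
val ks = clusterers.map(_.k)
```

This is why the answer above constructs a fresh KMeans inside the map instead of reusing one base object.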

4 Comments

(2 to 10).map(baseClusterer.setK) - it would be even more Scala-like
I am not sure how KMeans is implemented, but setK sounds like it mutates the actual object rather than copying it. If that's the case, this approach will end up just doing the same thing 10 times. I think @BogdanVakulenko's answer is better, as it does not make assumptions about the implementation details of .setK
Just looked it up, it does indeed mutate the object, so this does not work. github.com/apache/spark/blob/…
@Dima Thanks for your comments, I have updated the answer to reflect your comments so it should now work correctly!

If I understand your logic correctly, maybe you could use foldLeft, where every iteration returns the modified/updated object, like this:

val finalClusterer = (2 to 10).foldLeft(baseClusterer) { (accum, k) =>
    val newClusterer = accum.copy(k = k)
    newClusterer.fit(scaledDF).computeCost(scaledDF)
    newClusterer
}

That way you would end up with a 'finalClusterer' in which you operated all the time having the base one as origin.

EDIT: My code treats the baseClusterer as a case class, hence the copy method. Since it seems to be a Java class and you don't have copy, maybe you could create an implicit class that acts as a wrapper and define such a method within it, like this:

implicit class ClustererWrapper(underlying: KMeans) {
    def copy(k: Int): KMeans = {
    ...
    }
}
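For what it's worth, here is the foldLeft shape in a self-contained form, with a hypothetical immutable case class standing in for the clusterer (Spark's KMeans is not a case class, so this only sketches the idea):

```scala
// Hypothetical immutable configuration; copy returns a fresh instance.
case class ImmutableClusterer(k: Int = 2)

val base = ImmutableClusterer()

// Each step produces a new object; the accumulator is never mutated.
val finalClusterer = (2 to 10).foldLeft(base) { (accum, k) =>
  accum.copy(k = k)
}
```

With a true immutable value, foldLeft threads each new copy through as the accumulator, so no var or mutation is needed.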

1 Comment

Thanks for the response. KMeans is a scala class in SparkML library.
