Scala for loop and iterators

Question

Let's assume I have a very large iterable collection of values (in the order of 100,000 String entries, read from disk one by one), and I do something on its cartesian product (and write the result back to disk, though I won't show that here):

for(v1 <- values; v2 <- values) yield ((v1, v2), 1)

I understand that this is just another way of writing

values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))

This apparently causes the entire collection for each flatMap iteration (or even the entire cartesian product?) to be kept in memory. If you read the first version using the for loop this obviously is unnecessary. Ideally only two entries (the ones being combined) should be kept in memory at all times.

If I reformulate the first version like this:

for(v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)

memory consumption is a lot lower, leading me to assume that this version must be fundamentally different. What exactly does it do differently in the second version? Why does Scala not implicitly use iterators for the first version? Is there any speedup when not using iterators in some circumstances?

Thanks! (And also thanks to "lmm" who answered an earlier version of this question)

If you yield a ((v1, v2), 1) you build a new collection containing all those tuples. So indeed the entire carthesian product will have to be kept in memory, no? — Jasper-M
– Jasper-M, Commented Dec 10, 2014 at 15:52
Not necessarily, they are written right back to disk (using spark/HDFS). Otherwise it wouldn't scale too well :) — Johannes
– Johannes, Commented Dec 10, 2014 at 16:29

plinth · Accepted Answer · 2014-12-10 15:48:03Z

9

In Scala, yield does not produce a lazy sequence. My understanding is that you get all of the values at once so that you can index them all as a collection. For example, I had written the following for a ray tracer to generate rays:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height; x <- 0 until width)
    yield (x, y, aa((x, y)))
}

which failed spectacularly (out of memory) because it made all the rays up front (surprise!). By using the .iterator method, you're specifically asking for a lazy iterator. The above example can be modified to this:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height iterator; x <- 0 until width iterator)
    yield (x, y, aa((x, y)))
}

which works in a lazy manner.

answered Dec 10, 2014 at 15:48

plinth

49.5k11 gold badges84 silver badges123 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

wheaties Over a year ago

Beat me to it. Yes, this exactly! One comment, though, warn them that explicitly creating the iterator before using it in the for comprehension will not produce the same result they desire.

chiastic-security Over a year ago

Don't you mean that yield doesn't automatically produce a lazy sequence? It does if it's being constructed from something lazy (as you've shown here).

lmm · Accepted Answer · 2014-12-10 15:53:00Z

6

The first version is strictly evaluated; it creates a real, concrete collection with all those values in. The second "just" provides an Iterator, which lets you iterate over all the values; they'll be created as you actually perform the iteration.

The main reason Scala defaults to the first is because scala as a language allows side effects. If you write your two mappings as:

for(v1 <- values; v2 <- values) yield {println("hello"); ((v1, v2), 1)}
for(v1 <- values.iterator; v2 <- values.iterator) yield {
  println("hello"); ((v1, v2), 1)}

then what happens with the second may surprised you, especially in a larger application where the iterator might be created a long way away from where it's actually used.

A collection will perform better than an iterator if the map operation itself is expensive and you create it once and reuse it multiple times - the iterator has to recalculate the values each time, whereas the collection exists in memory. Arguably this makes the collection performance more predictable - it consumes a lot of memory, but it's the same amount whatever the collection is then used for.

If you want a collections library that is more willing to elide operations and optimize - perhaps because you already write all your code to be without side effects - you might want to consider Paul Philips' new effort.

answered Dec 10, 2014 at 15:53

lmm

17.4k3 gold badges28 silver badges37 bronze badges

2 Comments

Johannes Over a year ago

So while the first for comprehension is probably expanded to "values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))", what is the for comprehension using iterators expanded to? The same?

lmm Over a year ago

The same yes, but .flatMap on Iterator has a different implementation to the one on Array.

Collectives™ on Stack Overflow

Scala for loop and iterators

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related