11

Let's assume I have a very large iterable collection of values (in the order of 100,000 String entries, read from disk one by one), and I do something on its cartesian product (and write the result back to disk, though I won't show that here):

for(v1 <- values; v2 <- values) yield ((v1, v2), 1)

I understand that this is just another way of writing

values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))

This apparently causes the entire collection for each flatMap iteration (or even the entire cartesian product?) to be kept in memory. If you read the first version using the for loop this obviously is unnecessary. Ideally only two entries (the ones being combined) should be kept in memory at all times.

If I reformulate the first version like this:

for(v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)

memory consumption is a lot lower, leading me to assume that this version must be fundamentally different. What exactly does it do differently in the second version? Why does Scala not implicitly use iterators for the first version? Is there any speedup when not using iterators in some circumstances?

Thanks! (And also thanks to "lmm" who answered an earlier version of this question)

2
  • If you yield a ((v1, v2), 1) you build a new collection containing all those tuples. So indeed the entire carthesian product will have to be kept in memory, no? Commented Dec 10, 2014 at 15:52
  • Not necessarily, they are written right back to disk (using spark/HDFS). Otherwise it wouldn't scale too well :) Commented Dec 10, 2014 at 16:29

2 Answers 2

9

In Scala, yield does not produce a lazy sequence. My understanding is that you get all of the values at once so that you can index them all as a collection. For example, I had written the following for a ray tracer to generate rays:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height; x <- 0 until width)
    yield (x, y, aa((x, y)))
}

which failed spectacularly (out of memory) because it made all the rays up front (surprise!). By using the .iterator method, you're specifically asking for a lazy iterator. The above example can be modified to this:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height iterator; x <- 0 until width iterator)
    yield (x, y, aa((x, y)))
}

which works in a lazy manner.

Sign up to request clarification or add additional context in comments.

2 Comments

Beat me to it. Yes, this exactly! One comment, though, warn them that explicitly creating the iterator before using it in the for comprehension will not produce the same result they desire.
Don't you mean that yield doesn't automatically produce a lazy sequence? It does if it's being constructed from something lazy (as you've shown here).
6

The first version is strictly evaluated; it creates a real, concrete collection with all those values in. The second "just" provides an Iterator, which lets you iterate over all the values; they'll be created as you actually perform the iteration.

The main reason Scala defaults to the first is because scala as a language allows side effects. If you write your two mappings as:

for(v1 <- values; v2 <- values) yield {println("hello"); ((v1, v2), 1)}
for(v1 <- values.iterator; v2 <- values.iterator) yield {
  println("hello"); ((v1, v2), 1)}

then what happens with the second may surprised you, especially in a larger application where the iterator might be created a long way away from where it's actually used.

A collection will perform better than an iterator if the map operation itself is expensive and you create it once and reuse it multiple times - the iterator has to recalculate the values each time, whereas the collection exists in memory. Arguably this makes the collection performance more predictable - it consumes a lot of memory, but it's the same amount whatever the collection is then used for.

If you want a collections library that is more willing to elide operations and optimize - perhaps because you already write all your code to be without side effects - you might want to consider Paul Philips' new effort.

2 Comments

So while the first for comprehension is probably expanded to "values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))", what is the for comprehension using iterators expanded to? The same?
The same yes, but .flatMap on Iterator has a different implementation to the one on Array.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.