Let's assume I have a very large iterable collection of values (on the order of 100,000 String entries, read from disk one by one), and I do something with its Cartesian product (and write the result back to disk, though I won't show that here):
for(v1 <- values; v2 <- values) yield ((v1, v2), 1)
I understand that this is just another way of writing
values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))
This apparently causes the entire collection for each flatMap iteration (or even the entire Cartesian product?) to be kept in memory. Reading the first version, with the for loop, this seems unnecessary: ideally only the two entries being combined should be in memory at any time.
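To make the equivalence of the two forms concrete, here is a minimal sketch. It assumes a small in-memory List as a stand-in for the real data, and the names DesugarDemo, viaFor and viaFlatMap are made up for illustration:

```scala
object DesugarDemo {
  // Hypothetical stand-in for the real data, which is read from disk.
  val values: List[String] = List("a", "b", "c")

  // For-comprehension form.
  val viaFor: List[((String, String), Int)] =
    for (v1 <- values; v2 <- values) yield ((v1, v2), 1)

  // Explicit flatMap/map form.
  val viaFlatMap: List[((String, String), Int)] =
    values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))

  def main(args: Array[String]): Unit = {
    // Both forms produce the same fully built List of n * n tuples.
    println(viaFor == viaFlatMap)
    println(viaFor.size)
  }
}
```

With a List, both expressions return a complete List of all n * n tuples, so the result itself is a materialized collection.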
If I reformulate the first version like this:
for(v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)
memory consumption is a lot lower, leading me to assume that this version must be fundamentally different. What exactly does the iterator version do differently? Why does Scala not implicitly use iterators for the first version? Are there circumstances in which not using iterators is faster?
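One way to observe the difference between the two versions is to count how many tuples have actually been constructed at each point. This is only a sketch under assumptions: a small List stands in for the on-disk data, and LazyDemo, pair, built and counts are hypothetical names, not part of my real code:

```scala
object LazyDemo {
  // Hypothetical stand-in for the real on-disk data.
  val values: List[String] = List("a", "b", "c")

  // Counts how many result tuples have been constructed so far.
  private var built = 0

  private def pair(v1: String, v2: String): ((String, String), Int) = {
    built += 1
    ((v1, v2), 1)
  }

  // Returns (tuples built by the strict version,
  //          tuples built right after creating the iterator version,
  //          tuples built after pulling one element from the iterator).
  def counts(): (Int, Int, Int) = {
    built = 0
    // Strict: List.flatMap builds the whole result immediately.
    val strict = for (v1 <- values; v2 <- values) yield pair(v1, v2)
    val afterStrict = built

    built = 0
    // Iterator: flatMap/map only wrap the source iterators; nothing runs yet.
    val it = for (v1 <- values.iterator; v2 <- values.iterator) yield pair(v1, v2)
    val afterCreate = built
    it.next() // pulls exactly one tuple through the pipeline
    val afterOne = built

    (afterStrict, afterCreate, afterOne)
  }

  def main(args: Array[String]): Unit =
    println(counts())
}
```

In this sketch the strict version constructs every tuple before returning, while the iterator version constructs a tuple only when one is requested.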
Thanks! (And also thanks to "lmm" who answered an earlier version of this question)
((v1, v2), 1) you build a new collection containing all those tuples. So indeed the entire Cartesian product will have to be kept in memory, no?