To avoid intermediate copies of the data, you could create a generator function that will hand you back an iterable:
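    function* split(data) {
      // Match only the line terminators: every match consumes at least
      // one character, so exec() always advances through the string.
      const re = /\r?\n/g;
      let start = 0;
      let result;
      while ((result = re.exec(data)) !== null) {
        yield data.slice(start, result.index);
        start = re.lastIndex;
      }
      yield data.slice(start); // text after the last terminator
    }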
If I'm understanding JavaScript regex correctly, this will chip lines out of the big string one at a time, without ever building an array of all the lines.
You could then supply this iterable as a parameter to the Set constructor for deduplication:
    const deduped = new Set(split(data));
However, with the addition of a second generator function:
    function* dedupe(iterable) {
      const s = new Set();
      for (const v of iterable) {
        if (!s.has(v)) {
          s.add(v);
          yield v;
        }
      }
    }
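It's now possible to prune duplicates lazily, as you iterate, rather than needing to build a monolithic set with many items up front. (The set inside dedupe still ends up holding every distinct line, but it only grows as fast as you consume the output.)
So now you'd write
    const theIterator = dedupe(split(data));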
and you'd be able to walk through the unique lines with a for...of loop (without the up-front cost of creating a huge set/array). Note that a generator can only be iterated once:
    for (const line of theIterator) {
      // do something with line.
    }
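To see the laziness pay off, here is a small self-contained sketch that stops after the first few unique lines. The two generators are restated so the snippet runs on its own; split is written with a simple terminator regex, which is an assumption of this sketch, and the sample string and the firstUnique helper are made up for illustration.

```javascript
// Hypothetical sample input; any newline-separated string works.
const sample = "apple\nbanana\r\napple\ncherry\nbanana";

// Yield one line at a time by locating each line terminator.
function* split(data) {
  const re = /\r?\n/g;
  let start = 0;
  let result;
  while ((result = re.exec(data)) !== null) {
    yield data.slice(start, result.index);
    start = re.lastIndex;
  }
  yield data.slice(start); // text after the last terminator
}

// Yield each value the first time it is seen, skipping repeats.
function* dedupe(iterable) {
  const seen = new Set();
  for (const v of iterable) {
    if (!seen.has(v)) {
      seen.add(v);
      yield v;
    }
  }
}

// Hypothetical helper: collect at most n unique lines, then stop.
function firstUnique(data, n) {
  const out = [];
  if (n <= 0) return out;
  for (const line of dedupe(split(data))) {
    out.push(line);
    if (out.length >= n) break; // the rest of the string is never scanned
  }
  return out;
}

console.log(firstUnique(sample, 2));        // [ 'apple', 'banana' ]
console.log([...dedupe(split(sample))]);    // [ 'apple', 'banana', 'cherry' ]
```

Breaking out of the for...of loop closes both generators, so nothing past the last line you asked for gets split or stored.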
Edit (and shameless plug):
My library, blinq, makes it easy to create a histogram from iterable data.
For example:
    import {blinq} from "blinq"
    //...
    const dataIterable = split(data)
    const histogram =
      blinq(dataIterable)
        .groupBy(x => x)
        .select(g => ({key: g.key, count: g.count()}))
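If you'd rather not add a dependency, the same kind of histogram can be sketched with a plain Map. This is not how blinq does it internally, just a minimal stand-in; the name countOccurrences and the sample data are made up for illustration.

```javascript
// Count how many times each item appears in any iterable.
// Works with arrays, Sets, and generators such as split(data).
function countOccurrences(iterable) {
  const counts = new Map();
  for (const item of iterable) {
    counts.set(item, (counts.get(item) || 0) + 1);
  }
  return counts;
}

// Hypothetical sample data standing in for the lines of the big string.
const lines = ["a", "b", "a", "c", "a"];
const counts = countOccurrences(lines);

for (const [key, count] of counts) {
  console.log(key, count);
}
```

A Map keeps keys in first-seen order, so the output follows the order in which each distinct line first appeared.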