1

What is the best way to find duplicate lines (lines that repeat more than once) in a string that contains more than 10M lines? (I keep the data as one string, rather than an array, to save memory.)

For example:

input:

userId256
userId512
userId64
userId256
userId128
userId128
userId128
userId8
userId4
...

output:

userId256
userId128

I would use split("\n") and then work with the array, but maybe there is a more efficient way to work with such a big string value.

To get unique lines I used:

let result = Array.from(new Set(bigString.split("\n"))).join("\n");

Works fast for 10M lines. Is there any way to get the duplicates using a Set in JS?
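One possible extension of the Set approach (a sketch I have not profiled at 10M lines) is a single pass with two Sets: one recording every line seen so far, the other collecting the repeats:

```javascript
// Single pass: `seen` records every line, `dupes` collects lines seen again.
function findDuplicates(bigString) {
  const seen = new Set();
  const dupes = new Set();
  for (const line of bigString.split("\n")) {
    if (seen.has(line)) dupes.add(line);
    else seen.add(line);
  }
  return [...dupes].join("\n");
}
```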

  • This is in node.js? In like a 'readFile' context? Commented Dec 17, 2019 at 17:42
  • No, it's in browser. I'm using one social media's api to get user ids. Commented Dec 17, 2019 at 17:45
  • Ah, if it was in node through a file I'd potentially have a moderately better solution than splitting. In a browser, I'm not so sure Commented Dec 17, 2019 at 17:47
  • What is the actual goal? Find a list that has dupes? Commented Dec 17, 2019 at 17:49
  • Is your problem about the split method or working with the arrays? Commented Dec 17, 2019 at 17:53

6 Answers

2

I'm not sure the performance is better; you'll need to profile it and compare with array-based solutions.

You can use a RegExp with a capture group, a positive lookahead, and a backreference to find lines that are repeated (regex101). The result will be all lines that have a line with the same content somewhere after them.

However, if a line is repeated more than twice (userId128, for example), the result will include every repetition but the last. To get only unique items, convert to a Set and spread back into an array.

const str = `userId256
userId512
userId64
userId256
userId128
userId128
userId128
userId8
userId4`

const result = [...new Set(str.match(/^(.+)$(?=[\s\S]+\1)/gm))]

console.log(result)


1 Comment

I was thinking \1 was not supported in JavaScript; I was about to check it out, but you beat me to the punch.
0

To avoid intermediate copies of the data, you could create a generator function that will hand you back an iterable:

function* split(data) {
  const re = /(^|\r?\n)(.*?)(?=(\r?\n|$))/g;
  for (;;) {
    const result = re.exec(data);
    if (result) {
      yield result[2];
    } else {
      break;
    }
  }
}

If I'm correctly understanding JavaScript regexes, this will chip lines out of the big string, one at a time.

Then you could supply this iterable as a parameter to the Set constructor for deduplication:

const deduped = new Set(split(data))

However, with the addition of a second generator function:

function* dedupe(iterable) {
  const s = new Set();
  for (const v of iterable) {
    if (!s.has(v)) {
      s.add(v);
      yield v;
    }
  }
}

It's now possible to prune duplicates from the get-go, rather than building a monolithic set of all items up front.

So now you'd write:

 const theIterator = dedupe(split(data))

and you'd be able to pick through each one with a for..of loop (without the up-front cost of creating a huge set/array):

 for(const line of theIterator){
     // do something with line.
 }
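Since the question asks for the duplicated lines rather than the unique ones, a counterpart generator (my sketch, not part of the original answer) could yield each repeated line exactly once, the second time it appears:

```javascript
// Yield each value the second time it is seen, and only once per value,
// so the output contains every duplicated line exactly once.
function* duplicates(iterable) {
  const seen = new Set();
  const reported = new Set();
  for (const v of iterable) {
    if (seen.has(v) && !reported.has(v)) {
      reported.add(v);
      yield v;
    }
    seen.add(v);
  }
}
```

Then `[...duplicates(split(data))]` gives the list of duplicated lines, still without materializing the full line array first.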

Edit (and shameless plug):

My library, blinq, makes it easy to create a histogram from iterable data.

So

 import {blinq} from "blinq"
 //...
 const dataIterable = split(data)
 const histogram = 
     blinq(dataIterable)
         .groupBy(x => x)
         .select(g => ({key:g.key, count:g.count()}))

3 Comments

Thank you! Great idea to get unique elements! But how can I get the list of duplicates using generators?
I would make an iterator, walk all lines, and build a frequency table, like @jerch advised.
@AlexKorzin I have a library that provides some nice syntax for working with iterables: After creating the first iterator function, it would be as easy as blinq(split(data)).groupBy(x=>x).select(g=>({key:g.key,count:g.count()})) Take a look: github.com/biggyspender/blinq
0

EDIT

There isn't a way to magically identify the duplicates without visiting each element; the question is where you want to do that work. This is particularly well done on a backend, since the user experience won't take a hit. If you still want it in the browser, you can use setTimeout or Web Workers to reduce the impact on the user interface.
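As a sketch of the setTimeout idea (the function name and chunk size here are illustrative assumptions, not a tested implementation):

```javascript
// Hypothetical: scan the lines in chunks, yielding to the event loop
// between chunks so the UI thread stays responsive.
function findDuplicatesChunked(lines, chunkSize = 100000) {
  return new Promise(resolve => {
    const seen = new Set();
    const dupes = new Set();
    let i = 0;
    function step() {
      const end = Math.min(i + chunkSize, lines.length);
      for (; i < end; i++) {
        if (seen.has(lines[i])) dupes.add(lines[i]);
        else seen.add(lines[i]);
      }
      if (i < lines.length) setTimeout(step, 0); // let the UI breathe
      else resolve([...dupes]);
    }
    step();
  });
}
```

A Web Worker version would move the same loop off the main thread entirely instead of time-slicing it.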

Actual answer

You should be able to use a reduce for this.

const str = `your big string`;
const data = str.split("\n");
// note: indexOf + includes inside the reduce make this O(n²),
// so it may be slow for very large inputs
let duplicate = data.reduce((acc, currentValue, index, array) => {
  if (array.indexOf(currentValue) != index && !acc.includes(currentValue)) acc.push(currentValue);
  return acc;
}, []);
console.log(`Duplicates are now in the array ${duplicate}`);

1 Comment

I am asking the user now =)
0

Well, that depends on what you want to optimize for (time / space / code complexity) and on some circumstances:

  • easy way:
    Walk all lines and do an exhaustive indexOf search. Whenever you get multiple results, memorize the line content. This needs almost no additional space, but has a bad runtime of O(n²).
  • runtime optimized:
    Walk all lines and build a frequency table. Runs in O(n), but needs a temporary table object where you count the occurrences (worst case ~doubled space).
  • mixture of both:
    Use a Bloom filter. Basically runs in O(n) (+ some additional work if a match was found) and doesn't need much additional space. But note that Bloom filters are tricky to set up; with the wrong storage size or bad hash functions it will drop back to O(n²). This has the highest code complexity, but it might pay off for such a big amount of data. Besides that, a Bloom filter might perform poorly if the probability of duplicates is rather high.
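The runtime-optimized option (the frequency table) can be sketched with a Map; the function name here is mine:

```javascript
// One O(n) pass to count occurrences, then keep the lines counted more than once.
function duplicatesByCount(bigString) {
  const counts = new Map();
  for (const line of bigString.split("\n")) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([line]) => line);
}
```

The Map doubles as the full histogram, so the same pass also answers "how many times does each line occur?".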


0

If it were me, I'd make a dictionary of the keys; the ones that are already present when you try to add them are duplicates.

var f = myString.split('\n');
var dict = {};
for (var i = 0; i < f.length; i++) {
  if (!(f[i] in dict)) {
    dict[f[i]] = 1;
  } else {
    // this is a duplicate; do with it what you will.
    // here, I'll just increment the count.
    dict[f[i]] += 1;
  }
}


0

You can find duplicates by counting items and then filtering by count:

const result = a.reduce((a, c)=> {
    a[c] = a[c] || {count: 0};
    a[c].count +=1;
    return a;
}, {});

const duplicates = Object.entries(result)
    .filter(([k, v]) => v.count > 1 )
    .map(([e])=> e);

An example:

const a = [
  'userId256',
  'userId512',
  'userId64',
  'userId256',
  'userId128',
  'userId128',
  'userId128',
  'userId8',
  'userId4'
];

const result = a.reduce((a, c) => {
  a[c] = a[c] || {data: []};
  a[c].data.push(c);
  return a;
}, {});

const duplicates = Object.entries(result)
  .filter(([k, v]) => v.data.length > 1)
  .map(([e]) => e);

console.log(duplicates);
