1

What is the best way to find duplicate lines (lines that repeat more than once) in a string that contains more than 10M lines? (I keep the data as one string, rather than an array, to save memory.)

For example:

input:

userId256
userId512
userId64
userId256
userId128
userId128
userId128
userId8
userId4
...

output:

userId256
userId128

I would use split("\n") and then work with the array, but maybe there is a more efficient way to work with such a big string value.

To get unique lines I used:

let result = Array.from(new Set(bigString.split("\n"))).join("\n");

Works fast for 10M lines. Is there any way to get the duplicates using a Set in JS?
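One possible extension of the Set approach (a sketch I have not profiled at 10M lines) is a single pass with two Sets: one recording every line seen so far, the other collecting the repeats:

```javascript
// Single pass: `seen` records every line, `dupes` collects lines seen again.
function findDuplicates(bigString) {
  const seen = new Set();
  const dupes = new Set();
  for (const line of bigString.split("\n")) {
    if (seen.has(line)) dupes.add(line);
    else seen.add(line);
  }
  return [...dupes].join("\n");
}
```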

  • This is in node.js? In like a 'readFile' context? Commented Dec 17, 2019 at 17:42
  • No, it's in browser. I'm using one social media's api to get user ids. Commented Dec 17, 2019 at 17:45
  • Ah, if it was in node through a file I'd potentially have a moderately better solution than splitting. In a browser, I'm not so sure Commented Dec 17, 2019 at 17:47
  • What is the actual goal? Find a list that has dupes? Commented Dec 17, 2019 at 17:49
  • Is your problem about the split method or working with the arrays? Commented Dec 17, 2019 at 17:53

6 Answers

2

I'm not sure the performance is better; you'll need to profile it and compare with array-based solutions.

You can use a RegExp with a capture group, a positive lookahead, and a backreference to find lines that are repeated (regex101). The result will be all lines that have a line with the same content somewhere after them.

However, if a line is repeated more than twice (userId128, for example), the result will include every repetition but the last. To get only unique items, convert to a Set and spread back into an array.

const str = `userId256
userId512
userId64
userId256
userId128
userId128
userId128
userId8
userId4`

const result = [...new Set(str.match(/^(.+)$(?=[\s\S]+\1)/gm))]

console.log(result)


1 Comment

I was thinking \1 was not supported in JavaScript; I was about to check it out, but you beat me to the punch.
0

To avoid intermediate copies of the data, you could create a generator function that will hand you back an iterable:

function* split(data) {
  const re = /(^|\r?\n)(.*?)(?=(\r?\n|$))/g;
  for (;;) {
    const result = re.exec(data);
    if (result) {
      yield result[2];
    } else {
      break;
    }
  }
}

If I'm correctly understanding JavaScript regexes, this will chip lines out of the big string, one at a time.

Then you could supply this iterable as a parameter to the Set constructor for deduplication:

const deduped = new Set(split(data))

However, with the addition of a second generator function:

function* dedupe(iterable) {
  const s = new Set();
  for (const v of iterable) {
    if (!s.has(v)) {
      s.add(v);
      yield v;
    }
  }
}

It's now possible to prune duplicates from the get-go, rather than building a monolithic set of all items up front.

So now you'd write:

 const theIterator = dedupe(split(data))

and you'd be able to pick through each one with a for..of loop (without the up-front cost of creating a huge set/array):

 for(const line of theIterator){
     // do something with line.
 }
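Since the question asks for the duplicated lines rather than the unique ones, a counterpart generator (my sketch, not part of the original answer) could yield each repeated line exactly once, the second time it appears:

```javascript
// Yield each value the second time it is seen, and only once per value,
// so the output contains every duplicated line exactly once.
function* duplicates(iterable) {
  const seen = new Set();
  const reported = new Set();
  for (const v of iterable) {
    if (seen.has(v) && !reported.has(v)) {
      reported.add(v);
      yield v;
    }
    seen.add(v);
  }
}
```

Then `[...duplicates(split(data))]` gives the list of duplicated lines, still without materializing the full line array first.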

Edit (and shameless plug):

My library, blinq, makes it easy to create a histogram from iterable data.

So

 import {blinq} from "blinq"
 //...
 const dataIterable = split(data)
 const histogram = 
     blinq(dataIterable)
         .groupBy(x => x)
         .select(g => ({key:g.key, count:g.count()}))

3 Comments

Thank you! Great idea to get unique elements! But how can I get the list of duplicates using generators?
I would make an iterator, walk all lines, and build a frequency table, like @jerch advised.
@AlexKorzin I have a library that provides some nice syntax for working with iterables: After creating the first iterator function, it would be as easy as blinq(split(data)).groupBy(x=>x).select(g=>({key:g.key,count:g.count()})) Take a look: github.com/biggyspender/blinq
0

EDIT

There isn't a way to magically identify the duplicates without visiting each element; the question is where you want to do that work. This is particularly well done on a backend, since the user experience won't take a hit. If you still want it in the browser, you can use setTimeout or Web Workers to reduce the impact on the user interface.
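As a sketch of the setTimeout idea (the function name and chunk size here are illustrative assumptions, not a tested implementation):

```javascript
// Hypothetical: scan the lines in chunks, yielding to the event loop
// between chunks so the UI thread stays responsive.
function findDuplicatesChunked(lines, chunkSize = 100000) {
  return new Promise(resolve => {
    const seen = new Set();
    const dupes = new Set();
    let i = 0;
    function step() {
      const end = Math.min(i + chunkSize, lines.length);
      for (; i < end; i++) {
        if (seen.has(lines[i])) dupes.add(lines[i]);
        else seen.add(lines[i]);
      }
      if (i < lines.length) setTimeout(step, 0); // let the UI breathe
      else resolve([...dupes]);
    }
    step();
  });
}
```

A Web Worker version would move the same loop off the main thread entirely instead of time-slicing it.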

Actual answer

You should be able to use a reduce for this.

const str = `your big string`;
const data = str.split("\n");
// note: indexOf + includes inside the reduce make this O(n²),
// so it may be slow for very large inputs
let duplicate = data.reduce((acc, currentValue, index, array) => {
  if (array.indexOf(currentValue) != index && !acc.includes(currentValue)) acc.push(currentValue);
  return acc;
}, []);
console.log(`Duplicates are now in the array ${duplicate}`);

1 Comment

I am asking the user now =)
0

Well, that depends on what you want to optimize for (time / space / code complexity) and on some circumstances:

  • easy way:
    Walk all lines and do an exhaustive indexOf search. Whenever you get multiple results, memorize the line content. This needs almost no additional space, but has a bad runtime of O(n²).
  • runtime optimized:
    Walk all lines and build a frequency table. Runs in O(n), but needs a temporary table object where you count the occurrences (worst case ~doubled space).
  • mixture of both:
    Use a Bloom filter. Basically runs in O(n) (+ some additional work if a match was found) and doesn't need much additional space. But note that Bloom filters are tricky to set up; with the wrong storage size or bad hash functions it will drop back to O(n²). This has the highest code complexity, but it might pay off for such a big amount of data. Besides that, a Bloom filter might perform poorly if the probability of duplicates is rather high.
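The runtime-optimized option (the frequency table) can be sketched with a Map; the function name here is mine:

```javascript
// One O(n) pass to count occurrences, then keep the lines counted more than once.
function duplicatesByCount(bigString) {
  const counts = new Map();
  for (const line of bigString.split("\n")) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([line]) => line);
}
```

The Map doubles as the full histogram, so the same pass also answers "how many times does each line occur?".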


0

If it were me, I'd make a dictionary of the keys; the ones that are already present when you try to add them are duplicates.

var f = myString.split('\n');
var dict = {};
for (var i = 0; i < f.length; i++) {
  if (!(f[i] in dict)) {
    dict[f[i]] = 1;
  } else {
    // this is a duplicate; do with it what you will.
    // here, I'll just increment the count.
    dict[f[i]] += 1;
  }
}


0

You can find duplicates by counting items and then filtering by count:

const result = a.reduce((a, c)=> {
    a[c] = a[c] || {count: 0};
    a[c].count +=1;
    return a;
}, {});

const duplicates = Object.entries(result)
    .filter(([k, v]) => v.count > 1 )
    .map(([e])=> e);

An example:

const a = [
  'userId256',
  'userId512',
  'userId64',
  'userId256',
  'userId128',
  'userId128',
  'userId128',
  'userId8',
  'userId4'
];

const result = a.reduce((a, c) => {
  a[c] = a[c] || {data: []};
  a[c].data.push(c);
  return a;
}, {});

const duplicates = Object.entries(result)
  .filter(([k, v]) => v.data.length > 1)
  .map(([e]) => e);

console.log(duplicates);
