
I have a piece of code that simply loops through two arrays: for each element of the first array, it finds the relevant element in the second array, changes only the first occurrence, and deletes the remaining ones.

/**
 * The aggregation data structure:
 * "_id": {
 *   "geometry": geometry,
 *   "dups": [
 *     "5b3b25b4e54029249c459bfc",  // keep only the first element in allDocs
 *     "5b3b25b4e54029249c459e65",  // delete it from allDocs
 *     "5b3b25b4e54029249c459d7d"   // delete it from allDocs
 *   ],
 *   "dupsProp": [ ],               // array of all properties of duplicatePoints
 *   "count": 3
 * }
 */
var aggregationRes = [/* 46,000 objects */];
var allDocs = [/* 345,000 objects */];

aggregationRes.forEach(function (resElem, counter) {
    console.log(counter + "/" + aggregationRes.length);
    // Delete objects in allDocs based on the dups array, except the first one
    var foundIndex = allDocs.findIndex(x => x._id.toString() == resElem.dups[0]);
    // Assign the merged properties
    allDocs[foundIndex].properties = resElem.dupsProp;
    // Delete the remaining ids in allDocs from the dups array
    resElem.dups.forEach(function (dupElem, index) {
        var tmpFoundIndex = allDocs.findIndex(x => x._id.toString() == resElem.dups[index + 1]);
        if (tmpFoundIndex !== -1) {
            allDocs.splice(tmpFoundIndex, 1);
        }
    });
});

This script runs for almost 4 hours. As you can see, the calculations are really simple, but since the allDocs array is big, it takes quite a long time. It would be great if someone could give me a hint on how to decrease the computation time. Thanks in advance.

5 Comments

  • Make allDocs indexed by id, so that you don't need to findIndex every time. And avoid doing many small splices on big arrays; rather, mark them for deletion and then remove them all in one pass (see the sketch after these comments). Commented Aug 30, 2018 at 13:35
  • (Also: use a database for such large data :-) Commented Aug 30, 2018 at 13:36
  • Instead of doing operations like x._id.toString() 46K * 345K * (dupecount + 1) times, iterate the allDocs array once, creating a new property which is this x._id.toString() (345K operations only). Is x._id the entire object, or some numeric/string ID? Commented Aug 30, 2018 at 13:39
  • Would it be possible to simply do resElem.dups = [resElem.dups[0]]; instead of the resElem.dups.forEach iteration? You want to end up with a new array that only contains the first entry in the old array, right? Seems simple enough. Commented Aug 30, 2018 at 13:43
  • Your aggregation data structure example is wrong/misleading. It looks like the _id property is the entire inner object. Is this node.js, and are you using mongodb (the first search result for "javascript ObjectID")? You might want to add those tags to your question if so. I would imagine there is a much more efficient way to do this using your database. Commented Aug 30, 2018 at 14:27
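
A minimal sketch of the first two suggestions combined, roughly the approach the accepted answer below fleshes out (the array shapes are assumed from the question):

    // Sketch only: index documents by stringified _id once, mark duplicates
    // for deletion, then drop them all in a single filter pass.
    var byId = {};
    allDocs.forEach(function (doc) {
        byId[doc._id.toString()] = doc;           // 345K toString calls in total
    });

    var toDelete = {};
    aggregationRes.forEach(function (resElem) {
        byId[resElem.dups[0]].properties = resElem.dupsProp;  // keep the first one
        resElem.dups.slice(1).forEach(function (dupId) {
            toDelete[dupId] = true;                           // mark, don't splice
        });
    });

    allDocs = allDocs.filter(function (doc) {
        return !toDelete[doc._id.toString()];     // one linear pass, no splices
    });

This replaces the repeated O(n) findIndex scans with constant-time hash lookups and turns the many small splices into one linear pass.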

1 Answer


Taking the ideas from Bergi, we index the documents by id to avoid having to find the indexes, which is expensive:

var allDocs = [/* 345,000 objects */];
var aggregationRes = [/* 46,000 objects */];
var allDocsIndexed = {};

// Build the index once: a single toString per document
allDocs.forEach(function (doc) {
    allDocsIndexed[doc._id.toString()] = doc;
});

aggregationRes.forEach(function (resElem, counter) {
    // Keep the first duplicate and assign the merged properties
    allDocsIndexed[resElem.dups[0]].properties = resElem.dupsProp;
    // Remove the remaining duplicates from the index
    for (var i = 1; i < resElem.dups.length; i++) {
        delete allDocsIndexed[resElem.dups[i]];
    }
});

// Keep only the documents that were not deleted from the index
var allUndeletedDocs = allDocs.filter(doc => allDocsIndexed.hasOwnProperty(doc._id.toString()));

Note that for JavaScript this is an efficient solution, but with more details provided, a better one might exist using MongoDB features.
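
For instance, if these documents live in MongoDB (as the comments suggest), the merging and deleting could happen server-side without loading 345,000 documents into the application at all. A rough sketch, assuming the official Node.js mongodb driver; the connection string, database, and collection names are placeholders, not details from the question:

    // Sketch only: "mydb" and "points" are assumed names.
    const { MongoClient, ObjectId } = require("mongodb");

    async function dedupe(uri, aggregationRes) {
        const client = await MongoClient.connect(uri);
        const points = client.db("mydb").collection("points"); // assumed names

        const ops = [];
        for (const resElem of aggregationRes) {
            // Keep the first duplicate: write the merged properties onto it
            ops.push({
                updateOne: {
                    filter: { _id: new ObjectId(resElem.dups[0]) },
                    update: { $set: { properties: resElem.dupsProp } }
                }
            });
            // Delete the remaining duplicates in one server-side operation
            ops.push({
                deleteMany: {
                    filter: { _id: { $in: resElem.dups.slice(1).map(id => new ObjectId(id)) } }
                }
            });
        }

        // An unordered bulk write lets the server execute the batch efficiently
        await points.bulkWrite(ops, { ordered: false });
        await client.close();
    }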


2 Comments

Hey, your solution was perfect, unbelievable! It was super fast: in less than a minute it did the whole calculation that used to take me 3 to 4 hours. I did not know what people usually meant by indexing an array, but now, with your example, I understand it very well. Thanks a lot. Just a quick question: I did not use the last line (the filtering) because I already have all the results. Why did you actually filter them?
@MaryamKoulaei We only deleted the keys from allDocsIndexed; we never deleted anything from allDocs. So instead of deleting, I just filter to get the ones that were not deleted.
