
I have a collection which holds more than 15 million documents. Out of those 15 million documents I update 20k records every hour, but the update query takes a long time to finish (around 30 minutes).

Document:

{ "inst" : "instance1", "dt": "2015-12-12T00:00:00.000Z", "count": 10}

I have an array which holds 20k instances to be updated.

My Query looks like this:

FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FOR i IN instArr
    FILTER i.inst == h.inst
    UPDATE h WITH { "inst": i.inst, "dt": i.dt, "count": i.count } IN hourly

Is there an optimized way of doing this? I have a hash index on inst and a skiplist index on dt.

Update

I could not put all 20k inst values into the query manually, so the following is the execution plan for just 2 inst values:

FOR r in hourly FILTER r.dt == DATE_ISO8601(1450116000000) FOR i IN 
 [{"inst":"0e649fa22bcc5200d7c40f3505da153b", "dt":"2015-12-14T18:00:00.000Z"}, {}] FILTER i.inst == 
 r.inst UPDATE r with {"inst":i.inst, "dt": i.dt, "max":i.max, "min":i.min, "sum":i.sum, "avg":i.avg, 
 "samples":i.samples} in hourly OPTIONS { ignoreErrors: true } RETURN NEW.inst

Execution plan:
 Id   NodeType              Est.   Comment
  1   SingletonNode            1   * ROOT
  5   CalculationNode          1     - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ]   /* json expression */   /* const assignment */
 13   IndexRangeNode      103067     - FOR r IN hourly   /* skiplist index scan */
  6   EnumerateListNode   206134       - FOR i IN #6   /* list iteration */
  7   CalculationNode     206134         - LET #8 = i.`inst` == r.`inst`   /* simple expression */   /* collections used: r : hourly */
  8   FilterNode          206134         - FILTER #8
  9   CalculationNode     206134         - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` }   /* simple expression */
 10   UpdateNode          206134         - UPDATE r WITH #10 IN hourly
 11   CalculationNode     206134         - LET #12 = $NEW.`inst`   /* attribute expression */
 12   ReturnNode          206134         - RETURN #12

Indexes used:
 Id   Type       Collection   Unique   Sparse   Selectivity Est.   Fields   Ranges
 13   skiplist   hourly       false    false                 n/a   `dt`     [ `dt` == "2015-12-14T18:00:00.000Z" ]

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   remove-data-modification-out-variables
  6   use-index-range
  7   remove-filter-covered-by-index

Write query options:
 Option                   Value
 ignoreErrors             true
 waitForSync              false
 nullMeansRemove          false
 mergeObjects             true
 ignoreDocumentNotFound   false
 readCompleteInput        true

  • I guess instArr is the mentioned array with 20k instances? Are the array values known when the query starts? Or is it calculated somewhere in the query and not shown? Are the array values unique? Does the execution plan show the query uses indexes, and which? Commented Dec 16, 2015 at 10:19
  • instArr is known before the query starts. It's an array of unique values and its length is 20k. I am using ArangoDB 2.5.7 and cannot upgrade from that. I didn't try the execution plan. Most of the documentation on execution plans is for the latest version. Not sure which command to run in 2.5.7 for the execution plan. Commented Dec 16, 2015 at 13:58
  • The execution plan for a query can be retrieved via require("org/arangodb/aql/explainer").explain(queryString); in the ArangoShell. If there are bind parameters in the query, you can use require("org/arangodb/aql/explainer").explain({ query: queryString, bindVars: bindVars });. This should be same for 2.5 and newer versions. Commented Dec 16, 2015 at 19:20
  • I Edited the post with execution plan. Commented Dec 16, 2015 at 21:23
  • Did the answer suit your needs? If, would you mind marking it as accepted? Or else, whats missing? Commented Jan 11, 2016 at 12:20

1 Answer


I assume the selection part (not the update part) will be the bottleneck in this query.

The query seems problematic because for each document matching the first filter (h.dt == DATE_ISO8601(...)), there will be an iteration over the 20,000 values in the instArr array. If the instArr values are unique, then at most one value from it will match. Additionally, no index will be used for the inner loop, as the index selection has happened in the outer loop already.

Instead of looping over all values in instArr, it will be better to turn the accompanying == comparison into an IN comparison. That would already work if instArr were an array of instance names, but it seems to be an array of instance objects (consisting of at least the attributes inst and count). In order to use the instance names in an IN comparison, it is better to have a dedicated array of instance names, and a translation table for the count and dt values.

Following is an example for generating these with JavaScript:

var instArr = [ ], trans = { };
for (var i = 0; i < 20000; ++i) {
  var instance = "instance" + i;
  var count = Math.floor(Math.random() * 10);
  var dt = (new Date(Date.now() - Math.floor(Math.random() * 10000))).toISOString();
  instArr.push(instance);
  trans[instance] = [ count, dt ];
}

instArr would then look like this:

[ "instance0", "instance1", "instance2", ... ]

and trans:

{ 
  "instance0" : [ 4, "2015-12-16T21:24:45.106Z" ], 
  "instance1" : [ 0, "2015-12-16T21:24:39.881Z" ],
  "instance2" : [ 2, "2015-12-16T21:25:47.915Z" ],
  ...
}

These data can then be injected into the query using bind variables (named like the variables above):

FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FILTER h.inst IN @instArr 
  RETURN @trans[h.inst]

Note that ArangoDB 2.5 does not yet support the @trans[h.inst] syntax. In that version, you will need to write:

LET trans = @trans
FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FILTER h.inst IN @instArr 
  RETURN trans[h.inst]

Additionally, 2.5 has a problem with longer IN lists. IN-list performance decreases quadratically with the length of the IN list. So in this version, it will make sense to limit the length of instArr to at most 2,000 values. That may require issuing multiple queries with smaller IN lists instead of just one with a big IN list.
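A minimal sketch of that batching workaround (the chunk helper is illustrative, and queryString stands for the 2.5 query shown above; the commented db._query call is how one query per batch would be issued in arangosh):

```javascript
// Split an array into chunks of at most `size` elements, so that each
// query's IN list stays below the problematic length in 2.5.
function chunk(arr, size) {
  var result = [];
  for (var i = 0; i < arr.length; i += size) {
    result.push(arr.slice(i, i + size));
  }
  return result;
}

// example: 20,000 instance names, as generated above
var instArr = [];
for (var i = 0; i < 20000; ++i) {
  instArr.push("instance" + i);
}

var batches = chunk(instArr, 2000); // 10 batches of 2,000 names each

// in arangosh, one query would then be issued per batch, e.g.:
// batches.forEach(function (batch) {
//   db._query(queryString, { instArr: batch, trans: trans });
// });
```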

The better alternative would be to use ArangoDB 2.6, 2.7 or 2.8, which do not have that problem, and thus do not require the workaround. Apart from that, you can get away with the slightly shorter version of the query in the newer ArangoDB versions.

Also note that in all of the above examples I used a RETURN ... instead of the UPDATE statement from the original query. This is because all my tests revealed that the selection part of the query is the major problem, at least with the data I had generated. A final note on the original version of the UPDATE: updating each document's inst value with i.inst seems redundant, because i.inst == h.inst, so the value won't change.
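For completeness, a sketch of how the 2.5 variant of the query could carry out the actual UPDATE via the translation table (assuming the [ count, dt ] array layout chosen above, and leaving the redundant inst attribute untouched):

LET trans = @trans
FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FILTER h.inst IN @instArr 
  UPDATE h WITH { "count": trans[h.inst][0], "dt": trans[h.inst][1] } IN hourly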


1 Comment

Looks like I wrote and posted my answer in parallel to your question's update... But it looks like my assumptions were correct: the query uses the skiplist index on dt for the outer loop, and will iterate over the 20K instArr values inside. This should definitely be changed to an IN lookup as suggested in my answer.
