I have a collection which holds more than 15 million documents. Out of those 15 million documents I update 20k records every hour. But update query takes a long time to finish (30 min around).
Document:
{ "inst" : "instance1", "dt": "2015-12-12T00:00:000Z", "count": 10}
I have an array which holds 20k instances to be updated.
My Query looks like this:
For h in hourly filter h.dt == DATE_ISO8601(14501160000000)
For i in instArr
filter i.inst == h.inst
update h with {"inst":i.inst, "dt":i.dt, "count":i.count} in hourly
Is there any optimized way of doing this. I have hash indexing on inst and skiplist indexing on dt.
Update
I could not use 20k inst in the query manually so following is the execution plan for just 2 inst:
FOR r in hourly FILTER r.dt == DATE_ISO8601(1450116000000) FOR i IN
[{"inst":"0e649fa22bcc5200d7c40f3505da153b", "dt":"2015-12-14T18:00:00.000Z"}, {}] FILTER i.inst ==
r.inst UPDATE r with {"inst":i.inst, "dt": i.dt, "max":i.max, "min":i.min, "sum":i.sum, "avg":i.avg,
"samples":i.samples} in hourly OPTIONS { ignoreErrors: true } RETURN NEW.inst
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
5 CalculationNode 1 - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ] /* json expression */ /* const assignment */
13 IndexRangeNode 103067 - FOR r IN hourly /* skiplist index scan */
6 EnumerateListNode 206134 - FOR i IN #6 /* list iteration */
7 CalculationNode 206134 - LET #8 = i.`inst` == r.`inst` /* simple expression */ /* collections used: r : hourly */
8 FilterNode 206134 - FILTER #8
9 CalculationNode 206134 - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` } /* simple expression */
10 UpdateNode 206134 - UPDATE r WITH #10 IN hourly
11 CalculationNode 206134 - LET #12 = $NEW.`inst` /* attribute expression */
12 ReturnNode 206134 - RETURN #12
Indexes used:
Id Type Collection Unique Sparse Selectivity Est. Fields Ranges
13 skiplist hourly false false n/a `dt` [ `dt` == "2015-12-14T18:00:00.000Z" ]
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 remove-data-modification-out-variables
6 use-index-range
7 remove-filter-covered-by-index
Write query options:
Option Value
ignoreErrors true
waitForSync false
nullMeansRemove false
mergeObjects true
ignoreDocumentNotFound false
readCompleteInput true
instArris the mentioned array with 20k instances? Are the array values known when the query starts? Or is it calculated somewhere in the query and not shown? Are the array values unique? Does the execution plan show the query uses indexes, and which?require("org/arangodb/aql/explainer").explain(queryString);in the ArangoShell. If there are bind parameters in the query, you can userequire("org/arangodb/aql/explainer").explain({ query: queryString, bindVars: bindVars });. This should be same for 2.5 and newer versions.