
I am trying to perform a regex query on an array of strings in a MongoDB collection. I could only find this limitation in the docs:

$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.
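
For illustration, here is a minimal sketch of that limitation (the collection name demo, the field name, and the sample value are made up): an anchored, case-sensitive regex derives tight index bounds from its literal prefix, while an unanchored one has to consider the whole index range:

> db.demo.insert({name: 'alpha'})
> db.demo.ensureIndex({name: 1})
> db.demo.find({name: /^alp/}).explain().indexBounds   // anchored prefix: bounds roughly ["alp", "alq")
> db.demo.find({name: /alp/}).explain().indexBounds    // unanchored: bounds span the entire index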

Let's make a test:

> for (var i=0; i<100000; i++) db.test.insert({f: ['a_0_'+i, 'a_1_2']})
> db.test.count()
100000
> db.test.ensureIndex({f: 1})
> db.test.find({f: /^a_(0)?_12$/ })
{ "_id" : ObjectId("514ac59886f004fe03ef2a96"), "f" : [ "a_0_12", "a_1_2" ] }
> db.test.find({f: /^a_(0)?_12$/ }).explain()
{
    "cursor" : "BtreeCursor f_1 multi",
    "isMultiKey" : true,
    "n" : 1,
    "nscannedObjects" : 200000,
    "nscanned" : 200000,
    "nscannedObjectsAllPlans" : 200000,
    "nscannedAllPlans" : 200000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 482,
    "indexBounds" : {
        "f" : [
            [
                "a_",
                "a`"
            ],
            [
                /^a_(0)?_12$/,
                /^a_(0)?_12$/
            ]
        ]
    },
    "server" : "someserver:27017"
}

The query is slow. On the other hand, this query is optimal (but doesn't suit my use case):

> db.test.find({f: 'a_0_12' }).explain()
{
    "cursor" : "BtreeCursor f_1",
    "isMultiKey" : true,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "nscannedObjectsAllPlans" : 1,
    "nscannedAllPlans" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "f" : [
            [
                "a_0_12",
                "a_0_12"
            ]
        ]
    },
    "server" : "someserver:27017"
}

Why is the regex query scanning all (sub)records when it has an index? What am I missing?

  • It can't be fast. Other than the simple optimization they make for "starts with", the search would need to look at every document that starts with those characters and run the regular expression on each. Commented Mar 21, 2013 at 10:43
  • Why would that be so? At least for this regex I see no reason why it can't be optimized. Granted, regex state engine needs to be coupled with search, but I thought this is what MongoDB is doing. If not, does it just take a part before "(" and use an index for that, while traversing and checking each other option one by one? This is just... well, "not optimal". :) Can someone confirm this? Commented Mar 21, 2013 at 14:20
  • 1
    The Mongodb index contains the full text string values of the field f. It's nothing more than that. It's checking the start of the string as it's rooted, and then it needs to run the RegEx on each document that matches the root. a_ is the only part that it can use as the root. Each doc that matches the root must be run against the remainder of the regex. You might be able to optimize it by storing the second part as another field in your document rather than embedding it in the string. (And then creating a compound index on both keys). Commented Mar 21, 2013 at 19:15
  • I see your point, and the results confirm it; thanks for the clarification! I am a bit disappointed with the implementation, though (it could be made much "smarter" with a few simple tweaks). This workaround does work: db.test.find({'$or': [{f: /^a_0_12$/}, {f: /^a__12$/}] }).explain() (and this is what I would expect the query engine to do automatically). Thanks! Commented Mar 22, 2013 at 11:26
  • I wouldn't expect it to do that automatically at all. :) Commented Mar 22, 2013 at 12:18
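
To make the restructuring suggested in the comments concrete, here is a rough sketch using the same test data; the collection name test2 and the field names s and n are invented for illustration:

> for (var i=0; i<100000; i++) db.test2.insert({f: [{s: 'a_0', n: String(i)}, {s: 'a_1', n: '2'}]})
> db.test2.ensureIndex({'f.s': 1, 'f.n': 1})
> db.test2.find({f: {$elemMatch: {s: 'a_0', n: '12'}}}).explain()

The alternation in the original regex can then be expressed as an $in on s (e.g. {s: {$in: ['a_0', 'a_1']}}) rather than a regex, which a compound index can bound exactly.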

1 Answer


Your test case has several characteristics that are unhelpful for regex and index usage:

  • each document includes an array of two values, both starting with "a_". Your regex /^a_(0)?_12$/ looks for a string starting with "a_" followed by an optional "0", so its only usable literal prefix is "a_", and every index entry (200k values) falls within the resulting bounds and has to be compared against the regex.
  • every document also contains the value a_1_2, which falls within those same bounds ["a_", "a`"), so every document gets examined irrespective of the index

Since you have a multikey (array) index, the number of index comparisons is actually worse than just doing a full collection scan of the 100k documents (200k index entries vs. 100k documents). You can test with a $natural hint to see (the regex in this test differs slightly from the original and matches no documents, hence n: 0, but the scan counts are what matter):

db.test.find({f: /^a_(0|)12$/ }).hint({$natural:1}).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 100000,
    "nscanned" : 100000,
    "nscannedObjectsAllPlans" : 100000,
    "nscannedAllPlans" : 100000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 192,
    "indexBounds" : {

    }
}

More random data or a more selective regex will result in fewer comparisons.
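
As a rough sketch of that point (the collection name test3 is made up, and Math.random-based strings stand in for "more random data"):

> for (var i=0; i<100000; i++) db.test3.insert({f: [Math.random().toString(36).slice(2)]})
> db.test3.ensureIndex({f: 1})
> db.test3.find({f: /^zz/}).explain()   // prefix "zz" bounds the scan to roughly ["zz", "z{"), a small slice of the index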
