Find match over Array of RegEx in MongoDB Collection

Question

Say I have a collection with these fields:

{
    "category" : "ONE",
    "data": [
        {
            "regex": "/^[0-9]{2}$/",
            "type" : "TYPE1"
        },
        {
            "regex": "/^[a-z]{3}$/",
            "type" : "TYPE2"
        }
        // etc
    ]
}

So my input is "abc" so I'd like to obtain the corresponding type (or best match, although initially I'm assuming RegExes are exclusive). Is there any possible way to achieve this with decent performance? (that would be excluding iterating over each item of the RegEx array)

Please note the schema could be re-arranged if possible, as this project is still in the design phase. So alternatives would be welcomed.

Each category can have around 100 - 150 RegExes. I plan to have around 300 categories. But I do know that types are mutually exclusive.

Real world example for one category:

type1=^34[0-9]{4}$, 
type2=^54[0-9]{4}$, 
type3=^39[0-9]{4}$, 
type4=^1[5-9]{2}$, 
type5=^2[4-9]{2,3}$

How many patterns or types you think you will have: 10, 100, 1000, more? — Casimir et Hippolyte
– Casimir et Hippolyte, Commented Oct 17, 2014 at 20:08
MongoDB doesn't have a feature for searching for regex by matching string. The only way to know if a regex will match a string is to see if it matches, so you'll have to check each regexp against your input string. — wdberkeley
– wdberkeley, Commented Oct 20, 2014 at 16:19
So far you have shown X amount of letters and numbers. The only remaining text items are punctuation. When you show a richer sampling of regex, I would say that you cannot define unique regex. I'd say there has to be overlap somewhere, yielding probably several types. So your assumption of exclusivity is basically invalid. Once you realize that you will realize that you will have to scan with each regex seperately, can't be avoided. The problem then becomes to assign value (weight) each type relative to other types in the weighted sequence of searches. — user557597
– user557597, Commented Oct 20, 2014 at 18:03
By saving RegExps in MongoDB as Strings, you WILL have to test each of them against your search string and mongoDB has some magic to help you with this (integrated ECMAscript in the core) — CSᵠ
– CSᵠ, Commented Oct 21, 2014 at 9:00
More info about this problem, including the usecase would help in getting a tailored answer... — CSᵠ
– CSᵠ, Commented Oct 21, 2014 at 9:13

CSᵠ · Accepted Answer · 2014-10-21 19:36:03Z

Describing the RegEx (Divide et Impera) would greatly help in limiting the number of Documents needed to be processed.

Some ideas in this direction:

RegEx accepting length (fixed, min, max)
POSIX style character classes ([:alpha:], [:digit:], [:alnum:], etc.)
Tree like Document structure (umm)

Implementing each of these would add to the complexity (code and/or manual input) for Insertion and also some overhead for describing the searchterm before the query.

Having mutually exclusive types in a category simplifies things, but what about between categories?

300 categories @ 100-150 RegExps/category => 30k to 45k RegExps

... some would surely be exact duplicates if not most of them.

In this approach I'll try to minimise the total number of Documents to be stored/queried in a reversed style vs. your initial proposed 'schema'.
_{Note: included only string lengths in this demo for narrowing, this may come naturally for manual input as it could reinforce a visual check over the RegEx}

Consider rewiting the regexes Collection with Documents as follows:

{
   "max_length": NumberLong(2),
   "min_length": NumberLong(2),
   "regex": "^[0-9][2]$",
   "types": [
     "ONE/TYPE1",
     "NINE/TYPE6"
  ]
},
{
   "max_length": NumberLong(4),
   "min_length": NumberLong(3),
   "regex": "^2[4-9][2,3]$",
   "types": [
     "ONE/TYPE5",
     "TWO/TYPE2",
     "SIX/TYPE8"
  ]
},
{
   "max_length": NumberLong(6),
   "min_length": NumberLong(6),
   "regex": "^39[0-9][4]$",
   "types": [
     "ONE/TYPE3",
     "SIX/TYPE2"
  ]
},
{
   "max_length": NumberLong(3),
   "min_length": NumberLong(3),
   "regex": "^[a-z][3]$",
   "types": [
     "ONE/TYPE2"
  ]
}

.. each unique RegEx as it's own document, having Categories it belongs to (extensible to multiple types per category)

Demo Aggregation code:

function () {

   match=null;
   query='abc';

   db.regexes.aggregate(
    {$match: {
        max_length: {$gte: query.length},
        min_length: {$lte: query.length},
        types: /^ONE\//
        }
    },
    {$project: {
        regex: 1, 
        types: 1, 
        _id:0
        }
    }
   ).result.some(function(re){ 
       if (query.match(new RegExp(re.regex))) return match=re.types;
   });
   return match;
}

Return for 'abc' query:

[
   "ONE/TYPE2"
]

this will run against only these two Documents:

{
   "regex": "^2[4-9][2,3]$",
   "types": [
     "ONE/TYPE5",
     "TWO/TYPE2",
     "SIX/TYPE8"
  ]
},
 {
   "regex": "^[a-z][3]$",
   "types": [
     "ONE/TYPE2"
  ]
}

narrowed by the length 3 and having the category ONE.

Could be narrowed even further by implementing POSIX descriptors (easy to test against the searchterm but have to input 2 RegExps in the DB)

Collectives™ on Stack Overflow

Find match over Array of RegEx in MongoDB Collection

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related