
I've got a MongoDB collection, which looks like this:

// sites
// note that these urls all have paths, this is important.
// The path can only be longer, e.g. amazon.com/Coffee-Mug
[
  {
    name: "MySite",
    urls: ['google.com/search', 'amazon.com/Coffee', 'amazon.com/Mug']
  },
  {
    name: "OtherSite",
    urls: ['google.com/search', 'microsoft.com/en-us']
  }
]

What I'm trying to do is the following:

class Service {
  /**
   * @param findUrl A full url, like "https://www.google.com/search?q=stackoverflow"
   * or "https://www.amazon.com/Coffee-Program-Ceramic-Makes-Programmers/dp/B07D2XJLLG/"
   */
  public async lookup(findUrl: string) {
    const trimmed = trim(findUrl); // remove variables and https, but NOT the path!
    // return the "Site" in which the base url is matched with the full url
    // see description below
  }
}
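The `trim` call above is not shown in the question. A minimal sketch of what it might look like, assuming it should drop the scheme, a leading `www.`, the query string and any trailing slash while keeping the path (the name `trimUrl` and the exact rules are assumptions, not part of the original code):

```typescript
// Hypothetical helper, not from the question: normalises a full url to "host/path".
function trimUrl(findUrl: string): string {
  // The WHATWG URL parser throws a TypeError if the input has no scheme.
  const parsed = new URL(findUrl);
  // Drop a leading "www." so the host lines up with the stored urls.
  const host = parsed.hostname.replace(/^www\./i, "");
  // pathname excludes the query string and fragment; strip trailing slashes.
  const path = parsed.pathname.replace(/\/+$/, "");
  return host + path;
}
```

With this sketch, `trimUrl("https://www.google.com/search?q=stackoverflow")` yields `google.com/search`.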

For example, using these cases

Case 1:

  • url = 'https://www.amazon.com/Coffee-Program-Ceramic-Makes-Programmers/dp/B07D2XJLLG/'
  • returned site(s): [MySite]

Case 2:

  • url = 'https://www.google.com/search?q=stackoverflow'
  • returned site(s): [MySite, OtherSite]

Case 3 (same as case 1 but with other value):

  • url = 'https://www.microsoft.com/en-us/surface'
  • returned site(s): [OtherSite]

Case 4 (when not to match):

  • url = 'https://microsoft.com/nl-nl' OR
  • url = 'https://microsoft.com'
  • returned site(s): []
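The four cases above boil down to one rule: a site matches when at least one of its stored urls is a prefix of the trimmed input. A minimal in-memory sketch of that rule, outside MongoDB, assuming the input has already been trimmed (the `Site` interface and `matchSites` name are illustrative only):

```typescript
interface Site {
  name: string;
  urls: string[];
}

const sites: Site[] = [
  { name: "MySite", urls: ["google.com/search", "amazon.com/Coffee", "amazon.com/Mug"] },
  { name: "OtherSite", urls: ["google.com/search", "microsoft.com/en-us"] },
];

// Returns the names of all sites that have at least one stored url
// that is a prefix of the trimmed input url.
function matchSites(trimmed: string, all: Site[]): string[] {
  return all
    .filter((site) => site.urls.some((stored) => trimmed.indexOf(stored) === 0))
    .map((site) => site.name);
}
```

This is only a reference for the expected behaviour; it does not answer how to express the same rule as a MongoDB query.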

I've tried to do something like this:

Site.find({ urls: { $in: [trimmed] } })

The above kind of works, but it only finds exact matches. I want a site to match when one of its stored urls is the start of the url passed to the function. How does one do this?

I was pointed to "check if field is substring of a string or text search on MongoDB", but that approach is too inaccurate: I can enter the base domain without a path and it still finds a site, which must not happen.

  • Does this answer your question? check if field is substring of a string or text search on MongoDB Commented Nov 5, 2022 at 19:00
  • @TobiasS. This does not seem to work, it's way too inaccurate. The path of the subdomain is important to match as well Commented Nov 5, 2022 at 19:18
  • did you try it with the aggregation as the second part of the answer suggests? Commented Nov 5, 2022 at 19:23
  • @TobiasS. I did, but the regexMatch aggregation only works on strings, not string arrays. The match does not return anything at all. Using: db.getCollection("sites").aggregate([ { $match: { $or: [ { urls: { $in: ["microsoft.com/en-us/dadadadada/blablabla"] } } ] } } ], { collation: { locale: "en", strength: 1 } }); Nonetheless, regexMatchAll seems to be a thing for arrays, but that's something I'm figuring out currently. Commented Nov 5, 2022 at 19:27

2 Answers


One option is to use $reduce and $filter to iterate over both the input urls and each site's urls, matching them with $regexMatch. Then, since the requested output is an array of site names rather than an array of documents, we can use $facet to handle the case of no matches at all.

db.collection.aggregate([
  {$match: {$expr: {
      $gt: [{
        $size: {
          $reduce: {
            input: urls_array, // array of trimmed input urls, e.g. ["microsoft.com/en-us/surface"]
            initialValue: [],
            in: {$concatArrays: [
                "$$value",
                {$filter: {
                    input: "$urls",
                    as: "url",
                    cond: {$regexMatch: {input: "$$this", regex: "$$url"}}
                 }}
             ]}
           }
         }
      }, 0]
  }}},
  {$facet: {res: [{$group: {_id: 0, res: {$addToSet: "$name"}}}]}},
  {$replaceRoot: {newRoot: {$mergeObjects: [{res: []}, {$first: "$res"}]}}}
])

See how it works on the playground example


2 Comments

This is great! Is it also possible to return the entire collection of documents in res instead of just $name?
Sure, in that case use only the $match step and remove the last two steps. You will get an array of the original documents that match the conditions.
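A sketch of that $match-only variant as a reusable stage builder, assuming the caller passes already trimmed urls (the helper name `buildMatchStage` and the `sites` collection handle in the usage note are assumptions):

```typescript
// Hypothetical helper: builds the $match stage from the answer above for a
// list of trimmed input urls. Without the $facet and $replaceRoot stages,
// the pipeline returns the full matching documents.
function buildMatchStage(trimmedUrls: string[]): Record<string, unknown> {
  return {
    $match: {
      $expr: {
        $gt: [
          {
            $size: {
              $reduce: {
                input: trimmedUrls, // "$$this" below is one of these urls
                initialValue: [],
                in: {
                  $concatArrays: [
                    "$$value",
                    {
                      $filter: {
                        input: "$urls", // stored urls of the current document
                        as: "url",
                        // each stored url is used as an unanchored regex
                        cond: { $regexMatch: { input: "$$this", regex: "$$url" } },
                      },
                    },
                  ],
                },
              },
            },
          },
          0,
        ],
      },
    },
  };
}
```

With a driver collection handle named `sites`, usage would look roughly like `sites.aggregate([buildMatchStage(["microsoft.com/en-us/surface"])])`.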

I can't think of any straightforward way. You can use aggregation operators, but the query will run slowly because it cannot use an index. If you really want to do it this way, here is one approach:

  • $expr enables the use of aggregation operators in the query part
  • $map iterates over the urls array and returns an array of boolean values
  • $replaceAll replaces / with . in each stored url string
  • $regexMatch matches the input string against the string generated above, which is used as the regular expression; it returns true if the pattern matches, otherwise false
  • $in checks whether the resulting array contains true
Site.find({
  $expr: {
    $in: [
      true,
      {
        $map: {
          input: "$urls",
          in: {
            $regexMatch: {
              input: trimmed, // your input variable
              regex: {
                $replaceAll: {
                  input: "$$this",
                  find: "/",
                  replacement: "."
                }
              },
              options: "i"
            }
          }
        }
      }
    ]
  }
})
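To see what this query evaluates per document, the operator chain can be mirrored in plain TypeScript. This is only an approximation, since MongoDB uses PCRE while this sketch uses JavaScript's `RegExp`, and the function name `matchesAny` is illustrative:

```typescript
// Mirrors the $map / $replaceAll / $regexMatch / $in chain for the urls
// array of a single document.
function matchesAny(trimmed: string, urls: string[]): boolean {
  const flags = urls.map((stored) => {
    // $replaceAll: every "/" becomes ".", which matches any character in a regex
    const pattern = stored.split("/").join(".");
    // $regexMatch with options "i": unanchored, case-insensitive
    return new RegExp(pattern, "i").test(trimmed);
  });
  // $in: [true, flags]
  return flags.indexOf(true) !== -1;
}
```

Because the pattern is unanchored, an input such as `notmicrosoft.com/en-us` also matches a stored `microsoft.com/en-us`, so this behaves like a substring match rather than a strict prefix match.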

