I have a Node/NestJS backend application that uses MongoDB through Mongoose. For a "Get" function, I have set up an aggregation pipeline that first applies some "hard" filters, which exclude content entirely. Now I want to add "soft" filters that rank the search results and drop the ones that are irrelevant. The algorithm should use three fields on the document: title, description and tags, with title and tags carrying the most weight. Results should be excluded if their total relevance score falls below a certain threshold. I have checked several other StackOverflow posts on this, for example this one, but they all seem to deal with the "tags" field alone. A piece of documentation I found suggested using indexes for this, but I would prefer to do it through the aggregation framework, if I knew roughly how.

Below is code from another application that demonstrates the functionality:

        do {
          let reg
          if (Array.isArray(searchString)) {
            reg = new RegExp(searchString[i], 'gi')
          } else {
            reg = new RegExp(searchString, 'gi')
          }
          for (const note of this.notes) {
            const countTitle = (note.title.match(reg) || []).length
            note.searchScore += countTitle

            let countTags = 0

            for (const tag of note.tags) {
              const tagLength = (tag.match(reg) || []).length
              countTags += tagLength
            }

            note.searchScore += countTags * 0.5

            const countContent = (note.content.match(reg) || []).length

            note.searchScore += countContent * 0.3
          }
          i++
        } while (!Array.isArray(searchString) && i < searchString.length)
        this.toDisplay = this.notes.filter(
          f => f.searchScore > 0 + searchString.length / 4
        )
        this.showNew = false
        this.sortUp = false
        this.sortItems('relevance')
      } else {
        this.updateUI()
      }
    }

The algorithm above takes a string or an array of strings. Title, tags and description/content have weights of 1, 0.5 and 0.3 respectively. Items are filtered out entirely when their score is lower than or equal to the number of search terms divided by 4. The values can be adjusted, but in essence this is the algorithm I want to implement within the aggregation framework. Roughly what would that look like? Thanks in advance.
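
For reference, here is a minimal standalone sketch of the scoring described above, without the UI code. It assumes one pass per search term and a threshold of the number of terms divided by 4. The names `WEIGHTS`, `countMatches`, `scoreNote` and `rankNotes` are made up for illustration, and each term is treated as a regular expression pattern, as in the snippet above.

```javascript
// Weights as described above: title 1, tags 0.5, content 0.3.
const WEIGHTS = { title: 1, tags: 0.5, content: 0.3 };

// Count case-insensitive matches of `pattern` in `text`.
// `pattern` is used as a regular expression, as in the snippet above.
function countMatches(text, pattern) {
  return ((text || '').match(new RegExp(pattern, 'gi')) || []).length;
}

// Weighted number of matches of every term in one note.
function scoreNote(note, terms) {
  let score = 0;
  for (const term of terms) {
    const tagHits = (note.tags || []).reduce(
      (sum, tag) => sum + countMatches(tag, term),
      0
    );
    score += countMatches(note.title, term) * WEIGHTS.title;
    score += tagHits * WEIGHTS.tags;
    score += countMatches(note.content, term) * WEIGHTS.content;
  }
  return score;
}

// Keep notes scoring above (number of terms / 4), most relevant first.
function rankNotes(notes, searchInput) {
  const terms = Array.isArray(searchInput) ? searchInput : [searchInput];
  const threshold = terms.length / 4;
  return notes
    .map(note => ({ ...note, searchScore: scoreNote(note, terms) }))
    .filter(note => note.searchScore > threshold)
    .sort((a, b) => b.searchScore - a.searchScore);
}
```

Calling `rankNotes(notes, 'mongo')` or `rankNotes(notes, ['mongo', 'db'])` returns copies of the matching notes with a `searchScore` field attached.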

2 Answers

You can use text indexes in aggregation, but the $match stage that contains the $text query has to be the first stage of the pipeline.
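
A rough sketch of what the text-index route could look like, in case it is useful. Because $text has to sit in the first $match, the hard filters are merged into that same stage. The function name `buildTextSearchPipeline`, the collection name `notes`, the index weights and the `minScore` threshold are all assumptions for illustration. Note that a text index matches whole (stemmed) words rather than substrings, and its score is not a plain occurrence count, so the results will differ from a regex-based approach.

```javascript
// Assumed one-time index setup (e.g. in the mongo shell); the weights
// mirror the 1 / 0.5 / 0.3 ratio from the question:
//   db.notes.createIndex(
//     { title: 'text', tags: 'text', content: 'text' },
//     { weights: { title: 10, tags: 5, content: 3 } }
//   )

// Build a pipeline that filters, scores by text relevance, and sorts.
function buildTextSearchPipeline(hardFilters, searchTerms, minScore) {
  const terms = Array.isArray(searchTerms) ? searchTerms : [searchTerms];
  return [
    // $text is only allowed in the first stage, so the hard filters
    // share this $match with it.
    { $match: { ...hardFilters, $text: { $search: terms.join(' ') } } },
    // Expose the text score computed by the index as a regular field.
    { $addFields: { relevance: { $meta: 'textScore' } } },
    // Soft filter: drop results below the (assumed) threshold.
    { $match: { relevance: { $gte: minScore } } },
    { $sort: { relevance: -1 } },
  ];
}
```

The returned array can be passed straight to `collection.aggregate(...)`.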

Here's my take, with only one search term:

const search = new RegExp(searchString, 'i');

collection.aggregate().match(hardFilters)
  // This step is not really necessary
  .match({
    $or: [{
      tags: search
    }, {
      title: search
    }, {
      content: search
    }]
  })
  .set({
    relevance: {
      $sum: [
          {$multiply: [{$size: {$regexFindAll: {input: "$title", regex: search}}}, 100]},
          {$multiply: [{$size: {$regexFindAll: {input: {
              $reduce: {
                 input: "$tags",
                 initialValue: "",
                 in: { $concat : ["$$value", " ", "$$this"] }
              }
          }, regex: search}}}, 50]},
          {$multiply: [{$size: {$regexFindAll: {input: "$content", regex: search}}}, 30]},
      ]
    }
  })
  .match({relevance: {$gte: searchString.length * 25}})
  .sort({relevance: -1});

With multiple search terms, maybe you could do this:

const search = new RegExp(searchStrings.join('|'), 'i');

It's possible to score each search term individually, if you really want, by doing something like:

    relevance: {
      $sum: [].concat(...searches.map(search => [
          {$multiply: [{$size: {$regexFindAll: {input: "$title", regex: search}}}, 100]},
          {$multiply: [{$size: {$regexFindAll: {input: ..., regex: search}}}, 50]},
          {$multiply: [{$size: {$regexFindAll: {input: "$content", regex: search}}}, 30]},
      ]))
    }
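
To make the per-term variant concrete, here is a hedged sketch of a helper that builds the whole relevance stage from an array of RegExp objects. The helper names `weightedCount` and `buildRelevanceStage` are invented for this example. The tags input reuses the $reduce/$concat expression from the first block, and the weights 100/50/30 are the ones used above.

```javascript
// All tags joined into one string, because $regexFindAll needs a
// string as input (same $reduce expression as in the first block).
const joinedTags = {
  $reduce: {
    input: '$tags',
    initialValue: '',
    in: { $concat: ['$$value', ' ', '$$this'] },
  },
};

// Expression: (number of matches of `regex` in `input`) * weight.
function weightedCount(input, regex, weight) {
  return {
    $multiply: [{ $size: { $regexFindAll: { input, regex } } }, weight],
  };
}

// Build a $set stage that sums the weighted counts for every regex.
function buildRelevanceStage(searches) {
  const parts = searches.flatMap(search => [
    weightedCount('$title', search, 100),
    weightedCount(joinedTags, search, 50),
    weightedCount('$content', search, 30),
  ]);
  return { $set: { relevance: { $sum: parts } } };
}
```

Usage would be along the lines of `buildRelevanceStage(terms.map(t => new RegExp(t, 'i')))`, placed between the hard-filter $match and the threshold $match.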

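And maybe you could add word-boundary checks, regardless of multiple or single search:

// "\\b" is needed because "\b" in a string literal is a backspace character;
// the (?:...) group makes the boundaries apply to every term, not just the first and last.
const search = new RegExp("\\b(?:" + searchStrings.join('|') + ")\\b", 'i');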
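
When a pattern like this is built from user input, there are a few things to watch out for: inside a JavaScript string literal the word boundary has to be written as `"\\b"` (a bare `"\b"` is a backspace character), the alternation needs a group so the boundaries apply to every term, and regex metacharacters in the terms should be escaped. A minimal sketch, with `escapeRegExp` and `buildSearchRegex` as made-up helper names:

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Match any of the terms as a whole word, case-insensitively.
// Works best for terms that start and end with a word character,
// since \b only matches between a word and a non-word character.
function buildSearchRegex(terms) {
  const alternatives = terms.map(escapeRegExp).join('|');
  return new RegExp('\\b(?:' + alternatives + ')\\b', 'i');
}
```

For example, `buildSearchRegex(['cat', 'dog'])` matches "a dog barks" but not "hotdog" or "concatenate".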

6 Comments

Well, you forced me to update MongoDB to 4.2 due to the $set method in aggregation not being supported by 4.0 :P I noticed one issue right away with the first block. The tags are stored in the database as an array of strings. This could of course be changed... but wouldn't there be a way to join the strings together, like the JS/TS .join() method? $concatArray gave me a similar error: "$regexFindAll needs 'input' to be of type string".
Which part of the current query fails? I use $concat when assigning relevance, and the first optional $match should work too
On this line specifically: {$multiply: [{$size: {$regexFindAll: {input: {$concat: "$tags"}, regex: searchString}}}, 50]}, I have changed "search" to "searchString" here because search is the unprocessed input stream from the route. The error that is being thrown is the following: "$concat only supports strings, not array" Do you use MongoDB 4.4? I just updated to 4.2. I don't know if the behavior of $concat was changed in 4.4, if yes, I might have to update again.
Are you sure you used $concat and not $concatArrays?
@Saddex you're right. It seems like a bug to me. Anyway the updated answer should be good!

Given that Atlas Search returns documents sorted by relevancy by default and uses an inverted index, it seems like that is the tool for the job here. Relevancy will be much better and more customizable. Depending on what you are building, you also get other features you'll likely benefit from, like highlighting and autocomplete.
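
As a rough illustration only, a weighted Atlas Search query could be assembled along these lines. This assumes an Atlas cluster with a search index named `default` covering title, tags and content. The function name `buildAtlasSearchPipeline`, the boost values and the `minScore` cutoff are made up for the example, and Atlas Search scores are not occurrence counts, so any threshold would need tuning.

```javascript
// Build a $search pipeline that boosts title over tags over content.
function buildAtlasSearchPipeline(query, minScore) {
  // One "text" clause per field, each with its own score boost.
  const boosted = (path, boost) => ({
    text: { query, path, score: { boost: { value: boost } } },
  });
  return [
    {
      // $search has to be the first stage of the pipeline.
      $search: {
        index: 'default', // assumed index name
        compound: {
          should: [
            boosted('title', 10),
            boosted('tags', 5),
            boosted('content', 3),
          ],
          minimumShouldMatch: 1,
        },
      },
    },
    // Expose the search score so it can be filtered on.
    { $addFields: { relevance: { $meta: 'searchScore' } } },
    { $match: { relevance: { $gte: minScore } } },
  ];
}
```

Results already come back ordered by score, so no explicit $sort is added here.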

2 Comments

That's a good tip that I might consider in the future, but I am sticking with my local installation of MongoDB on my DigitalOcean VPS for now. Thanks anyway!
@Saddex I think that makes a lot of sense. When you get started, if you have any trouble please ping me here or elsewhere. I will be happy to help. I love MongoDB and search.
