
I am trying to run some wildcard/regex-based queries on a Mongo cluster from the Java driver.

Mongo replica set config: 3-member replica set, 16 CPUs (hyperthreaded), 24 GB RAM, Linux x86_64. Collection size: 6M rows, 7 GB of data.

The client is localhost (Mac OS X 10.8) with the latest mongo-java driver.

Query using the Java driver with readPref = primaryPreferred:

 { "$and" : [{ "$or" : [ { "country" : "united states"}]} , { "$or" : [ { "registering_organization" : { "$regex" : "^.*itt.*hartford.*$"}} , { "registering_organization" : { "$regex" : "^.*met.*life.*$"}} , { "registering_organization" : { "$regex" : "^.*cardinal.*health.*$"}}]}]}

I have a regular index on both "country" and "registering_organization". But as per the Mongo docs, a single query can utilize only one index, and I can see that from explain() on the above query as well.

So my question is: what is the best alternative to achieve better performance for the above query? Should I break up the $and operations and do the intersection in memory? Going further, I will have $not operations in the query too.

I think my application may turn into reporting/analytics in the future, but that is not happening any time soon and I am not looking to design for it yet.

1 Answer

There are so many things wrong with this query.

Your nested conditional with regexes will never get faster in MongoDB. MongoDB is not the best tool for "data discovery" (i.e. ad-hoc, multi-conditional queries for uncovering unknown information). MongoDB is blazing fast when you know the metrics you are generating, but not for data discovery.

If this is a common query you are running, then I would create an attribute called "united_states_or_health_care", and set the value to the timestamp of the create date. With this method, you are moving your logic from your query to your document schema. This is one common way to think about scaling with MongoDB.
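
For instance, here is a minimal sketch of that backfill with the current sync Java driver. coll and the regex patterns are carried over from the question; using new Date() instead of the document's own create date is an assumption made to keep the sketch short:

    import static com.mongodb.client.model.Filters.*;

    import java.util.Date;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Updates;
    import org.bson.conversions.Bson;

    // coll is the MongoCollection<Document> from the question.
    // Tag every matching document once with the precomputed attribute.
    Bson target = and(
            eq("country", "united states"),
            or(regex("registering_organization", "itt.*hartford"),
               regex("registering_organization", "met.*life"),
               regex("registering_organization", "cardinal.*health")));

    coll.updateMany(target, Updates.set("united_states_or_health_care", new Date()));

    // Index the attribute so later reads never touch the regexes again.
    coll.createIndex(Indexes.ascending("united_states_or_health_care"));

    // Reporting queries become plain indexed lookups.
    long count = coll.countDocuments(exists("united_states_or_health_care"));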

If you are doing data discovery, you have a few different options:

  • Have your application concatenate the results of the different queries (see the sketch after this list)
  • Run the query on a secondary MongoDB member and accept the slower performance
  • Pipe your data to PostgreSQL using MoSQL; Postgres will run these data-discovery queries much faster.
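
As a concrete illustration of the first option, here is a sketch (current sync Java driver; coll and the field names come from the question, and ObjectId _ids are assumed) that runs each organization pattern as its own query and merges the results in the application, deduplicating on _id:

    import static com.mongodb.client.model.Filters.*;

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    // coll is the MongoCollection<Document> from the question.
    String[] patterns = { "itt.*hartford", "met.*life", "cardinal.*health" };
    Map<ObjectId, Document> merged = new LinkedHashMap<>();

    for (String p : patterns) {
        // Each branch is a separate, simpler query; the application does the "or".
        coll.find(and(eq("country", "united states"),
                      regex("registering_organization", p)))
            .forEach(doc -> merged.put(doc.getObjectId("_id"), doc));
    }

    System.out.println("matched documents: " + merged.size());

The same pattern covers the $and case asked about in the question: keep only the _id sets from each query and intersect them (Set.retainAll) before fetching the final documents.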

Another Tip:

Your regexes are not anchored in a way that can use an index (a pattern that begins with ^.* still has to scan every value). It would be best to run your "registering_organization" attribute through a "findable_registering_organization" filter. The filter would break apart the organization into an array of queryable name subsets, and you would quit using the regexes. +2 points if you can filter incoming names by an industry lookup.
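
A sketch of that idea with the current sync Java driver; the "findable_registering_organization" name comes from the answer, while the whitespace/punctuation tokenizer and the reuse of coll from the question are assumptions:

    import static com.mongodb.client.model.Filters.*;

    import java.util.Arrays;
    import java.util.List;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    // Backfill: split each organization name into lower-cased tokens and store
    // them as an array; an index on the array field is multikey, so individual
    // tokens become directly indexable.
    for (Document doc : coll.find(exists("registering_organization"))) {
        String name = doc.getString("registering_organization");
        List<String> tokens = Arrays.asList(name.toLowerCase().split("\\W+"));
        coll.updateOne(eq("_id", doc.getObjectId("_id")),
                       Updates.set("findable_registering_organization", tokens));
    }
    coll.createIndex(Indexes.ascending("findable_registering_organization"));

    // Query with exact token matches instead of unanchored regexes.
    long n = coll.countDocuments(and(
            eq("country", "united states"),
            all("findable_registering_organization", "met", "life")));

As the comments below point out, this trades away the ordering between tokens, which may be an acceptable tradeoff.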


2 Comments

1) I agree MongoDB is not the best tool for data discovery. I assume MySQL would be about as fast as PostgreSQL, since we use MySQL heavily in other projects. 2) I like the idea of a tokenized "findable_registering_organization" field. However, if I am not wrong, I will be losing the ordering property of the wildcard with this approach. I see that as an acceptable tradeoff. 3) I can utilize FTS like Lucene to get better wildcard search.
One more thing about the Mongo $or query: I ran just the $or query with multiple clauses. According to the docs, that query should run in parallel, but I don't see it running in parallel. Though I have 16 cores on the Mongo server, I can see only one core being utilized!
