
I am trying to run some wildcard/regex-based queries on a Mongo cluster from the Java driver.

Mongo replica set config: 3-member replica set, 16 CPUs (hyperthreaded), 24 GB RAM, Linux x86_64. Collection size: 6M rows, 7 GB of data.

The client is localhost (Mac OS X 10.8) with the latest mongo-java driver.

Query using the Java driver with readPref = primaryPreferred:

 { "$and" : [{ "$or" : [ { "country" : "united states"}]} , { "$or" : [ { "registering_organization" : { "$regex" : "^.*itt.*hartford.*$"}} , { "registering_organization" : { "$regex" : "^.*met.*life.*$"}} , { "registering_organization" : { "$regex" : "^.*cardinal.*health.*$"}}]}]}

I have a regular index on both "country" and "registering_organization". But as per the Mongo docs, a single query can utilize only one index, and I can see that from explain() on the above query as well.

So my question is: what is the best alternative to achieve better performance for the above query? Should I break up the $and operations and do the intersection in memory? Going further, I will have $not operations in the query too.

I think my application may turn into reporting/analytics in the future, but that is not happening any time soon and I am not looking to design for it yet.

1 Answer

There are so many things wrong with this query.

Your nested conditional with regexes will never get faster in MongoDB. MongoDB is not the best tool for "data discovery" (i.e. ad-hoc, multi-conditional queries for uncovering unknown information). MongoDB is blazing fast when you know the metrics you are generating, but not for data discovery.

If this is a common query you are running, then I would create an attribute called "united_states_or_health_care", and set the value to the timestamp of the create date. With this method, you are moving your logic from your query to your document schema. This is one common way to think about scaling with MongoDB.
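
For instance, here is a minimal sketch of that backfill with the current sync Java driver. coll and the regex patterns are carried over from the question; using new Date() instead of the document's own create date is an assumption made to keep the sketch short:

    import static com.mongodb.client.model.Filters.*;

    import java.util.Date;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Updates;
    import org.bson.conversions.Bson;

    // coll is the MongoCollection<Document> from the question.
    // Tag every matching document once with the precomputed attribute.
    Bson target = and(
            eq("country", "united states"),
            or(regex("registering_organization", "itt.*hartford"),
               regex("registering_organization", "met.*life"),
               regex("registering_organization", "cardinal.*health")));

    coll.updateMany(target, Updates.set("united_states_or_health_care", new Date()));

    // Index the attribute so later reads never touch the regexes again.
    coll.createIndex(Indexes.ascending("united_states_or_health_care"));

    // Reporting queries become plain indexed lookups.
    long count = coll.countDocuments(exists("united_states_or_health_care"));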

If you are doing data discovery, you have a few different options:

  • Have your application concatenate the results of the different queries (see the sketch after this list)
  • Run the query on a secondary MongoDB member and accept the slower performance
  • Pipe your data to PostgreSQL using MoSQL; Postgres will run these data-discovery queries much faster.
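
As a concrete illustration of the first option, here is a sketch (current sync Java driver; coll and the field names come from the question, and ObjectId _ids are assumed) that runs each organization pattern as its own query and merges the results in the application, deduplicating on _id:

    import static com.mongodb.client.model.Filters.*;

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    // coll is the MongoCollection<Document> from the question.
    String[] patterns = { "itt.*hartford", "met.*life", "cardinal.*health" };
    Map<ObjectId, Document> merged = new LinkedHashMap<>();

    for (String p : patterns) {
        // Each branch is a separate, simpler query; the application does the "or".
        coll.find(and(eq("country", "united states"),
                      regex("registering_organization", p)))
            .forEach(doc -> merged.put(doc.getObjectId("_id"), doc));
    }

    System.out.println("matched documents: " + merged.size());

The same pattern covers the $and case asked about in the question: keep only the _id sets from each query and intersect them (Set.retainAll) before fetching the final documents.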

Another Tip:

Your regexes are not anchored in a way that can use an index (a pattern that begins with ^.* still has to scan every value). It would be best to run your "registering_organization" attribute through a "findable_registering_organization" filter. The filter would break apart the organization into an array of queryable name subsets, and you would quit using the regexes. +2 points if you can filter incoming names by an industry lookup.
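
A sketch of that idea with the current sync Java driver; the "findable_registering_organization" name comes from the answer, while the whitespace/punctuation tokenizer and the reuse of coll from the question are assumptions:

    import static com.mongodb.client.model.Filters.*;

    import java.util.Arrays;
    import java.util.List;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    // Backfill: split each organization name into lower-cased tokens and store
    // them as an array; an index on the array field is multikey, so individual
    // tokens become directly indexable.
    for (Document doc : coll.find(exists("registering_organization"))) {
        String name = doc.getString("registering_organization");
        List<String> tokens = Arrays.asList(name.toLowerCase().split("\\W+"));
        coll.updateOne(eq("_id", doc.getObjectId("_id")),
                       Updates.set("findable_registering_organization", tokens));
    }
    coll.createIndex(Indexes.ascending("findable_registering_organization"));

    // Query with exact token matches instead of unanchored regexes.
    long n = coll.countDocuments(and(
            eq("country", "united states"),
            all("findable_registering_organization", "met", "life")));

As the comments below point out, this trades away the ordering between tokens, which may be an acceptable tradeoff.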


2 Comments

1) I agree MongoDB is not the best tool for data discovery. I assume MySQL would be about as fast as PostgreSQL, since we use MySQL heavily in other projects. 2) I like the idea of a tokenized "findable_registering_organization" field. However, if I am not wrong, I will be losing the ordering property of the wildcard with this approach. I see that as an acceptable tradeoff. 3) I can utilize FTS like Lucene to get better wildcard search.
One more thing about the Mongo $or query: I ran just the $or query with multiple clauses. According to the docs, that query should run in parallel, but I don't see it running in parallel. Though I have 16 cores on the Mongo server, I can see only one core being utilized!
