I would like to know what kind of internal indexing algorithm MongoDB uses. I have some data to store, and each document (row) has an id, which is probably a unique hash value (e.g. generated by md5() or another hash algorithm). So I would like to understand which hash method I should use to create the id, so that it is fast for MongoDB to index it. :)
2 Answers
Yes, MongoDB uses B-trees. From the documentation:
An index is a data structure that collects information about the values of the specified fields in the documents of a collection. This data structure is used by Mongo's query optimizer to quickly sort through and order the documents in a collection. Formally speaking, these indexes are implemented as "B-Tree" indexes.
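To see why an ordered index matters, here is a toy sketch (plain Python, not MongoDB's actual implementation): a sorted key list lets a lookup binary-search in O(log n) instead of scanning every document, which is the same benefit a B-tree index provides on disk.

```python
import bisect

# Hypothetical collection: 1000 documents with an integer _id.
docs = [{"_id": i, "name": f"user{i}"} for i in range(1000)]

# "Index" on _id: sorted (key, position) pairs, like B-tree leaf entries.
index = sorted((d["_id"], pos) for pos, d in enumerate(docs))
keys = [k for k, _ in index]

def find(target):
    # O(log n) search, analogous to descending a B-tree.
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return docs[index[i][1]]
    return None

print(find(742)["name"])  # user742
```

Without the index, the same lookup would be a full collection scan, touching every document.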
I suggest using MongoDB's ObjectId for the collection's _id, and not worrying about "How do I create _id?" at all. Generating _id is a job for MongoDB, not for the developer. Your effort is better spent on schema design, indexes, etc.
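For context, an ObjectId is a 12-byte value: a 4-byte big-endian Unix timestamp, 5 random bytes generated once per process, and a 3-byte incrementing counter. A minimal sketch of that layout in plain Python (in practice, the driver's `bson.ObjectId()` does this for you):

```python
import os
import struct
import time

# Per-process values, initialized once (as real drivers do).
_random5 = os.urandom(5)
_counter = int.from_bytes(os.urandom(3), "big")

def object_id() -> str:
    """Return a 24-character hex string with the ObjectId byte layout."""
    global _counter
    ts = struct.pack(">I", int(time.time()))   # 4-byte big-endian timestamp
    _counter = (_counter + 1) % 0x1000000
    cnt = _counter.to_bytes(3, "big")          # 3-byte counter
    return (ts + _random5 + cnt).hex()         # 12 bytes -> 24 hex chars

print(object_id())
```

Because the timestamp leads, freshly generated ids are roughly monotonically increasing, which keeps B-tree inserts appending near the right edge of the tree rather than scattering writes randomly (as an md5-based id would).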
For MongoDB 3.2+, the default storage engine is WiredTiger, and a B+ tree is used to store data:
WiredTiger maintains a table's data in memory using a data structure called a B-Tree (B+ tree, to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
WiredTiger also supports an LSM tree as an alternative way to store data:
WiredTiger supports Log-Structured Merge Trees, where updates are buffered in small files that fit in cache for fast random updates, then automatically merged into larger files in the background so that read latency approaches that of traditional B-tree files. LSM trees automatically create Bloom filters to avoid unnecessary reads from files that cannot contain matching keys.
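A toy Bloom filter (illustrative only, not WiredTiger's implementation) shows why this avoids reads: before opening an on-disk file, the engine checks the file's filter, and a negative answer guarantees the key is absent, so the read can be skipped entirely.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:999"))  # almost certainly False
```

The trade-off is that `might_contain` can return false positives (causing a wasted file read) but never false negatives, which is exactly the property an LSM engine needs.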
| | LSM tree | B+ tree |
|---|---|---|
| Pros | More space-efficient thanks to append-only writes and periodic compaction<br>Better performance with fast-growing data and sequential writes<br>No fragmentation penalty because of how SSTable files are written and updated | Excellent performance with read-intensive workloads |
| Cons | CPU overhead for compaction can meaningfully affect performance and efficiency if not tuned appropriately<br>More tuning options increase flexibility but can seem complex to developers and operators<br>Reads may have to check several files, so Bloom filters are needed to keep read/search performance competitive | Increased space overhead to deal with fragmentation<br>Random writes cause slower create/insert behavior<br>Concurrent writes may require locks, which slows write performance<br>Scaling challenges, especially with >50% write transactions |
Which to choose:
- If you don't require extreme write throughput, a B-tree is likely the better choice: read throughput is better, and high volumes of writes can still be sustained.
- If your workload requires very high write throughput, an LSM tree is the better choice.
"An LSM tree is also used to store data"? Only one of the two would actually be used at a time, right?