@@ -24,13 +24,20 @@ The current implementation of GiST supports:
2424 * provides NULL-safe interface to GiST core
2525 * Concurrency
2626 * Recovery support via WAL logging
27+ * Buffering build algorithm
2728
2829The support for concurrency implemented in PostgreSQL was developed based on
2930the paper "Access Methods for Next-Generation Database Systems" by
3031Marcel Kornaker:
3132
3233 http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
3334
35+ Buffering build algorithm for GiST was developed based on the paper "Efficient
36+ Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
37+ and Jeffrey Scott Vitter.
38+
39+ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
40+
3441The original algorithms were modified in several ways:
3542
3643* They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
278285with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
279286crashed in the middle to completion by adding the downlink in the parent.
280287
288+ Buffering build algorithm
289+ -------------------------
290+
291+ In the buffering index build algorithm, some or all internal nodes have a
292+ buffer attached to them. When a tuple is inserted at the top, the descend down
293+ the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
294+ the buffer. When a buffer gets too full, all the tuples in it are flushed to
295+ the lower level, where they again hit lower level buffers or leaf pages. This
296+ makes the insertions happen in more of a breadth-first than depth-first order,
297+ which greatly reduces the amount of random I/O required.
298+
299+ In the algorithm, levels are numbered so that leaf pages have level zero,
300+ and internal node levels count up from 1. This numbering ensures that a page's
301+ level number never changes, even when the root page is split.
302+
303+ Level Tree
304+
305+ 3 *
306+ / \
307+ 2 * *
308+ / | \ / | \
309+ 1 * * * * * *
310+ / \ / \ / \ / \ / \ / \
311+ 0 o o o o o o o o o o o o
312+
313+ * - internal page
314+ o - leaf page
315+
316+ Internal pages that belong to certain levels have buffers associated with
317+ them. Leaf pages never have buffers. Which levels have buffers is controlled
318+ by "level step" parameter: level numbers that are multiples of level_step
319+ have buffers, while others do not. For example, if level_step = 2, then
320+ pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
321+ internal page has a buffer.
322+
323+ Level Tree (level_step = 1) Tree (level_step = 2)
324+
325+ 3 * *
326+ / \ / \
327+ 2 *(b) *(b) *(b) *(b)
328+ / | \ / | \ / | \ / | \
329+ 1 *(b) *(b) *(b) *(b) *(b) *(b) * * * * * *
330+ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
331+ 0 o o o o o o o o o o o o o o o o o o o o o o o o
332+
333+ (b) - buffer
334+
335+ Logically, a buffer is just bunch of tuples. Physically, it is divided in
336+ pages, backed by a temporary file. Each buffer can be in one of two states:
337+ a) Last page of the buffer is kept in main memory. A node buffer is
338+ automatically switched to this state when a new index tuple is added to it,
339+ or a tuple is removed from it.
340+ b) All pages of the buffer are swapped out to disk. When a buffer becomes too
341+ full, and we start to flush it, all other buffers are switched to this state.
342+
343+ When an index tuple is inserted, its initial processing can end in one of the
344+ following points:
345+ 1) Leaf page, if the depth of the index <= level_step, meaning that
346+ none of the internal pages have buffers associated with them.
347+ 2) Buffer of topmost level page that has buffers.
348+
349+ New index tuples are processed until one of the buffers in the topmost
350+ buffered level becomes half-full. When a buffer becomes half-full, it's added
351+ to the emptying queue, and will be emptied before a new tuple is processed.
352+
353+ Buffer emptying process means that index tuples from the buffer are moved
354+ into buffers at a lower level, or leaf pages. First, all the other buffers are
355+ swapped to disk to free up the memory. Then tuples are popped from the buffer
356+ one by one, and cascaded down the tree to the next buffer or leaf page below
357+ the buffered node.
358+
359+ Emptying a buffer has the interesting dynamic property that any intermediate
360+ pages between the buffer being emptied, and the next buffered or leaf level
361+ below it, become cached. If there are no more buffers below the node, the leaf
362+ pages where the tuples finally land on get cached too. If there are, the last
363+ buffer page of each buffer below is kept in memory. This is illustrated in
364+ the figures below:
365+
366+ Buffer being emptied to
367+ lower-level buffers Buffer being emptied to leaf pages
368+
369+ +(fb) +(fb)
370+ / \ / \
371+ + + + +
372+ / \ / \ / \ / \
373+ *(ab) *(ab) *(ab) *(ab) x x x x
374+
375+ + - cached internal page
376+ x - cached leaf page
377+ * - non-cached internal page
378+ (fb) - buffer being emptied
379+ (ab) - buffers being appended to, with last page in memory
380+
381+ In the beginning of the index build, the level-step is chosen so that all those
382+ pages involved in emptying one buffer fit in cache, so after each of those
383+ pages have been accessed once and cached, emptying a buffer doesn't involve
384+ any more I/O. This locality is where the speedup of the buffering algorithm
385+ comes from.
386+
387+ Emptying one buffer can fill up one or more of the lower-level buffers,
388+ triggering emptying of them as well. Whenever a buffer becomes too full, it's
389+ added to the emptying queue, and will be emptied after the current buffer has
390+ been processed.
391+
392+ To keep the size of each buffer limited even in the worst case, buffer emptying
393+ is scheduled as soon as a buffer becomes half-full, and emptying it continues
394+ until 1/2 of the nominal buffer size worth of tuples has been emptied. This
395+ guarantees that when buffer emptying begins, all the lower-level buffers
396+ are at most half-full. In the worst case that all the tuples are cascaded down
397+ to the same lower-level buffer, that buffer therefore has enough space to
398+ accommodate all the tuples emptied from the upper-level buffer. There is no
399+ hard size limit in any of the data structures used, though, so this only needs
400+ to be approximate; small overfilling of some buffers doesn't matter.
401+
402+ If an internal page that has a buffer associated with it is split, the buffer
403+ needs to be split too. All tuples in the buffer are scanned through and
404+ relocated to the correct sibling buffers, using the penalty function to decide
405+ which buffer each tuple should go to.
406+
407+ After all tuples from the heap have been processed, there are still some index
408+ tuples in the buffers. At this point, final buffer emptying starts. All buffers
409+ are emptied in top-down order. This is slightly complicated by the fact that
410+ new buffers can be allocated during the emptying, due to page splits. However,
411+ the new buffers will always be siblings of buffers that haven't been fully
412+ emptied yet; tuples never move upwards in the tree. The final emptying loops
413+ through buffers at a given level until all buffers at that level have been
414+ emptied, and then moves down to the next level.
415+
281416
282417Authors:
283418 Teodor Sigaev <teodor@sigaev.ru>
0 commit comments