
Buffering GiST index build algorithm.

When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to find the right leaf page for each tuple.

Alexander Korotkov
Heikki Linnakangas
2011-09-08 17:51:23 +03:00
parent 09b68c70af
commit 5edb24a898
11 changed files with 2297 additions and 186 deletions


@@ -24,6 +24,7 @@ The current implementation of GiST supports:
* provides NULL-safe interface to GiST core
* Concurrency
* Recovery support via WAL logging
* Buffering build algorithm
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
@@ -31,6 +32,12 @@ Marcel Kornacker:
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/access-methods-for-next-generation.pdf.gz
Buffering build algorithm for GiST was developed based on the paper "Efficient
Bulk Operations on Dynamic R-trees" by Lars Arge, Klaus Hinrichs, Jan Vahrenhold
and Jeffrey Scott Vitter.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.9894&rep=rep1&type=pdf
The original algorithms were modified in several ways:
* They had to be adapted to PostgreSQL conventions. For example, the SEARCH
@@ -278,6 +285,134 @@ would complicate the insertion algorithm. So when an insertion sees a page
with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
crashed in the middle to completion by adding the downlink in the parent.
Buffering build algorithm
-------------------------
In the buffering index build algorithm, some or all internal nodes have a
buffer attached to them. When a tuple is inserted at the top, the descent down
the tree is stopped as soon as a buffer is reached, and the tuple is pushed to
the buffer. When a buffer gets too full, all the tuples in it are flushed to
the lower level, where they again hit lower level buffers or leaf pages. This
makes the insertions happen in more of a breadth-first than depth-first order,
which greatly reduces the amount of random I/O required.
In the algorithm, levels are numbered so that leaf pages have level zero,
and internal node levels count up from 1. This numbering ensures that a page's
level number never changes, even when the root page is split.
Level Tree
3 *
/ \
2 * *
/ | \ / | \
1 * * * * * *
/ \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o
* - internal page
o - leaf page
Internal pages that belong to certain levels have buffers associated with
them. Leaf pages never have buffers. Which levels have buffers is controlled
by the "level step" parameter: level numbers that are multiples of level_step
have buffers, while others do not. For example, if level_step = 2, then
pages on levels 2, 4, 6, ... have buffers. If level_step = 1 then every
internal page has a buffer.
Level Tree (level_step = 1) Tree (level_step = 2)
3 * *
/ \ / \
2 *(b) *(b) *(b) *(b)
/ | \ / | \ / | \ / | \
1 *(b) *(b) *(b) *(b) *(b) *(b) * * * * * *
/ \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
0 o o o o o o o o o o o o o o o o o o o o o o o o
(b) - buffer
Logically, a buffer is just a bunch of tuples. Physically, it is divided into
pages, backed by a temporary file. Each buffer can be in one of two states:
a) Last page of the buffer is kept in main memory. A node buffer is
automatically switched to this state when a new index tuple is added to it,
or a tuple is removed from it.
b) All pages of the buffer are swapped out to disk. When a buffer becomes too
full, and we start to flush it, all other buffers are switched to this state.
When an index tuple is inserted, its initial processing can end in one of the
following points:
1) Leaf page, if the depth of the index <= level_step, meaning that
none of the internal pages have buffers associated with them.
2) Buffer of topmost level page that has buffers.
New index tuples are processed until one of the buffers in the topmost
buffered level becomes half-full. When a buffer becomes half-full, it's added
to the emptying queue, and will be emptied before a new tuple is processed.
The buffer emptying process moves index tuples from the buffer into buffers
at a lower level, or to leaf pages. First, all the other buffers are
swapped to disk to free up the memory. Then tuples are popped from the buffer
one by one, and cascaded down the tree to the next buffer or leaf page below
the buffered node.
Emptying a buffer has the interesting dynamic property that any intermediate
pages between the buffer being emptied, and the next buffered or leaf level
below it, become cached. If there are no more buffers below the node, the leaf
pages where the tuples finally land get cached too. If there are, the last
buffer page of each buffer below is kept in memory. This is illustrated in
the figures below:
Buffer being emptied to
lower-level buffers Buffer being emptied to leaf pages
+(fb) +(fb)
/ \ / \
+ + + +
/ \ / \ / \ / \
*(ab) *(ab) *(ab) *(ab) x x x x
+ - cached internal page
x - cached leaf page
* - non-cached internal page
(fb) - buffer being emptied
(ab) - buffers being appended to, with last page in memory
At the beginning of the index build, level_step is chosen so that all the
pages involved in emptying one buffer fit in cache, so after each of those
pages has been accessed once and cached, emptying a buffer doesn't involve
any more I/O. This locality is where the speedup of the buffering algorithm
comes from.
Emptying one buffer can fill up one or more of the lower-level buffers,
triggering emptying of them as well. Whenever a buffer becomes too full, it's
added to the emptying queue, and will be emptied after the current buffer has
been processed.
To keep the size of each buffer limited even in the worst case, buffer emptying
is scheduled as soon as a buffer becomes half-full, and emptying it continues
until 1/2 of the nominal buffer size worth of tuples has been emptied. This
guarantees that when buffer emptying begins, all the lower-level buffers
are at most half-full. Even in the worst case, where all the tuples are
cascaded down to the same lower-level buffer, that buffer still has enough
space to
accommodate all the tuples emptied from the upper-level buffer. There is no
hard size limit in any of the data structures used, though, so this only needs
to be approximate; small overfilling of some buffers doesn't matter.
If an internal page that has a buffer associated with it is split, the buffer
needs to be split too. All tuples in the buffer are scanned through and
relocated to the correct sibling buffers, using the penalty function to decide
which buffer each tuple should go to.
After all tuples from the heap have been processed, there are still some index
tuples in the buffers. At this point, final buffer emptying starts. All buffers
are emptied in top-down order. This is slightly complicated by the fact that
new buffers can be allocated during the emptying, due to page splits. However,
the new buffers will always be siblings of buffers that haven't been fully
emptied yet; tuples never move upwards in the tree. The final emptying loops
through buffers at a given level until all buffers at that level have been
emptied, and then moves down to the next level.
Authors:
Teodor Sigaev <teodor@sigaev.ru>