mirror of https://github.com/postgres/postgres.git
synced 2025-05-29 16:21:20 +03:00
Revise GIN README
We find GIN concurrency bugs from time to time. One of the problems here is that GIN concurrency isn't well documented in the README. So, it can be hard even to distinguish design bugs from implementation bugs.

This commit revises the concurrency section of the GIN README, providing more details. Some examples are illustrated in ASCII art. Also, this commit adds an explanation of how the tuple layout in internal GIN B-tree pages differs from that of nbtree.

Discussion: https://postgr.es/m/CAPpHfduXR_ywyaVN4%2BOYEGaw%3DcPLzWX6RxYLBncKw8de9vOkqw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 9.4
parent 99f5888d35
commit 84dcf52359
@@ -215,6 +215,35 @@ fit on one pending-list page must have those pages to itself, even if this
results in wasting much of the space on the preceding page and the last
page for the tuple.)

GIN packs downlinks and pivot keys into internal page tuples in a different way
than nbtree does. Lehman & Yao define the layout as follows.

P_0, K_1, P_1, K_2, P_2, ... , K_n, P_n, K_{n+1}

Here P_i is a downlink and K_i is a key. K_i splits the key space between
P_{i-1} and P_i (1 <= i <= n). K_{n+1} is the high key.

In an internal page tuple, a key and a downlink are grouped together. nbtree
packs keys and downlinks into tuples as follows.

(K_{n+1}, None), (-Inf, P_0), (K_1, P_1), ... , (K_n, P_n)

Here tuples are shown in parentheses. So, the high key is stored separately.
P_i is grouped with K_i. P_0 is grouped with the -Inf key.

GIN packs keys and downlinks into tuples in a different way.

(P_0, K_1), (P_1, K_2), ... , (P_n, K_{n+1})

P_i is grouped with K_{i+1}. The -Inf key is not needed.

There are a couple of additional notes regarding the K_{n+1} key.
1) In the rightmost page of the entry tree, the key coupled with P_n doesn't
really matter. The high key is assumed to be infinity.
2) In a posting tree, the key coupled with P_n never matters. The high key for
non-rightmost pages is stored separately and accessed via
GinDataPageGetRightBound().
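
The two packings described above can be compared in a small Python sketch. It
is illustrative only: the helper names (pack_nbtree, pack_gin,
choose_downlink_gin) and the plain values standing in for keys and downlinks
are assumptions made for this example, not PostgreSQL code.

```python
from bisect import bisect_left

NEG_INF = "-Inf"  # placeholder for nbtree's "minus infinity" key


def pack_nbtree(downlinks, keys):
    """downlinks = [P_0 .. P_n], keys = [K_1 .. K_{n+1}]; high key tuple first."""
    assert len(downlinks) == len(keys)
    pivots = [NEG_INF] + keys[:-1]  # -Inf, K_1 .. K_n
    return [(keys[-1], None)] + list(zip(pivots, downlinks))


def pack_gin(downlinks, keys):
    """Same inputs; P_i is simply paired with K_{i+1}, so no -Inf key is needed."""
    assert len(downlinks) == len(keys)
    return list(zip(downlinks, keys))


def choose_downlink_gin(tuples, search_key, rightmost=False):
    """Follow the first tuple whose key is >= search_key (hypothetical helper)."""
    keys = [key for _, key in tuples]
    idx = bisect_left(keys, search_key)
    if idx == len(keys):
        if not rightmost:
            raise LookupError("key is beyond this page's last key: step right")
        idx = len(keys) - 1  # on the rightmost page the last key acts as infinity
    return tuples[idx][0]
```

In this sketch the last GIN tuple on a rightmost page is followed for any key
larger than all stored keys, which mirrors note 1 above.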

Posting tree
------------

@@ -277,50 +306,148 @@ followed by the packed items.
 Concurrency
 -----------

-The entry tree and each posting tree is a B-tree, with right-links connecting
-sibling pages at the same level. This is the same structure that is used in
+The entry tree and each posting tree are B-trees, with right-links connecting
+sibling pages at the same level. This is the same structure that is used in
 the regular B-tree indexam (invented by Lehman & Yao), but we don't support
-scanning a GIN trees backwards, so we don't need left-links.
+scanning a GIN trees backwards, so we don't need left-links. The entry tree
+leaves don't have dedicated high keys, instead greatest leaf tuple serves as
+high key. That works because tuples are never deleted from the entry tree.

-To avoid deadlocks, B-tree pages must always be locked in the same order:
-left to right, and bottom to top. When searching, the tree is traversed from
-top to bottom, so the lock on the parent page must be released before
-descending to the next level. Concurrent page splits move the keyspace to
-right, so after following a downlink, the page actually containing the key
-we're looking for might be somewhere to the right of the page we landed on.
-In that case, we follow the right-links until we find the page we're looking
-for.
+The algorithms used to operate entry and posting trees are considered below.

-To delete a page, the page's left sibling, the target page, and its parent,
-are locked in that order, and the page is marked as deleted. However, a
-concurrent search might already have read a pointer to the page, and might be
-just about to follow it. A page can be reached via the right-link of its left
-sibling, or via its downlink in the parent.
+### Locating the leaf page

-To prevent a backend from reaching a deleted page via a right-link, when
-following a right-link the lock on the previous page is not released until
-the lock on next page has been acquired.
+When we search for leaf page in GIN btree to perform a read, we descend from
+the root page to the leaf through using downlinks taking pin and shared lock on
+one page at once. So, we release pin and shared lock on previous page before
+getting them on the next page.

-The downlink is more tricky. A search descending the tree must release the
-lock on the parent page before locking the child, or it could deadlock with
-a concurrent split of the child page; a page split locks the parent, while
-already holding a lock on the child page. So, deleted page cannot be reclaimed
-immediately. Instead, we have to wait for every transaction, which might wait
-to reference this page, to finish. Corresponding processes must observe that
The picture below shows the tree state after finding the leaf page. Lower case
letters depict tree pages. 'S' depicts a shared lock on the page.

              a
          /   |   \
      b       c       d
    / | \     | \     | \
  eS  f   g   h   i   j   k

### Stepping right

Concurrent page splits move the keyspace to the right, so after following a
downlink, the page actually containing the key we're looking for might be
somewhere to the right of the page we landed on. In that case, we follow the
right-links until we find the page we're looking for.

While stepping right, we take a pin and shared lock on the right sibling before
releasing them on the current page. This mechanism is designed to protect
against stepping to a deleted page. We step to the right sibling while holding
a lock on the page whose rightlink points there. So, it's guaranteed that nobody
updates the rightlink concurrently, and hence nobody deletes the right sibling.

The picture below shows two pages locked at once while stepping right.

              a
          /   |   \
      b       c       d
    / | \     | \     | \
  eS  fS  g   h   i   j   k
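
The lock coupling used while stepping right can be sketched in Python as
follows. This is a simplified model, not the GIN implementation: the Page
class, the high_key field and step_right() are hypothetical names, and a plain
mutex stands in for a shared buffer lock plus pin.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    name: str
    high_key: float  # largest key the page may hold (inf on the rightmost page)
    right: Optional["Page"] = None
    lock: threading.Lock = field(default_factory=threading.Lock)


def step_right(page, key, trace):
    """Called with page.lock held; returns the locked page that covers key."""
    while key > page.high_key and page.right is not None:
        sibling = page.right
        sibling.lock.acquire()  # lock the right sibling first ...
        trace.append((page.name, sibling.name))  # both pages are locked here
        page.lock.release()  # ... and only then release the current page
        page = sibling
    return page
```

Because the current page stays locked until the sibling's lock is held, nothing
in this model can change the rightlink between reading it and arriving at the
sibling.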

### Insert

While finding the appropriate leaf for insertion, we also descend from the root
to the leaf, taking a shared lock on one page at a time. But during insertion
we don't release pins on the root and internal pages. That can save some lookups
in the buffer hash table when inserting downlinks, assuming the parents are not
changed by concurrent splits. Once we reach the leaf, we re-lock the page in
exclusive mode.

The picture below shows the leaf page locked in exclusive mode and ready for
insertion. 'P' and 'E' depict a pin and an exclusive lock, respectively.


              aP
          /   |   \
      b       cP      d
    / | \     | \     | \
  e   f   g   hE  i   j   k


If the insert causes a page split, the parent is locked in exclusive mode before
unlocking the left child. So, the insertion algorithm can hold exclusive locks
on both the parent and child pages at once, starting from the child.

The picture below shows the tree state after a leaf page split. 'q' is the new
page produced by the split. Parent 'c' is about to have the downlink inserted.

                  aP
            /     |   \
      b           cE      d
    / | \       / | \     | \
  e   f   g   hE  q   i   j   k
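
The locking order after a leaf split can be sketched as below. The sketch only
models the ordering stated above (parent locked before the left child is
unlocked). The Page class and finish_leaf_split() are hypothetical names, a
plain mutex stands in for an exclusive buffer lock, and the order in which the
two locks are finally released is an assumption of this example.

```python
import threading


class Page:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # stand-in for an exclusive buffer lock
        self.downlinks = []


def finish_leaf_split(leaf, parent, new_page_name, log):
    """Caller already holds leaf.lock and has split the leaf."""
    parent.lock.acquire()  # lock the parent first ...
    log.append(("lock", parent.name))
    parent.downlinks.append(new_page_name)  # leaf and parent are both locked here
    log.append(("downlink", new_page_name))
    leaf.lock.release()  # ... and only then unlock the left child
    log.append(("unlock", leaf.name))
    parent.lock.release()
    log.append(("unlock", parent.name))
```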

### Page deletion

Vacuum never deletes tuples or pages from the entry tree. It traverses entry
tree leaves in logical order by rightlinks and removes deletable TIDs from
posting lists. Posting trees are reached via links from entry tree leaves. They
are vacuumed in two stages.

In the first stage, ginVacuumPostingTreeLeaves() removes deletable TIDs from
leaves. It traverses the whole tree in depth-first manner. It starts by taking
a super-exclusive lock on the tree root. This lock prevents all concurrent
insertions into this tree while we're deleting pages. However, there might
still be some in-progress readers that traversed the root before we locked it.

The picture below shows the tree while deletable TIDs are being removed from
the leftmost tree leaf.

              aE
          /   |   \
      bE      c       d
    / | \     | \     | \
  eE  f   g   h   i   j   k

ginVacuumPostingTreeLeaves() keeps exclusive locks on the pages comprising the
currently investigated path.

If the first stage detects at least one empty page, then in the second stage
ginScanToDelete() deletes empty pages. ginScanToDelete() keeps the lock on the
root acquired by ginVacuumPostingTreeLeaves(). It also traverses the whole tree
in depth-first manner, but keeps one page exclusively locked at a time. That's
safe because the root lock guarantees there are no concurrent page
modifications. When a page is about to be deleted, pages are relocked in the
following order: left, deletable, parent. This order guarantees no deadlock
with concurrent stepping right.

The picture below shows the tree state before deletion of page 'g'.

              aE
          /   |   \
      bE      c       d
    / | \     | \     | \
  e   fE  gE  h   i   j   k
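
One way to see why the relocking order (left, deletable, parent) cannot
deadlock with stepping right is to check each lock sequence against a single
global order: left to right within a level, and bottom to top across levels.
The sketch below is an assumption-laden model: pages are (level, position)
pairs with level 0 for leaves, follows_lock_order() is a hypothetical helper,
and the root lock that vacuum holds throughout is left out of the model.

```python
from typing import List, Tuple

# (level, position): level 0 is the leaf level, position counts left to right
Page = Tuple[int, int]


def follows_lock_order(acquired: List[Page]) -> bool:
    """True if every lock is taken after all lower or further-left pages held."""
    return all(a < b for a, b in zip(acquired, acquired[1:]))
```

With leaves e..k at positions 0..6 and internal pages b, c, d at positions
0..2, deleting 'g' locks f, g, b, which is (0, 1), (0, 2), (1, 0), and stepping
right from e to f locks (0, 0), (0, 1). Both sequences are increasing in this
order, so neither can wait on the other in a cycle.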

A search concurrent with page deletion might already have read a pointer to the
page to be deleted, and might be just about to follow it. A page can be reached
via the right-link of its left sibling, or via its downlink in the parent.

To prevent a backend from reaching a deleted page via a right-link, the
stepping-right algorithm doesn't release the lock on the current page until the
lock on the right page is acquired.

The downlink is more tricky. A search descending the tree must release the lock
on the parent page before locking the child, or it could deadlock with a
concurrent split of the child page; a page split locks the parent, while already
holding a lock on the child page. So, a deleted page cannot be reclaimed
immediately. Instead, we have to wait for every transaction that might still
reference this page to finish. Corresponding processes must observe that
the page is marked deleted and recover accordingly.

The previous paragraph's reasoning only applies to searches, and only to
posting trees. To protect from inserters following a downlink to a deleted
page, vacuum simply locks out all concurrent insertions to the posting tree,
by holding a super-exclusive lock on the posting tree root. Inserters hold a
pin on the root page, but searches do not, so although new searches cannot
begin while the root page is locked, any already-in-progress scans can continue
concurrently with vacuum. In the entry tree, we never delete pages.

(This is quite different from the mechanism the btree indexam uses to make
page-deletions safe; it stamps the deleted pages with an XID and keeps the
deleted pages around with the right-link intact until all concurrent scans
have finished.)
During the replay of page deletion on a standby, the page's left sibling, the
target page, and its parent are locked in that order. This order guarantees
no deadlock with concurrent reads.

Compatibility
-------------