
Standardize cleanup lock terminology.

The term "super-exclusive lock" is a synonym for "buffer cleanup lock"
that first appeared in nbtree many years ago.  Standardize things by
consistently using the term cleanup lock.  This finishes work started by
commit 276db875.

There is no good reason to have two terms.  But there is a good reason
to only have one: to avoid confusion around why VACUUM acquires a full
cleanup lock (not just an ordinary exclusive lock) in index AMs, during
ambulkdelete calls.  This has nothing to do with protecting the physical
index data structure itself.  It is needed to implement a locking
protocol that ensures that TIDs pointing to the heap/table structure
cannot get marked for recycling by VACUUM before it is safe (which is
somewhat similar to how VACUUM uses cleanup locks during its first heap
pass).  Note that it isn't strictly necessary for index AMs to implement
this locking protocol -- several index AMs use an MVCC snapshot as their
sole interlock to prevent unsafe TID recycling.
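
To make the distinction concrete: below is a minimal C sketch (hypothetical function, not code from this commit) of an ambulkdelete-style page visit, using the buffer-manager primitives that back both lock strengths:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Hypothetical sketch -- not code from this commit.  Visit one leaf page
     * the way an ambulkdelete implementation must: with a full cleanup lock,
     * which waits until ours is the only pin on the buffer.
     */
    static void
    bulkdelete_style_page_visit(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        /*
         * An ordinary exclusive lock -- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE)
         * -- would protect the physical page, but a concurrent index scan
         * could still hold a pin while it visits the heap with TIDs read
         * from this page.  Waiting for the cleanup lock closes that window.
         */
        LockBufferForCleanup(buf);

        /* ... delete index tuples here; their TIDs may later be recycled ... */

        UnlockReleaseBuffer(buf);
    }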

In passing, update the nbtree README.  Cleanly separate discussion of
the aforementioned index vacuuming locking protocol from discussion of
the "drop leaf page pin" optimization added by commit 2ed5b87f.  We now
structure discussion of the latter by describing how individual index
scans may safely opt out of applying the standard locking protocol (and
so can avoid blocking progress by VACUUM).  Also document why the
optimization is not safe to apply during nbtree index-only scans.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzngHgQa92tz6NQihf4nxJwRzCV36yMJO_i8dS+2mgEVKw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkHPgsBBvGWjz=8PjNhDefy7XRkDKiT5NxMs-n5ZCf2dA@mail.gmail.com
Peter Geoghegan
2021-12-08 17:24:45 -08:00
parent 6f0e6ab04d
commit bcf60585e6
9 changed files with 107 additions and 95 deletions

View File

@@ -396,7 +396,7 @@ leafs. If first stage detects at least one empty page, then at the second stage
 ginScanToDelete() deletes empty pages.
 ginScanToDelete() traverses the whole tree in depth-first manner. It starts
-from the super-exclusive lock on the tree root. This lock prevents all the
+from the full cleanup lock on the tree root. This lock prevents all the
 concurrent insertions into this tree while we're deleting pages. However,
 there are still might be some in-progress readers, who traversed root before
 we locked it.
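
The root lock described here comes from the same buffer-manager primitive used by the other AMs; a simplified sketch (hypothetical helper name; see ginvacuum.c for the real code):

    #include "postgres.h"
    #include "access/gin_private.h"
    #include "storage/bufmgr.h"

    /* Simplified sketch: block concurrent insertions before page deletion. */
    static Buffer
    lock_gin_root_for_cleanup(Relation index)
    {
        Buffer      rootbuf = ReadBuffer(index, GIN_ROOT_BLKNO);

        /* Wait until no other backend pins the root, then lock it. */
        LockBufferForCleanup(rootbuf);
        return rootbuf;
    }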

View File

@@ -8448,7 +8448,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 /*
  * Handles XLOG_HEAP2_PRUNE record type.
  *
- * Acquires a super-exclusive lock.
+ * Acquires a full cleanup lock.
  */
 static void
 heap_xlog_prune(XLogReaderState *record)
@@ -8534,7 +8534,7 @@ heap_xlog_prune(XLogReaderState *record)
 /*
  * Handles XLOG_HEAP2_VACUUM record type.
  *
- * Acquires an exclusive lock only.
+ * Acquires an ordinary exclusive lock only.
  */
 static void
 heap_xlog_vacuum(XLogReaderState *record)
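
Both redo routines obtain their lock through XLogReadBufferForRedoExtended(), whose get_cleanup_lock argument selects the lock strength. A condensed sketch of the pattern (assuming the PG14-era xlogutils API; redo bodies elided):

    #include "postgres.h"
    #include "access/xlogutils.h"
    #include "storage/bufmgr.h"

    static void
    heap_xlog_prune_sketch(XLogReaderState *record)
    {
        Buffer      buffer;

        /* true => redo takes a full cleanup lock on the target page */
        if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                          &buffer) == BLK_NEEDS_REDO)
        {
            /* ... redo pruning: defragment the page ... */
        }
        if (BufferIsValid(buffer))
            UnlockReleaseBuffer(buffer);
    }

    static void
    heap_xlog_vacuum_sketch(XLogReaderState *record)
    {
        Buffer      buffer;

        /* false => an ordinary exclusive lock suffices for this record */
        if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
                                          &buffer) == BLK_NEEDS_REDO)
        {
            /* ... redo vacuum: set LP_DEAD items LP_UNUSED ... */
        }
        if (BufferIsValid(buffer))
            UnlockReleaseBuffer(buffer);
    }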

View File

@@ -844,7 +844,7 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has a super-exclusive lock on the
+ * It is expected that the caller has a full cleanup lock on the
  * buffer.
  */
 void

View File

@@ -166,53 +166,40 @@ that the incoming item doesn't fit on the split page where it needs to go!
 Deleting index tuples during VACUUM
 -----------------------------------
 
-Before deleting a leaf item, we get a super-exclusive lock on the target
+Before deleting a leaf item, we get a full cleanup lock on the target
 page, so that no other backend has a pin on the page when the deletion
 starts.  This is not necessary for correctness in terms of the btree index
 operations themselves; as explained above, index scans logically stop
 "between" pages and so can't lose their place.  The reason we do it is to
-provide an interlock between VACUUM and indexscans.  Since VACUUM deletes
-index entries before reclaiming heap tuple line pointers, the
-super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
-line pointer that an indexscanning process might be about to visit.  This
-guarantee works only for simple indexscans that visit the heap in sync
-with the index scan, not for bitmap scans.  We only need the guarantee
-when using non-MVCC snapshot rules; when using an MVCC snapshot, it
-doesn't matter if the heap tuple is replaced with an unrelated tuple at
-the same TID, because the new tuple won't be visible to our scan anyway.
-Therefore, a scan using an MVCC snapshot which has no other confounding
-factors will not hold the pin after the page contents are read.  The
-current reasons for exceptions, where a pin is still needed, are if the
-index is not WAL-logged or if the scan is an index-only scan.  If later
-work allows the pin to be dropped for all cases we will be able to
-simplify the vacuum code, since the concept of a super-exclusive lock
-for btree indexes will no longer be needed.
+provide an interlock between VACUUM and index scans that are not prepared
+to deal with concurrent TID recycling when visiting the heap.  Since only
+VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since
+this only ever happens _after_ btbulkdelete returns, having index scans
+hold on to the pin (used when reading from the leaf page) until _after_
+they're done visiting the heap (for TIDs from pinned leaf page) prevents
+concurrent TID recycling.  VACUUM cannot get a conflicting cleanup lock
+until the index scan is totally finished processing its leaf page.
+
+This approach is fairly coarse, so we avoid it whenever possible.  In
+practice most index scans won't hold onto their pin, and so won't block
+VACUUM.  These index scans must deal with TID recycling directly, which is
+more complicated and not always possible.  See later section on making
+concurrent TID recycling safe.
+
+Opportunistic index tuple deletion performs almost the same page-level
+modifications while only holding an exclusive lock.  This is safe because
+there is no question of TID recycling taking place later on -- only VACUUM
+can make TIDs recyclable.  See also simple deletion and bottom-up
+deletion, below.
 
 Because a pin is not always held, and a page can be split even while
 someone does hold a pin on it, it is possible that an indexscan will
 return items that are no longer stored on the page it has a pin on, but
 rather somewhere to the right of that page.  To ensure that VACUUM can't
-prematurely remove such heap tuples, we require btbulkdelete to obtain a
-super-exclusive lock on every leaf page in the index, even pages that
-don't contain any deletable tuples.  Any scan which could yield incorrect
-results if the tuple at a TID matching the scan's range and filter
-conditions were replaced by a different tuple while the scan is in
-progress must hold the pin on each index page until all index entries read
-from the page have been processed.  This guarantees that the btbulkdelete
-call cannot return while any indexscan is still holding a copy of a
-deleted index tuple if the scan could be confused by that.  Note that this
-requirement does not say that btbulkdelete must visit the pages in any
-particular order.  (See also simple deletion and bottom-up deletion,
-below.)
-
-There is no such interlocking for deletion of items in internal pages,
-since backends keep no lock nor pin on a page they have descended past.
-Hence, when a backend is ascending the tree using its stack, it must
-be prepared for the possibility that the item it wants is to the left of
-the recorded position (but it can't have moved left out of the recorded
-page).  Since we hold a lock on the lower page (per L&Y) until we have
-re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.
+prematurely make TIDs recyclable in this scenario, we require btbulkdelete
+to obtain a cleanup lock on every leaf page in the index, even pages that
+don't contain any deletable tuples.  Note that this requirement does not
+say that btbulkdelete must visit the pages in any particular order.
 
 VACUUM's linear scan, concurrent page splits
 --------------------------------------------
@@ -453,6 +440,55 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
 page's contents will be overwritten by the split operation (it will become
 the new right sibling page).
 
+Making concurrent TID recycling safe
+------------------------------------
+
+As explained in the earlier section about deleting index tuples during
+VACUUM, we implement a locking protocol that allows individual index scans
+to avoid concurrent TID recycling.  Index scans opt-out (and so drop their
+leaf page pin when visiting the heap) whenever it's safe to do so, though.
+Dropping the pin early is useful because it avoids blocking progress by
+VACUUM.  This is particularly important with index scans used by cursors,
+since idle cursors sometimes stop for relatively long periods of time.  In
+extreme cases, a client application may hold on to an idle cursor for
+hours or even days.  Blocking VACUUM for that long could be disastrous.
+
+Index scans that don't hold on to a buffer pin are protected by holding an
+MVCC snapshot instead.  This more limited interlock prevents wrong answers
+to queries, but it does not prevent concurrent TID recycling itself (only
+holding onto the leaf page pin while accessing the heap ensures that).
+
+Index-only scans can never drop their buffer pin, since they are unable to
+tolerate having a referenced TID become recyclable.  Index-only scans
+typically just visit the visibility map (not the heap proper), and so will
+not reliably notice that any stale TID reference (for a TID that pointed
+to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
+the heap by VACUUM.  This could easily allow VACUUM to set the whole heap
+page to all-visible in the visibility map immediately afterwards.  An MVCC
+snapshot is only sufficient to avoid problems during plain index scans
+because they must access granular visibility information from the heap
+proper.  A plain index scan will even recognize LP_UNUSED items in the
+heap (items that could be recycled but haven't been just yet) as "not
+visible" -- even when the heap page is generally considered all-visible.
+
+LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+(described in full in simple deletion, below) is also more complicated for
+index scans that drop their leaf page pins.  We must be careful to avoid
+LP_DEAD-marking any new index tuple that looks like a known-dead index
+tuple because it happens to share the same TID, following concurrent TID
+recycling.  It's just about possible that some other session inserted a
+new, unrelated index tuple, on the same leaf page, which has the same
+original TID.  It would be totally wrong to LP_DEAD-set this new,
+unrelated index tuple.
+
+We handle this kill_prior_tuple race condition by having affected index
+scans conservatively assume that any change to the leaf page at all
+implies that it was reached by btbulkdelete in the interim period when no
+buffer pin was held.  This is implemented by not setting any LP_DEAD bits
+on the leaf page at all when the page's LSN has changed.  (That won't work
+with an unlogged index, so for now we don't ever apply the "don't hold
+onto pin" optimization there.)
+
 Fastpath For Index Insertion
 ----------------------------
@@ -518,22 +554,6 @@ that's required for the deletion process to perform granular removal of
 groups of dead TIDs from posting list tuples (without the situation ever
 being allowed to get out of hand).
 
-It's sufficient to have an exclusive lock on the index page, not a
-super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
-that this breaks the interlock between VACUUM and indexscans, but that is
-not so: as long as an indexscanning process has a pin on the page where
-the index item used to be, VACUUM cannot complete its btbulkdelete scan
-and so cannot remove the heap tuple.  This is another reason why
-btbulkdelete has to get a super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
-
-LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
-it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
-drops its pin in the meantime.  It must conservatively also remember the
-LSN of the page, and only act to set LP_DEAD bits when the LSN has not
-changed at all.  (Avoiding dropping the pin entirely also makes it safe, of
-course.)
-
 Bottom-Up deletion
 ------------------
@@ -733,23 +753,21 @@ because it allows running applications to continue while the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-non-MVCC scans is not required on standby nodes.  We still get a
-super-exclusive lock ("cleanup lock") when replaying VACUUM records
-during recovery, but recovery does not need to lock every leaf page
-(only those leaf pages that have items to delete).  That is safe because
-HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
-HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
-used during write transactions, which cannot exist on the standby.  MVCC
-scans are already protected by definition, so HeapTupleSatisfiesMVCC()
-is not a problem.  The optimizer looks at the boundaries of value ranges
-using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
-is also safe.  That leaves concern only for HeapTupleSatisfiesToast().
+non-MVCC scans is not required on standby nodes.  We still get a full
+cleanup lock when replaying VACUUM records during recovery, but recovery
+does not need to lock every leaf page (only those leaf pages that have
+items to delete) -- that's sufficient to avoid breaking index-only scans
+during recovery (see section above about making TID recycling safe).  That
+leaves concern only for plain index scans.  (XXX: Not actually clear why
+this is totally unnecessary during recovery.)
 
-HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
-because it doesn't need to - if the main heap row is visible then the
-toast rows will also be visible.  So as long as we follow a toast
-pointer from a visible (live) tuple the corresponding toast rows
-will also be visible, so we do not need to recheck MVCC on them.
+MVCC snapshot plain index scans are always safe, for the same reasons that
+they're safe during original execution.  HeapTupleSatisfiesToast() doesn't
+use MVCC semantics, though that's because it doesn't need to - if the main
+heap row is visible then the toast rows will also be visible.  So as long
+as we follow a toast pointer from a visible (live) tuple the corresponding
+toast rows will also be visible, so we do not need to recheck MVCC on
+them.
 
 Other Things That Are Handy to Know
 -----------------------------------
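
The LSN-based defense described in the new "Making concurrent TID recycling safe" section boils down to a small predicate. A hypothetical sketch (names invented; the real logic lives in _bt_killitems()):

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufmgr.h"

    static bool
    leaf_page_safe_for_lp_dead(Buffer buf, XLogRecPtr lsn_when_read,
                               bool pin_held_throughout)
    {
        /*
         * A continuously-held pin blocks btbulkdelete outright, so no TID
         * from this page can have been recycled in the meantime.
         */
        if (pin_held_throughout)
            return true;

        /*
         * Pin was dropped: treat any page modification (detected via the
         * page LSN) as possible TID recycling, and refuse to set LP_DEAD
         * bits.  This is why the optimization is disabled for unlogged
         * indexes, whose pages have no usable LSN.
         */
        return BufferGetLSNAtomic(buf) == lsn_when_read;
    }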

View File

@@ -1115,8 +1115,7 @@ _bt_conditionallockbuf(Relation rel, Buffer buf)
 }
 
 /*
- * _bt_upgradelockbufcleanup() -- upgrade lock to super-exclusive/cleanup
- *	lock.
+ * _bt_upgradelockbufcleanup() -- upgrade lock to a full cleanup lock.
  */
 void
 _bt_upgradelockbufcleanup(Relation rel, Buffer buf)
@@ -1147,7 +1146,7 @@ _bt_pageinit(Page page, Size size)
 /*
  * Delete item(s) from a btree leaf page during VACUUM.
  *
- * This routine assumes that the caller has a super-exclusive write lock on
+ * This routine assumes that the caller already has a full cleanup lock on
  * the buffer.  Also, the given deletable and updatable arrays *must* be
  * sorted in ascending order.
 *

View File

@@ -1161,9 +1161,9 @@ backtrack:
 		nhtidslive;
 
 	/*
-	 * Trade in the initial read lock for a super-exclusive write lock on
-	 * this page.  We must get such a lock on every leaf page over the
-	 * course of the vacuum scan, whether or not it actually contains any
+	 * Trade in the initial read lock for a full cleanup lock on this
+	 * page.  We must get such a lock on every leaf page over the course
+	 * of the vacuum scan, whether or not it actually contains any
	 * deletable tuples --- see nbtree/README.
	 */
	_bt_upgradelockbufcleanup(rel, buf);
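
The upgrade helper itself amounts to trading the shared lock for a cleanup lock. Roughly (a sketch of the idea, not the verbatim function body):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    upgrade_read_lock_to_cleanup_lock(Buffer buf)
    {
        /* Cleanup locks can't be taken on top of a held lock; drop it. */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);

        /*
         * Wait for all other pins to go away, then take an exclusive lock.
         * Another backend may have locked the page in the window where we
         * held no lock, so the caller must re-examine the page contents.
         */
        LockBufferForCleanup(buf);
    }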

View File

@@ -50,16 +50,11 @@ static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 /*
  * _bt_drop_lock_and_maybe_pin()
  *
- * Unlock the buffer; and if it is safe to release the pin, do that, too.  It
- * is safe if the scan is using an MVCC snapshot and the index is WAL-logged.
+ * Unlock the buffer; and if it is safe to release the pin, do that, too.
  * This will prevent vacuum from stalling in a blocked state trying to read a
- * page when a cursor is sitting on it -- at least in many important cases.
+ * page when a cursor is sitting on it.
  *
- * Set the buffer to invalid if the pin is released, since the buffer may be
- * re-used.  If we need to go back to this block (for example, to apply
- * LP_DEAD hints) we must get a fresh reference to the buffer.  Hopefully it
- * will remain in shared memory for as long as it takes to scan the index
- * buffer page.
+ * See nbtree/README section on making concurrent TID recycling safe.
  */
 static void
 _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
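
For reference, the conditions under which the pin can actually be released look roughly like this in PG14-era sources (a paraphrase of the real nbtsearch.c function, with a simplified signature):

    #include "postgres.h"
    #include "access/genam.h"
    #include "access/relscan.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"
    #include "utils/snapmgr.h"

    static void
    drop_lock_and_maybe_pin_sketch(IndexScanDesc scan, Buffer *buf)
    {
        LockBuffer(*buf, BUFFER_LOCK_UNLOCK);

        if (IsMVCCSnapshot(scan->xs_snapshot) &&      /* MVCC interlock available */
            RelationNeedsWAL(scan->indexRelation) &&  /* page LSN is usable */
            !scan->xs_want_itup)                      /* not an index-only scan */
        {
            /* Safe to let VACUUM proceed; LSN recheck covers LP_DEAD bits. */
            ReleaseBuffer(*buf);
            *buf = InvalidBuffer;
        }
    }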

View File

@@ -701,7 +701,7 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  * there is, in general, a good chance that even large groups of unused line
  * pointers that we see here will be recycled quickly.
  *
- * Caller had better have a super-exclusive lock on page's buffer.  As a side
+ * Caller had better have a full cleanup lock on page's buffer.  As a side
  * effect the page's PD_HAS_FREE_LINES hint bit will be set or unset as
  * needed.
  */
@@ -820,9 +820,9 @@ PageRepairFragmentation(Page page)
  * arbitrary, but it seems like a good idea to avoid leaving a PageIsEmpty()
  * page behind.
  *
- * Caller can have either an exclusive lock or a super-exclusive lock on
- * page's buffer.  The page's PD_HAS_FREE_LINES hint bit will be set or unset
- * based on whether or not we leave behind any remaining LP_UNUSED items.
+ * Caller can have either an exclusive lock or a full cleanup lock on page's
+ * buffer.  The page's PD_HAS_FREE_LINES hint bit will be set or unset based
+ * on whether or not we leave behind any remaining LP_UNUSED items.
  */
 void
 PageTruncateLinePointerArray(Page page)
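
The asymmetry follows from what each routine does: PageRepairFragmentation() moves tuple data around, so no concurrent pin may remain, while PageTruncateLinePointerArray() only shrinks the line pointer array. A condensed sketch of the callers' locking expectations (hypothetical wrappers; real callers also dirty the buffer and write WAL):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /* Pruning defragments the whole page: full cleanup lock required. */
    static void
    prune_page_sketch(Buffer buf)
    {
        LockBufferForCleanup(buf);
        PageRepairFragmentation(BufferGetPage(buf));
        UnlockReleaseBuffer(buf);
    }

    /* VACUUM's second pass only shrinks the line pointer array. */
    static void
    truncate_lp_array_sketch(Buffer buf)
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        PageTruncateLinePointerArray(BufferGetPage(buf));
        UnlockReleaseBuffer(buf);
    }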

View File

@@ -238,7 +238,7 @@ typedef struct xl_heap_update
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
  *
- * Requires a super-exclusive lock.
+ * Acquires a full cleanup lock.
  */
 typedef struct xl_heap_prune
 {
@@ -252,9 +252,9 @@ typedef struct xl_heap_prune
 /*
  * The vacuum page record is similar to the prune record, but can only mark
- * already dead items as unused
+ * already LP_DEAD items LP_UNUSED (during VACUUM's second heap pass)
  *
- * Used by heap vacuuming only.  Does not require a super-exclusive lock.
+ * Acquires an ordinary exclusive lock only.
  */
 typedef struct xl_heap_vacuum
 {