mirror of
https://github.com/postgres/postgres.git
synced 2025-11-04 20:11:56 +03:00
Enhance nbtree index tuple deletion.

Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions. This is "bottom-up
deletion". Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page. This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates. It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_. Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index. The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.
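
A minimal sketch of where a bottom-up deletion pass could sit among the options available when a leaf page cannot fit an incoming tuple. It is not taken from the patch: the type and function names are hypothetical, the "bytes freed" fields stand in for work that really happens against the page and the table, and the exact ordering of the fallbacks is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels; none of these names exist in nbtree. */
typedef enum FullPageOutcome
{
    AVOIDED_BY_SIMPLE_DELETION,
    AVOIDED_BY_BOTTOM_UP_DELETION,
    AVOIDED_BY_DEDUPLICATION,
    PAGE_SPLIT
} FullPageOutcome;

/* Toy description of a full leaf page and what each strategy would free. */
typedef struct FullLeafPage
{
    bool   has_lp_dead_items;      /* some tuples are already LP_DEAD-marked */
    bool   index_unchanged_hint;   /* executor passed the "logically unchanged" hint */
    bool   dedup_enabled;          /* index makes use of deduplication */
    size_t simple_deletion_frees;  /* bytes freed if simple deletion runs */
    size_t bottom_up_frees;        /* bytes freed if a bottom-up pass runs */
    size_t dedup_frees;            /* bytes freed if deduplication runs */
} FullLeafPage;

/*
 * Try the cheaper strategies first and only split the page when none of
 * them frees enough space.  The ordering is an assumption of this sketch.
 */
FullPageOutcome
handle_full_leaf_page(const FullLeafPage *page, size_t needed)
{
    size_t freed = 0;

    assert(page != NULL);

    if (page->has_lp_dead_items)
    {
        freed += page->simple_deletion_frees;
        if (freed >= needed)
            return AVOIDED_BY_SIMPLE_DELETION;
    }

    /* Bottom-up deletion is only attempted lazily, when hinted. */
    if (page->index_unchanged_hint)
    {
        freed += page->bottom_up_frees;
        if (freed >= needed)
            return AVOIDED_BY_BOTTOM_UP_DELETION;
    }

    if (page->dedup_enabled)
    {
        freed += page->dedup_frees;
        if (freed >= needed)
            return AVOIDED_BY_DEDUPLICATION;
    }

    return PAGE_SPLIT;
}
```

The point of the sketch is only that a version-driven page split becomes the last resort rather than the first response.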

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs. That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing. This increases the number of index tuples deleted
significantly in many cases. The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples. We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).
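
The "extra tuples" idea can be sketched as follows, assuming a simplified page model. `ToyIndexTuple`, `choose_simple_deletion_candidates()` and the fixed `TOY_MAX_TUPLES` bound are hypothetical names made up for this illustration; the real code works on index pages and TIDs rather than plain arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TUPLES 64       /* arbitrary bound for this sketch */

/* Hypothetical, simplified view of one leaf page index tuple. */
typedef struct ToyIndexTuple
{
    uint32_t table_block;       /* block number from the tuple's table TID */
    bool     lp_dead;           /* LP_DEAD bit already set? */
} ToyIndexTuple;

/* Linear search is fine for a page-sized toy example. */
static bool
block_is_listed(const uint32_t *blocks, size_t nblocks, uint32_t block)
{
    for (size_t i = 0; i < nblocks; i++)
        if (blocks[i] == block)
            return true;
    return false;
}

/*
 * Sets candidate[i] for every tuple worth checking against the table and
 * returns how many candidates there are.
 */
size_t
choose_simple_deletion_candidates(const ToyIndexTuple *tuples, size_t ntuples,
                                  bool *candidate)
{
    uint32_t visit[TOY_MAX_TUPLES];
    size_t   nvisit = 0;
    size_t   ncandidates = 0;

    assert(ntuples <= TOY_MAX_TUPLES);

    /* Pass 1: LP_DEAD-marked tuples decide which table blocks get visited. */
    for (size_t i = 0; i < ntuples; i++)
        if (tuples[i].lp_dead &&
            !block_is_listed(visit, nvisit, tuples[i].table_block))
            visit[nvisit++] = tuples[i].table_block;

    /*
     * Pass 2: any tuple pointing into one of those blocks can be checked
     * in passing, whether or not its LP_DEAD bit is set.
     */
    for (size_t i = 0; i < ntuples; i++)
    {
        candidate[i] = block_is_listed(visit, nvisit, tuples[i].table_block);
        if (candidate[i])
            ncandidates++;
    }

    return ncandidates;
}
```

With five tuples pointing at table blocks 10, 10, 11, 12 and 12, of which only the first and fourth are LP_DEAD-marked, four tuples become candidates and no table block beyond 10 and 12 needs to be visited.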

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages. That is, it almost never fails
to delete at least a few extra index tuples. It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples. It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples. It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly). Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.
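
A sketch of the per-tuple decision that granular TID deletion implies, under the assumption that each TID in a posting list is summarized by two flags. The `PostingAction` enum and `classify_posting_tuple()` are hypothetical; nothing here is the signature of _bt_delitems_delete().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of examining one posting list tuple. */
typedef enum PostingAction
{
    POSTING_KEEP,               /* no TID can be removed */
    POSTING_UPDATE,             /* remove some TIDs, keep the tuple */
    POSTING_DELETE              /* every TID is removable: delete the tuple */
} PostingAction;

/*
 * block_visited[i]: the table block of TID i was visited during deletion.
 * dead_in_table[i]: the table says the version behind TID i is dead; this
 *                   is only trusted when the block was actually visited.
 */
PostingAction
classify_posting_tuple(const bool *block_visited, const bool *dead_in_table,
                       size_t ntids)
{
    size_t ndeletable = 0;

    assert(ntids > 0);

    for (size_t i = 0; i < ntids; i++)
        if (block_visited[i] && dead_in_table[i])
            ndeletable++;

    if (ndeletable == 0)
        return POSTING_KEEP;
    if (ndeletable == ntids)
        return POSTING_DELETE;
    return POSTING_UPDATE;
}
```

A TID whose table block was never visited stays in the posting list no matter what, which is why the starting set of index tuples only needs to be roughly right.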

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.
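
To make the record change concrete, here is a rough sketch of the byte count implied by the new xl_btree_delete layout. The `toy_` structs are stand-ins written for this example: the fixed-width types assume TransactionId is 32 bits and offset numbers are 16 bits, the body of xl_btree_update is not shown in the diff and is assumed to hold only `ndeletedtids`, and the helper says nothing about how the real code registers these pieces with the WAL machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the new xl_btree_delete header (assumed field widths). */
typedef struct toy_btree_delete
{
    uint32_t latestRemovedXid;
    uint16_t ndeleted;
    uint16_t nupdated;
} toy_btree_delete;

/* Stand-in for xl_btree_update; its body is an assumption of this sketch. */
typedef struct toy_btree_update
{
    uint16_t ndeletedtids;
} toy_btree_update;

#define TOY_SIZE_OF_DELETE \
    (offsetof(toy_btree_delete, nupdated) + sizeof(uint16_t))
#define TOY_SIZE_OF_UPDATE \
    (offsetof(toy_btree_update, ndeletedtids) + sizeof(uint16_t))

/*
 * Header, then the deleted and updated page offset number arrays, then one
 * update entry per updated tuple, each followed by its 0-based posting list
 * offsets.  ndeletedtids[i] is the count for the i-th updated tuple.
 */
size_t
toy_delete_record_size(uint16_t ndeleted, uint16_t nupdated,
                       const uint16_t *ndeletedtids)
{
    size_t size = TOY_SIZE_OF_DELETE;

    assert(nupdated == 0 || ndeletedtids != NULL);

    size += ((size_t) ndeleted + nupdated) * sizeof(uint16_t);

    for (uint16_t i = 0; i < nupdated; i++)
        size += TOY_SIZE_OF_UPDATE + (size_t) ndeletedtids[i] * sizeof(uint16_t);

    return size;
}
```

Under these assumptions the fixed header is 8 bytes, so a record that deletes 3 tuples and updates 2 posting list tuples (removing 1 and 4 TIDs respectively) describes 32 bytes in total.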

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes. Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade. The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
@@ -176,24 +176,6 @@ typedef struct xl_btree_dedup
 
 #define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(uint16))
 
-/*
- * This is what we need to know about delete of individual leaf index tuples.
- * The WAL record can represent deletion of any number of index tuples on a
- * single index page when *not* executed by VACUUM. Deletion of a subset of
- * the TIDs within a posting list tuple is not supported.
- *
- * Backup Blk 0: index page
- */
-typedef struct xl_btree_delete
-{
-    TransactionId latestRemovedXid;
-    uint32 ndeleted;
-
-    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
-} xl_btree_delete;
-
-#define SizeOfBtreeDelete (offsetof(xl_btree_delete, ndeleted) + sizeof(uint32))
-
 /*
  * This is what we need to know about page reuse within btree. This record
  * only exists to generate a conflict point for Hot Standby.
@@ -211,9 +193,61 @@ typedef struct xl_btree_reuse_page
 #define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
 
 /*
- * This is what we need to know about which TIDs to remove from an individual
- * posting list tuple during vacuuming. An array of these may appear at the
- * end of xl_btree_vacuum records.
+ * xl_btree_vacuum and xl_btree_delete records describe deletion of index
+ * tuples on a leaf page. The former variant is used by VACUUM, while the
+ * latter variant is used by the ad-hoc deletions that sometimes take place
+ * when btinsert() is called.
+ *
+ * The records are very similar. The only difference is that xl_btree_delete
+ * has to include a latestRemovedXid field to generate recovery conflicts.
+ * (VACUUM operations can just rely on earlier conflicts generated during
+ * pruning of the table whose TIDs the to-be-deleted index tuples point to.
+ * There are also small differences between each REDO routine that we don't go
+ * into here.)
+ *
+ * xl_btree_vacuum and xl_btree_delete both represent deletion of any number
+ * of index tuples on a single leaf page using page offset numbers. Both also
+ * support "updates" of index tuples, which is how deletes of a subset of TIDs
+ * contained in an existing posting list tuple are implemented.
+ *
+ * Updated posting list tuples are represented using xl_btree_update metadata.
+ * The REDO routines each use the xl_btree_update entries (plus each
+ * corresponding original index tuple from the target leaf page) to generate
+ * the final updated tuple.
+ *
+ * Updates are only used when there will be some remaining TIDs left by the
+ * REDO routine. Otherwise the posting list tuple just gets deleted outright.
+ */
+typedef struct xl_btree_vacuum
+{
+    uint16 ndeleted;
+    uint16 nupdated;
+
+    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
+    /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+    /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
+} xl_btree_vacuum;
+
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
+
+typedef struct xl_btree_delete
+{
+    TransactionId latestRemovedXid;
+    uint16 ndeleted;
+    uint16 nupdated;
+
+    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
+    /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+    /* UPDATED TUPLES METADATA (xl_btree_update) ARRAY FOLLOWS */
+} xl_btree_delete;
+
+#define SizeOfBtreeDelete (offsetof(xl_btree_delete, nupdated) + sizeof(uint16))
+
+/*
+ * The offsets that appear in xl_btree_update metadata are offsets into the
+ * original posting list from tuple, not page offset numbers. These are
+ * 0-based. The page offset number for the original posting list tuple comes
+ * from the main xl_btree_vacuum/xl_btree_delete record.
  */
 typedef struct xl_btree_update
 {
@@ -224,31 +258,6 @@ typedef struct xl_btree_update
 
 #define SizeOfBtreeUpdate (offsetof(xl_btree_update, ndeletedtids) + sizeof(uint16))
 
-/*
- * This is what we need to know about a VACUUM of a leaf page. The WAL record
- * can represent deletion of any number of index tuples on a single index page
- * when executed by VACUUM. It can also support "updates" of index tuples,
- * which is how deletes of a subset of TIDs contained in an existing posting
- * list tuple are implemented. (Updates are only used when there will be some
- * remaining TIDs once VACUUM finishes; otherwise the posting list tuple can
- * just be deleted).
- *
- * Updated posting list tuples are represented using xl_btree_update metadata.
- * The REDO routine uses each xl_btree_update (plus its corresponding original
- * index tuple from the target leaf page) to generate the final updated tuple.
- */
-typedef struct xl_btree_vacuum
-{
-    uint16 ndeleted;
-    uint16 nupdated;
-
-    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
-    /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
-    /* UPDATED TUPLES METADATA ARRAY FOLLOWS */
-} xl_btree_vacuum;
-
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
-
 /*
  * This is what we need to know about marking an empty subtree for deletion.
  * The target identifies the tuple removed from the parent page (note that we
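
The header comments in the diff describe how a REDO routine rebuilds an updated posting list tuple from the original tuple plus 0-based offsets of the deleted TIDs. A minimal sketch of that step, assuming the offsets arrive in ascending order and using made-up names (`ToyTid`, `apply_posting_update()`) rather than anything from nbtree:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a table TID (block number plus item offset). */
typedef struct ToyTid
{
    uint32_t block;
    uint16_t offset;
} ToyTid;

/*
 * Copies every TID from orig[] into out[] except those whose 0-based
 * position appears in deleted_offsets[].  Returns the number of TIDs kept.
 * deleted_offsets[] is assumed to be sorted in ascending order.
 */
size_t
apply_posting_update(const ToyTid *orig, size_t norig,
                     const uint16_t *deleted_offsets, size_t ndeleted,
                     ToyTid *out)
{
    size_t next_deleted = 0;
    size_t nkept = 0;

    /* Updates are only used when at least one TID will remain. */
    assert(ndeleted < norig);

    for (size_t i = 0; i < norig; i++)
    {
        if (next_deleted < ndeleted && deleted_offsets[next_deleted] == i)
        {
            next_deleted++;     /* this TID is being removed */
            continue;
        }
        out[nkept++] = orig[i];
    }

    /* Every offset must have matched a position in the original list. */
    assert(next_deleted == ndeleted);

    return nkept;
}
```

Note that the offsets index into the posting list, not into the page; the page offset number that locates the tuple itself comes from the main record.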