Enhance nbtree index tuple deletion.

Teach nbtree and heapam to cooperate in order to eagerly remove duplicate tuples representing dead MVCC versions. This is "bottom-up deletion". Each bottom-up deletion pass is triggered lazily in response to a flood of versions on an nbtree leaf page. This usually involves a "logically unchanged index" hint (these are produced by the executor mechanism added by commit 9dc718bd).

The immediate goal of bottom-up index deletion is to avoid "unnecessary" page splits caused entirely by version duplicates. It naturally has an even more useful effect, though: it acts as a backstop against accumulating an excessive number of index tuple versions for any given _logical row_. Bottom-up index deletion complements what we might now call "top-down index deletion": index vacuuming performed by VACUUM. Bottom-up index deletion responds to the immediate local needs of queries, while leaving it up to autovacuum to perform infrequent clean sweeps of the index. The overall effect is to avoid certain pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple deletion (the table_compute_xid_horizon_for_tuples() function) has been replaced with a new interface that supports certain new requirements. Many (perhaps all) of the capabilities added to nbtree by this commit could also be extended to other index AMs. That is left as work for a later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic to consider extra index tuples (that are not LP_DEAD-marked) for deletion in passing. This increases the number of index tuples deleted significantly in many cases. The LP_DEAD deletion process (which is now called "simple deletion" to clearly distinguish it from bottom-up deletion) won't usually need to visit any extra table blocks to check these extra tuples. We have to visit the same table blocks anyway to generate a latestRemovedXid value (at least in the common case where the index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement increases the number of index tuples deleted with almost any workload that has LP_DEAD bits set in leaf pages. That is, it almost never fails to delete at least a few extra index tuples. It helps most of all in cases that happen to naturally have a lot of delete-safe tuples. It's not uncommon for an individual deletion operation to end up deleting an order of magnitude more index tuples compared to the old naive approach (e.g., custom instrumentation of the patch shows that this happens fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up deletion in indexes that make use of deduplication: Teach nbtree's _bt_delitems_delete() function to support granular TID deletion in posting list tuples. It is now possible to delete individual TIDs from posting list tuples provided the TIDs have a tableam block number of a table block that gets visited as part of the deletion process (visiting the table block can be triggered directly or indirectly). Setting the LP_DEAD bit of a posting list tuple is still an all-or-nothing thing, but that matters much less now that deletion only needs to start out with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk representation of nbtree indexes. Indexes built on PostgreSQL 12 or PostgreSQL 13 will automatically benefit from bottom-up index deletion (i.e. no reindexing required) following a pg_upgrade. The enhancement to simple deletion is available with all B-Tree indexes following a pg_upgrade, no matter what PostgreSQL version the user upgrades from.
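
The granular TID deletion rule described in this message can be illustrated with a toy model. This is only a sketch, not the nbtree implementation: the names TID, prune_posting_list, visited_blocks, and dead_tids are hypothetical and do not correspond to PostgreSQL's C data structures. It shows one point only: a TID in a posting list is eligible for deletion only when its table block was visited during the deletion pass, so a dead TID in an unvisited block stays in place.

```python
from typing import NamedTuple


class TID(NamedTuple):
    """Toy stand-in for a table TID (hypothetical, for illustration only)."""
    block: int   # table block number
    offset: int  # item offset within that block


def prune_posting_list(posting, visited_blocks, dead_tids):
    """Split one posting list into (kept, deleted) TIDs.

    A TID is deleted only if its table block was visited by the
    deletion pass AND the visit found the table tuple to be dead.
    TIDs in unvisited blocks cannot be checked, so they are kept.
    """
    kept, deleted = [], []
    for tid in posting:
        if tid.block in visited_blocks and tid in dead_tids:
            deleted.append(tid)
        else:
            kept.append(tid)
    return kept, deleted


# Blocks 7 and 9 were visited; block 12 was not, so TID(12, 3) survives
# this pass even though it happens to be dead.
posting = [TID(7, 1), TID(7, 2), TID(9, 4), TID(12, 3)]
visited_blocks = {7, 9}
dead_tids = {TID(7, 2), TID(9, 4), TID(12, 3)}
kept, deleted = prune_posting_list(posting, visited_blocks, dead_tids)
```

In this model the posting list tuple shrinks rather than being removed as a whole, which mirrors why the all-or-nothing LP_DEAD bit on a posting list tuple matters less once individual TIDs can be checked.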

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
2025-08-28 18:48:04 +03:00 · 2021-01-13 09:21:32 -08:00
parent 9dc718bdf2
commit d168b66682
19 changed files with 2120 additions and 450 deletions
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     <para>
      The fillfactor for an index is a percentage that determines how full
      the index method will try to pack index pages.  For B-trees, leaf pages
-      are filled to this percentage during initial index build, and also
+      are filled to this percentage during initial index builds, and also
      when extending the index at the right (adding new largest key values).
      If pages
      subsequently become completely full, they will be split, leading to
-      gradual degradation in the index's efficiency.  B-trees use a default
+      fragmentation of the on-disk index structure.  B-trees use a default
      fillfactor of 90, but any integer value from 10 to 100 can be selected.
-      If the table is static then fillfactor 100 is best to minimize the
-      index's physical size, but for heavily updated tables a smaller
-      fillfactor is better to minimize the need for page splits.  The
-      other index methods use fillfactor in different but roughly analogous
-      ways; the default fillfactor varies between methods.
+     </para>
+     <para>
+      B-tree indexes on tables where many inserts and/or updates are
+      anticipated can benefit from lower fillfactor settings at
+      <command>CREATE INDEX</command> time (following bulk loading into the
+      table).  Values in the range of 50 - 90 can usefully <quote>smooth
+       out</quote> the <emphasis>rate</emphasis> of page splits during the
+      early life of the B-tree index (lowering fillfactor like this may even
+      lower the absolute number of page splits, though this effect is highly
+      workload dependent).  The B-tree bottom-up index deletion technique
+      described in <xref linkend="btree-deletion"/> is dependent on having
+      some <quote>extra</quote> space on pages to store <quote>extra</quote>
+      tuple versions, and so can be affected by fillfactor (though the effect
+      is usually not significant).
+     </para>
+     <para>
+      In other specific cases it might be useful to increase fillfactor to
+      100 at <command>CREATE INDEX</command> time as a way of maximizing
+      space utilization.  You should only consider this when you are
+      completely sure that the table is static (i.e. that it will never be
+      affected by either inserts or updates).  A fillfactor setting of 100
+      otherwise risks <emphasis>harming</emphasis> performance: even a few
+      updates or inserts will cause a sudden flood of page splits.
+     </para>
+     <para>
+      The other index methods use fillfactor in different but roughly
+      analogous ways; the default fillfactor varies between methods.
     </para>
    </listitem>
   </varlistentry>