Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion". Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure. Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper. A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own. The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas, Alexander Korotkov Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
2025-12-07 12:02:30 +03:00 · 2019-03-20 10:04:01 -07:00
parent e5adcb789d
commit dd299df818
29 changed files with 1619 additions and 559 deletions
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);

  <para>
   By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
   (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
   index can also be scanned backward, producing output satisfying
   <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
   the extra columns are trailing columns; making them be leading columns is
   unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
   However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
  </para>

  <para>