Add nbtree skip scan optimization.

Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit 5bf748b8. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
2025-07-28 23:42:10 +03:00 · 2025-04-04 12:27:04 -04:00
parent 3ba2cdaa45
commit 92fe23d93a
35 changed files with 3024 additions and 381 deletions
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@ -207,7 +207,7 @@

 <para>
  As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
-  one required and four optional support functions.  The five
+  one required and five optional support functions.  The six
  user-defined methods are:
 </para>
 <variablelist>
@ -583,6 +583,35 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
    </para>
   </listitem>
  </varlistentry>
+  <varlistentry>
+   <term><function>skipsupport</function></term>
+   <listitem>
+    <para>
+     Optionally, a btree operator family may provide a <firstterm>skip
+      support</firstterm> function, registered under support function number 6.
+     These functions give the B-tree code a way to iterate through every
+     possible value that can be represented by an operator class's underlying
+     input type, in key space order.  This is used by the core code when it
+     applies the skip scan optimization.  The APIs involved in this are
+     defined in <filename>src/include/utils/skipsupport.h</filename>.
+    </para>
+    <para>
+     Operator classes that do not provide a skip support function are still
+     eligible to use skip scan.  The core code can still use its fallback
+     strategy, though that might be suboptimal for some discrete types.  It
+     usually doesn't make sense (and may not even be feasible) for operator
+     classes on continuous types to provide a skip support function.
+    </para>
+    <para>
+     It is not sensible for an operator family to register a cross-type
+     <function>skipsupport</function> function, and attempting to do so will
+     result in an error.  This is because determining the next indexable value
+     must happen by incrementing a value copied from an index tuple.  The
+     values generated must all be of the same underlying data type (the
+     <quote>skipped</quote> index column's opclass input type).
+    </para>
+   </listitem>
+  </varlistentry>
 </variablelist>

 </sect2>
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@ -835,7 +835,8 @@ amrestrpos (IndexScanDesc scan);
  <para>
 <programlisting>
 Size
-amestimateparallelscan (int nkeys,
+amestimateparallelscan (Relation indexRelation,
+                        int nkeys,
                        int norderbys);
 </programlisting>
   Estimate and return the number of bytes of dynamic shared memory which
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@ -460,20 +460,56 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
   efficient when there are constraints on the leading (leftmost) columns.
   The exact rule is that equality constraints on leading columns, plus
   any inequality constraints on the first column that does not have an
-   equality constraint, will be used to limit the portion of the index
+   equality constraint, will always be used to limit the portion of the index
   that is scanned.  Constraints on columns to the right of these columns
-   are checked in the index, so they save visits to the table proper, but
-   they do not reduce the portion of the index that has to be scanned.
+   are checked in the index, so they'll always save visits to the table
+   proper, but they do not necessarily reduce the portion of the index that
+   has to be scanned.  If a B-tree index scan can apply the skip scan
+   optimization effectively, it will apply every column constraint when
+   navigating through the index via repeated index searches.  This can reduce
+   the portion of the index that has to be read, even though one or more
+   columns (prior to the least significant index column from the query
+   predicate) lacks a conventional equality constraint.  Skip scan works by
+   generating a dynamic equality constraint internally, that matches every
+   possible value in an index column (though only given a column that lacks
+   an equality constraint that comes from the query predicate, and only when
+   the generated constraint can be used in conjunction with a later column
+   constraint from the query predicate).
+  </para>
+
+  <para>
+   For example, given an index on <literal>(x, y)</literal>, and a query
+   condition <literal>WHERE y = 7700</literal>, a B-tree index scan might be
+   able to apply the skip scan optimization.  This generally happens when the
+   query planner expects that repeated <literal>WHERE x = N AND y = 7700</literal>
+   searches for every possible value of <literal>N</literal> (or for every
+   <literal>x</literal> value that is actually stored in the index) is the
+   fastest possible approach, given the available indexes on the table.  This
+   approach is generally only taken when there are so few distinct
+   <literal>x</literal> values that the planner expects the scan to skip over
+   most of the index (because most of its leaf pages cannot possibly contain
+   relevant tuples).  If there are many distinct <literal>x</literal> values,
+   then the entire index will have to be scanned, so in most cases the planner
+   will prefer a sequential table scan over using the index.
+  </para>
+
+  <para>
+   The skip scan optimization can also be applied selectively, during B-tree
+   scans that have at least some useful constraints from the query predicate.
   For example, given an index on <literal>(a, b, c)</literal> and a
   query condition <literal>WHERE a = 5 AND b &gt;= 42 AND c &lt; 77</literal>,
-   the index would have to be scanned from the first entry with
-   <literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
-   <literal>a</literal> = 5.  Index entries with <literal>c</literal> &gt;= 77 would be
-   skipped, but they'd still have to be scanned through.
-   This index could in principle be used for queries that have constraints
-   on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
-   &mdash; but the entire index would have to be scanned, so in most cases
-   the planner would prefer a sequential table scan over using the index.
+   the index might have to be scanned from the first entry with
+   <literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last
+   entry with <literal>a</literal> = 5.  Index entries with
+   <literal>c</literal> &gt;= 77 will never need to be filtered at the table
+   level, but it may or may not be profitable to skip over them within the
+   index.  When skipping takes place, the scan starts a new index search to
+   reposition itself from the end of the current <literal>a</literal> = 5 and
+   <literal>b</literal> = N grouping (i.e. from the position in the index
+   where the first tuple <literal>a = 5 AND b = N AND c &gt;= 77</literal>
+   appears), to the start of the next such grouping (i.e. the position in the
+   index where the first tuple <literal>a = 5 AND b = N + 1</literal>
+   appears).
  </para>

  <para>
@ -669,9 +705,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
   multicolumn index on <literal>(x, y)</literal>.  This index would typically be
   more efficient than index combination for queries involving both
   columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
-   would be almost useless for queries involving only <literal>y</literal>, so it
-   should not be the only index.  A combination of the multicolumn index
-   and a separate index on <literal>y</literal> would serve reasonably well.  For
+   would be less useful for queries involving only <literal>y</literal>.  Just
+   how useful will depend on how effective the B-tree index skip scan
+   optimization is; if <literal>x</literal> has no more than several hundred
+   distinct values, skip scan will make searches for specific
+   <literal>y</literal> values execute reasonably efficiently.  A combination
+   of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+   <literal>y</literal> might also serve reasonably well.  For
   queries involving only <literal>x</literal>, the multicolumn index could be
   used, though it would be larger and hence slower than an index on
   <literal>x</literal> alone.  The last alternative is to create all three
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@ -4263,7 +4263,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
     <replaceable>column_name</replaceable> =
     <replaceable>value2</replaceable> ...</literal> construct, though only
    when the optimizer transforms the construct into an equivalent
-    multi-valued array representation.
+    multi-valued array representation.  Similarly, when B-tree index scans use
+    the skip scan optimization, an index search is performed each time the
+    scan is repositioned to the next index leaf page that might have matching
+    tuples (see <xref linkend="indexes-multicolumn"/>).
   </para>
  </note>
  <tip>
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@ -860,6 +860,38 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE thousand IN (1, 2, 3, 4);
    <structname>tenk1_thous_tenthous</structname> index leaf page.
   </para>

+   <para>
+    The <quote>Index Searches</quote> line is also useful with B-tree index
+    scans that apply the <firstterm>skip scan</firstterm> optimization to
+    more efficiently traverse through an index:
+<screen>
+EXPLAIN ANALYZE SELECT four, unique1 FROM tenk1 WHERE four BETWEEN 1 AND 3 AND unique1 = 42;
+                                                              QUERY PLAN
+-------------------------------------------------------------------&zwsp;---------------------------------------------------------------
+ Index Only Scan using tenk1_four_unique1_idx on tenk1  (cost=0.29..6.90 rows=1 width=8) (actual time=0.006..0.007 rows=1.00 loops=1)
+   Index Cond: ((four &gt;= 1) AND (four &lt;= 3) AND (unique1 = 42))
+   Heap Fetches: 0
+   Index Searches: 3
+   Buffers: shared hit=7
+ Planning Time: 0.029 ms
+ Execution Time: 0.012 ms
+</screen>
+
+    Here we see an Index-Only Scan node using
+    <structname>tenk1_four_unique1_idx</structname>, a multi-column index on the
+    <structname>tenk1</structname> table's <structfield>four</structfield> and
+    <structfield>unique1</structfield> columns.  The scan performs 3 searches
+    that each read a single index leaf page:
+    <quote><literal>four = 1 AND unique1 = 42</literal></quote>,
+    <quote><literal>four = 2 AND unique1 = 42</literal></quote>, and
+    <quote><literal>four = 3 AND unique1 = 42</literal></quote>.  This index
+    is generally a good target for skip scan, since, as discussed in
+    <xref linkend="indexes-multicolumn"/>, its leading column (the
+    <structfield>four</structfield> column) contains only 4 distinct values,
+    while its second/final column (the <structfield>unique1</structfield>
+    column) contains many distinct values.
+   </para>
+
   <para>
    Another type of extra information is the number of rows removed by a
    filter condition:
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@ -461,6 +461,13 @@
       </entry>
       <entry>5</entry>
      </row>
+      <row>
+       <entry>
+        Return the addresses of C-callable skip support function(s)
+        (optional)
+       </entry>
+       <entry>6</entry>
+      </row>
     </tbody>
    </tgroup>
   </table>
@ -1062,7 +1069,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
  FUNCTION 1 btint8cmp(int8, int8) ,
  FUNCTION 2 btint8sortsupport(internal) ,
  FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
-  FUNCTION 4 btequalimage(oid) ;
+  FUNCTION 4 btequalimage(oid) ,
+  FUNCTION 6 btint8skipsupport(internal) ;

 CREATE OPERATOR CLASS int4_ops
 DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@ -1075,7 +1083,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
  FUNCTION 1 btint4cmp(int4, int4) ,
  FUNCTION 2 btint4sortsupport(internal) ,
  FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
-  FUNCTION 4 btequalimage(oid) ;
+  FUNCTION 4 btequalimage(oid) ,
+  FUNCTION 6 btint4skipsupport(internal) ;

 CREATE OPERATOR CLASS int2_ops
 DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@ -1088,7 +1097,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
  FUNCTION 1 btint2cmp(int2, int2) ,
  FUNCTION 2 btint2sortsupport(internal) ,
  FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
-  FUNCTION 4 btequalimage(oid) ;
+  FUNCTION 4 btequalimage(oid) ,
+  FUNCTION 6 btint2skipsupport(internal) ;

 ALTER OPERATOR FAMILY integer_ops USING btree ADD
  -- cross-type comparisons int8 vs int2