mirror of https://github.com/postgres/postgres.git (synced 2025-10-25 13:17:41 +03:00)

Add nbtree skip scan optimization.

Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns.  When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan.  The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
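
As a rough illustration of the repositioning described above, here is a toy
sketch in C.  It is not nbtree code: ToyTuple, toy_lower_bound and
toy_skip_scan are hypothetical names, the "index" is a plain array of
(a, b) pairs sorted by a and then b, and each binary search stands in for
one index descent.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy "index" entry: tuples are kept sorted by a, then b (an assumption). */
typedef struct ToyTuple
{
	int			a;
	int			b;
} ToyTuple;

/* Return the first position whose tuple is >= (a, b) in index order. */
static size_t
toy_lower_bound(const ToyTuple *idx, size_t n, int a, int b)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (idx[mid].a < a || (idx[mid].a == a && idx[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Count tuples with b == target_b without reading every tuple.  Each
 * distinct "a" value plays the role of one skip array element.
 * *ndescents reports how many binary searches ("descents") were used.
 */
static size_t
toy_skip_scan(const ToyTuple *idx, size_t n, int target_b, size_t *ndescents)
{
	size_t		matches = 0;
	size_t		pos = 0;

	*ndescents = 0;
	while (pos < n)
	{
		int			cur_a = idx[pos].a;	/* current skip array element */

		/* Reposition using both keys: a = cur_a AND b = target_b */
		pos = toy_lower_bound(idx, n, cur_a, target_b);
		(*ndescents)++;

		while (pos < n && idx[pos].a == cur_a && idx[pos].b == target_b)
		{
			matches++;
			pos++;
		}

		/* Still inside cur_a's group?  Skip ahead to the next "a" value. */
		if (pos < n && idx[pos].a == cur_a)
		{
			if (cur_a == INT_MAX)
				break;			/* no larger "a" value can exist */
			pos = toy_lower_bound(idx, n, cur_a + 1, INT_MIN);
			(*ndescents)++;
		}
	}
	return matches;
}
```

The number of descents in this sketch grows with the number of distinct "a"
values, not with the number of tuples, which is the property the real
optimization relies on.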

A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays.  Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys.  This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.

Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan).  In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.

The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index.  Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument.  These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
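
The two ways of advancing a skip array described above can be sketched as
follows.  This is a simplified model, not the actual skip support API:
ToySkipKey, toy_increment_fn, toy_int_increment and toy_advance are
hypothetical, and TOY_SK_NEXT merely reuses the bit value that the diff
assigns to SK_BT_NEXT.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SK_NEXT 0x00200000	/* same bit value as SK_BT_NEXT in the diff */

/* Hypothetical stand-in for an opclass "increment" callback */
typedef int (*toy_increment_fn) (int existing, bool *overflow);

/* Hypothetical, much reduced scan key: just an argument and flag bits */
typedef struct ToySkipKey
{
	int			sk_argument;
	int			sk_flags;
} ToySkipKey;

/* Increment callback for a discrete int type; reports overflow at INT_MAX */
static int
toy_int_increment(int existing, bool *overflow)
{
	if (existing == INT_MAX)
	{
		*overflow = true;
		return existing;
	}
	*overflow = false;
	return existing + 1;
}

/*
 * Advance a skip array key to its next element.  Returns false once the
 * array is exhausted.  With a callback, sk_argument itself changes.
 * Without one, sk_argument is left alone and the NEXT flag is set, so the
 * key means "the first value greater than sk_argument".
 */
static bool
toy_advance(ToySkipKey *key, toy_increment_fn increment)
{
	bool		overflow;

	if (increment == NULL)
	{
		key->sk_flags |= TOY_SK_NEXT;
		return true;
	}

	key->sk_flags &= ~TOY_SK_NEXT;
	key->sk_argument = increment(key->sk_argument, &overflow);
	return !overflow;
}
```

In this model a key carrying the NEXT flag can never match a tuple whose
value equals sk_argument, which mirrors the sentinel behavior described in
the paragraph above.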

A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66".  Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66".  However,
improved performance isn't guaranteed.  Much depends on physical index
characteristics.
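
To make the "x BETWEEN 1 AND 2" example concrete, here is a minimal sketch
of a bounded skip array over an int column.  ToyBoundedSkipArray and the
toy_* functions are hypothetical; the low/high fields only loosely
correspond to the low_compare and high_compare members that the diff adds
to BTArrayKeyInfo.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical bounded skip array for an int column */
typedef struct ToyBoundedSkipArray
{
	int			low;			/* lower bound value */
	bool		low_inclusive;	/* true for >=, false for > */
	int			high;			/* upper bound value */
	bool		high_inclusive;	/* true for <=, false for < */
} ToyBoundedSkipArray;

/* Does value satisfy the array's upper bound? */
static bool
toy_within_high(const ToyBoundedSkipArray *arr, int value)
{
	return arr->high_inclusive ? value <= arr->high : value < arr->high;
}

/* Store the array's first element in *elem; false if the range is empty */
static bool
toy_first_elem(const ToyBoundedSkipArray *arr, int *elem)
{
	int			first = arr->low;

	if (!arr->low_inclusive)
	{
		if (arr->low == INT_MAX)
			return false;		/* nothing is greater than INT_MAX */
		first = arr->low + 1;
	}
	if (!toy_within_high(arr, first))
		return false;
	*elem = first;
	return true;
}

/* Step *elem to the next element; false once the upper bound is passed */
static bool
toy_next_elem(const ToyBoundedSkipArray *arr, int *elem)
{
	int			next;

	if (*elem == INT_MAX)
		return false;
	next = *elem + 1;
	if (!toy_within_high(arr, next))
		return false;
	*elem = next;
	return true;
}
```

For bounds of 1 and 2 (both inclusive) the sketch generates exactly the
elements 1 and 2, matching the comparable SAOP qual quoted above.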

B-Tree preprocessing is optimistic that skipping will pay off: it
applies static, generic rules when determining where to generate skip
arrays, which assumes that the runtime overhead of maintaining skip
arrays will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
@@ -214,7 +214,8 @@ typedef void (*amrestrpos_function) (IndexScanDesc scan);
  */
 
 /* estimate size of parallel scan descriptor */
-typedef Size (*amestimateparallelscan_function) (int nkeys, int norderbys);
+typedef Size (*amestimateparallelscan_function) (Relation indexRelation,
+												 int nkeys, int norderbys);
 
 /* prepare for parallel index scan */
 typedef void (*aminitparallelscan_function) (void *target);

@@ -24,6 +24,7 @@
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
 #include "storage/shm_toc.h"
+#include "utils/skipsupport.h"
 
 /* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
 typedef uint16 BTCycleId;
@@ -707,6 +708,10 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
  *	(BTOPTIONS_PROC).  These procedures define a set of user-visible
  *	parameters that can be used to control operator class behavior.  None of
  *	the built-in B-Tree operator classes currently register an "options" proc.
+ *
+ *	To facilitate more efficient B-Tree skip scans, an operator class may
+ *	choose to offer a sixth amproc procedure (BTSKIPSUPPORT_PROC).  For full
+ *	details, see src/include/utils/skipsupport.h.
  */
 
 #define BTORDER_PROC		1
@@ -714,7 +719,8 @@ BTreeTupleGetMaxHeapTID(IndexTuple itup)
 #define BTINRANGE_PROC		3
 #define BTEQUALIMAGE_PROC	4
 #define BTOPTIONS_PROC		5
-#define BTNProcs			5
+#define BTSKIPSUPPORT_PROC	6
+#define BTNProcs			6
 
 /*
  *	We need to be able to tell the difference between read and write
@@ -1027,10 +1033,21 @@ typedef BTScanPosData *BTScanPos;
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
 {
+	/* fields set for both kinds of array (SAOP arrays and skip arrays) */
 	int			scan_key;		/* index of associated key in keyData */
-	int			cur_elem;		/* index of current element in elem_values */
-	int			num_elems;		/* number of elems in current array value */
+	int			num_elems;		/* number of elems (-1 means skip array) */
+
+	/* fields set for ScalarArrayOpExpr arrays only */
 	Datum	   *elem_values;	/* array of num_elems Datums */
+	int			cur_elem;		/* index of current element in elem_values */
+
+	/* fields set for skip arrays only */
+	int16		attlen;			/* attr's length, in bytes */
+	bool		attbyval;		/* attr's FormData_pg_attribute.attbyval */
+	bool		null_elem;		/* NULL is lowest/highest element? */
+	SkipSupport sksup;			/* skip support (NULL if opclass lacks it) */
+	ScanKey		low_compare;	/* array's > or >= lower bound */
+	ScanKey		high_compare;	/* array's < or <= upper bound */
 } BTArrayKeyInfo;
 
 typedef struct BTScanOpaqueData
@@ -1119,6 +1136,15 @@ typedef struct BTReadPageState
  */
 #define SK_BT_REQFWD	0x00010000	/* required to continue forward scan */
 #define SK_BT_REQBKWD	0x00020000	/* required to continue backward scan */
+#define SK_BT_SKIP		0x00040000	/* skip array on column without input = */
+
+/* SK_BT_SKIP-only flags (set and unset by array advancement) */
+#define SK_BT_MINVAL	0x00080000	/* invalid sk_argument, use low_compare */
+#define SK_BT_MAXVAL	0x00100000	/* invalid sk_argument, use high_compare */
+#define SK_BT_NEXT		0x00200000	/* positions the scan > sk_argument */
+#define SK_BT_PRIOR		0x00400000	/* positions the scan < sk_argument */
+
+/* Remaps pg_index flag bits to uppermost SK_BT_* byte */
 #define SK_BT_INDOPTION_SHIFT  24	/* must clear the above bits */
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1165,7 +1191,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 					 bool indexUnchanged,
 					 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
-extern Size btestimateparallelscan(int nkeys, int norderbys);
+extern Size btestimateparallelscan(Relation rel, int nkeys, int norderbys);
 extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
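
The SK_BT_* definitions in the diff come with the note that
SK_BT_INDOPTION_SHIFT "must clear the above bits".  The sketch below checks
that property for the new skip array flags and shows one way a caller might
tell whether sk_argument is usable.  The SK_BT_* values are copied from the
diff; the TOY_INDOPTION_* values are assumptions about the pg_index flag
bits (they are not shown here), and toy_flags_disjoint and
toy_arg_is_sentinel are hypothetical helpers, not nbtree functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the diff */
#define SK_BT_SKIP		0x00040000
#define SK_BT_MINVAL	0x00080000
#define SK_BT_MAXVAL	0x00100000
#define SK_BT_NEXT		0x00200000
#define SK_BT_PRIOR		0x00400000
#define SK_BT_INDOPTION_SHIFT  24

/* Assumed pg_index per-column flag bits (not part of the diff) */
#define TOY_INDOPTION_DESC			0x0001
#define TOY_INDOPTION_NULLS_FIRST	0x0002

/* True when the skip array flags stay clear of the remapped flag byte */
static bool
toy_flags_disjoint(void)
{
	unsigned	skip_flags = SK_BT_SKIP | SK_BT_MINVAL | SK_BT_MAXVAL |
		SK_BT_NEXT | SK_BT_PRIOR;
	unsigned	shifted = (TOY_INDOPTION_DESC | TOY_INDOPTION_NULLS_FIRST)
		<< SK_BT_INDOPTION_SHIFT;

	return (skip_flags & shifted) == 0 &&
		skip_flags < (1u << SK_BT_INDOPTION_SHIFT);
}

/*
 * True when a skip array key's sk_argument must not be read directly,
 * that is, when MINVAL or MAXVAL is set.  NEXT and PRIOR keep a valid
 * sk_argument and only change where the scan is positioned.
 */
static bool
toy_arg_is_sentinel(int sk_flags)
{
	return (sk_flags & SK_BT_SKIP) != 0 &&
		(sk_flags & (SK_BT_MINVAL | SK_BT_MAXVAL)) != 0;
}
```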