mirror of https://github.com/postgres/postgres.git synced 2025-10-19 15:49:24 +03:00

Enhance nbtree ScalarArrayOp execution.

Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient.  All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
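
As a rough illustration of why clustered array keys now need fewer index descents, the following toy model counts descents for a sorted array of keys over a handful of leaf pages. Everything here is a hypothetical stand-in (leaf pages are reduced to their high keys, and toy_find_leaf and toy_count_descents are invented names), and it ignores nbtree's real heuristics for choosing between stepping to the next page and starting another primitive scan.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only: a "leaf page" is just its inclusive high key, and pages
 * are ordered left to right.  The rightmost page is treated as unbounded.
 */
int
toy_find_leaf(const int *highkeys, int npages, int key)
{
    assert(npages > 0);
    for (int i = 0; i < npages; i++)
    {
        if (key <= highkeys[i])
            return i;
    }
    return npages - 1;          /* key is beyond every high key */
}

/*
 * Count index descents for a sorted array of keys.
 *
 * per_element = true models one primitive scan (one descent) per array
 * element.  per_element = false models descending again only when the next
 * element lives on a different leaf page than the previous one.
 */
int
toy_count_descents(const int *highkeys, int npages,
                   const int *elems, int nelems, bool per_element)
{
    int         descents = 0;
    int         curpage = -1;   /* no page visited yet */

    for (int i = 0; i < nelems; i++)
    {
        int         page = toy_find_leaf(highkeys, npages, elems[i]);

        if (per_element || page != curpage)
            descents++;
        curpage = page;
    }
    return descents;
}
```

With high keys {100, 200, 300} and array keys {1, 2, 3, 150, 151, 299}, the per-element model performs six descents while the page-aware model performs three, one per distinct leaf page.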

The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value.  Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan.  Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
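
The binary-search advancement described above can be sketched roughly as follows, for a forward scan over a single integer array. ToyArrayKey and toy_advance_array are hypothetical names for illustration, not nbtree's BTArrayKeyInfo or its advancement routine, and the sketch assumes the array elements are already sorted and deduplicated.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for one array scan key: sorted, deduplicated
 * elements plus a cursor.  This is NOT nbtree's BTArrayKeyInfo.
 */
typedef struct ToyArrayKey
{
    const int  *elems;          /* sorted ascending, no duplicates */
    int         num_elems;
    int         cur_elem;       /* index of the current element */
} ToyArrayKey;

/*
 * Forward-scan advancement: binary search for the first element that is
 * >= the next tuple's attribute value.
 *
 * Returns true and updates cur_elem when such an element exists.  Returns
 * false, leaving cur_elem alone, when every remaining element is < tupval
 * (the array is exhausted in this scan direction).
 */
bool
toy_advance_array(ToyArrayKey *key, int tupval)
{
    int         low = key->cur_elem;    /* ratchet: never search backwards */
    int         high = key->num_elems;

    assert(low >= 0 && low <= high);
    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        if (key->elems[mid] < tupval)
            low = mid + 1;
        else
            high = mid;
    }
    if (low == key->num_elems)
        return false;
    key->cur_elem = low;
    return true;
}
```

Starting the search at cur_elem is what makes the cursor behave like a ratchet in this sketch: a tuple value below the current element leaves the cursor where it is instead of moving it back.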

Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute.  These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes.  They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem).  They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).

There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions.  Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.

Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
Peter Geoghegan
2024-04-06 11:47:10 -04:00
parent ddd9e43a92
commit 5bf748b86b
22 changed files with 3470 additions and 562 deletions

View File

@@ -194,7 +194,7 @@ typedef void (*amrestrpos_function) (IndexScanDesc scan);
*/
/* estimate size of parallel scan descriptor */
-typedef Size (*amestimateparallelscan_function) (void);
+typedef Size (*amestimateparallelscan_function) (int nkeys, int norderbys);
/* prepare for parallel index scan */
typedef void (*aminitparallelscan_function) (void *target);
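
For an index AM maintained as an extension, adapting to the new callback signature might look roughly like the sketch below. Everything except the amestimateparallelscan_function typedef is an assumption made for illustration: Size is a local stand-in for PostgreSQL's type so the fragment compiles on its own, and MyAmParallelScanData and myam_estimateparallelscan are hypothetical names for an AM that keeps one int of shared state per scan key.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t Size;            /* stand-in for PostgreSQL's Size */

/* New callback type, matching the declaration in the hunk above */
typedef Size (*amestimateparallelscan_function) (int nkeys, int norderbys);

/* Hypothetical shared parallel-scan state for an extension AM */
typedef struct MyAmParallelScanData
{
    int         next_block;         /* next block to hand out to a worker */
    int         per_key_state[];    /* one slot per scan key (assumption) */
} MyAmParallelScanData;

/*
 * Hypothetical callback: size the shared descriptor using the number of
 * scan keys.  This toy AM has no use for norderbys.
 */
Size
myam_estimateparallelscan(int nkeys, int norderbys)
{
    (void) norderbys;
    assert(nkeys >= 0);
    return offsetof(MyAmParallelScanData, per_key_state) +
        sizeof(int) * (size_t) nkeys;
}
```

An AM whose shared state has a fixed size can simply ignore both arguments and keep returning the same value as before.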

View File

@@ -165,7 +165,8 @@ extern void index_rescan(IndexScanDesc scan,
extern void index_endscan(IndexScanDesc scan);
extern void index_markpos(IndexScanDesc scan);
extern void index_restrpos(IndexScanDesc scan);
-extern Size index_parallelscan_estimate(Relation indexRelation, Snapshot snapshot);
+extern Size index_parallelscan_estimate(Relation indexRelation,
+int nkeys, int norderbys, Snapshot snapshot);
extern void index_parallelscan_initialize(Relation heapRelation,
Relation indexRelation, Snapshot snapshot,
ParallelIndexScanDesc target);

View File

@@ -960,11 +960,20 @@ typedef struct BTScanPosData
* moreLeft and moreRight track whether we think there may be matching
* index entries to the left and right of the current page, respectively.
* We can clear the appropriate one of these flags when _bt_checkkeys()
-* returns continuescan = false.
+* sets BTReadPageState.continuescan = false.
*/
bool moreLeft;
bool moreRight;
+/*
+* Direction of the scan at the time that _bt_readpage was called.
+*
+* Used by btrestrpos to "restore" the scan's array keys by resetting each
+* array to its first element's value (first in this scan direction). This
+* avoids the need to directly track the array keys in btmarkpos.
+*/
+ScanDirection dir;
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
* location in the associated tuple storage workspace.
@@ -1022,9 +1031,8 @@ typedef BTScanPosData *BTScanPos;
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
typedef struct BTArrayKeyInfo
{
-int scan_key; /* index of associated key in arrayKeyData */
+int scan_key; /* index of associated key in keyData */
int cur_elem; /* index of current element in elem_values */
-int mark_elem; /* index of marked element in elem_values */
int num_elems; /* number of elems in current array value */
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
@@ -1037,14 +1045,11 @@ typedef struct BTScanOpaqueData
ScanKey keyData; /* array of preprocessed scan keys */
/* workspace for SK_SEARCHARRAY support */
-ScanKey arrayKeyData; /* modified copy of scan->keyData */
-bool arraysStarted; /* Started array keys, but have yet to "reach
-* past the end" of all arrays? */
-int numArrayKeys; /* number of equality-type array keys (-1 if
-* there are any unsatisfiable array keys) */
-int arrayKeyCount; /* count indicating number of array scan keys
-* processed */
+int numArrayKeys; /* number of equality-type array keys */
+bool needPrimScan; /* New prim scan to continue in current dir? */
+bool scanBehind; /* Last array advancement matched -inf attr? */
BTArrayKeyInfo *arrayKeys; /* info about each equality-type array key */
+FmgrInfo *orderProcs; /* ORDER procs for required equality keys */
MemoryContext arrayContext; /* scan-lifespan context for array data */
/* info about killed items if any (killedItems is NULL if never used) */
@@ -1075,6 +1080,42 @@ typedef struct BTScanOpaqueData
typedef BTScanOpaqueData *BTScanOpaque;
+/*
+* _bt_readpage state used across _bt_checkkeys calls for a page
+*/
+typedef struct BTReadPageState
+{
+/* Input parameters, set by _bt_readpage for _bt_checkkeys */
+ScanDirection dir; /* current scan direction */
+OffsetNumber minoff; /* Lowest non-pivot tuple's offset */
+OffsetNumber maxoff; /* Highest non-pivot tuple's offset */
+IndexTuple finaltup; /* Needed by scans with array keys */
+BlockNumber prev_scan_page; /* previous _bt_parallel_release block */
+Page page; /* Page being read */
+/* Per-tuple input parameters, set by _bt_readpage for _bt_checkkeys */
+OffsetNumber offnum; /* current tuple's page offset number */
+/* Output parameter, set by _bt_checkkeys for _bt_readpage */
+OffsetNumber skip; /* Array keys "look ahead" skip offnum */
+bool continuescan; /* Terminate ongoing (primitive) index scan? */
+/*
+* Input and output parameters, set and unset by both _bt_readpage and
+* _bt_checkkeys to manage precheck optimizations
+*/
+bool prechecked; /* precheck set continuescan to 'true'? */
+bool firstmatch; /* at least one match so far? */
+/*
+* Private _bt_checkkeys state used to manage "look ahead" optimization
+* (only used during scans with array keys)
+*/
+int16 rechecks;
+int16 targetdistance;
+} BTReadPageState;
/*
* We use some private sk_flags bits in preprocessed scan keys. We're allowed
* to use bits 16-31 (see skey.h). The uppermost bits are copied from the
@@ -1128,7 +1169,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
bool indexUnchanged,
struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
-extern Size btestimateparallelscan(void);
+extern Size btestimateparallelscan(int nkeys, int norderbys);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
@@ -1149,10 +1190,12 @@ extern bool btcanreturn(Relation index, int attno);
/*
* prototypes for internal functions in nbtree.c
*/
-extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno,
+bool first);
extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
-extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+extern void _bt_parallel_primscan_schedule(IndexScanDesc scan,
+BlockNumber prev_scan_page);
/*
* prototypes for functions in nbtdedup.c
@@ -1243,15 +1286,11 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
*/
extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
-extern void _bt_preprocess_array_keys(IndexScanDesc scan);
+extern bool _bt_start_prim_scan(IndexScanDesc scan, ScanDirection dir);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
-extern void _bt_mark_array_keys(IndexScanDesc scan);
-extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
-int tupnatts, ScanDirection dir, bool *continuescan,
-bool requiredMatchedByPrecheck, bool haveFirstMatch);
+extern bool _bt_checkkeys(IndexScanDesc scan, BTReadPageState *pstate, bool arrayKeys,
+IndexTuple tuple, int tupnatts);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);