mirror of https://github.com/postgres/postgres.git synced 2025-10-25 13:17:41 +03:00

Make row compares robust during nbtree array scans.

Recent nbtree bugfix commit 5f4d98d4 added a special case to the code
that sets up a page-level prefix of keys that are definitely satisfied
by every tuple on the page: whenever _bt_set_startikey reached a row
compare key, we'd refuse to apply the pstate.forcenonrequired behavior
in scans where that usually happens (scans with a higher-order array
key).  That hack made the scan avoid essentially the same infinite
cycling behavior that also affected nbtree scans with redundant keys
(keys that preprocessing could not eliminate) prior to commit f09816a0.
There are now serious doubts about this row compare workaround.

Testing has shown that a scan with a row compare key and an array key
could still read the same leaf page twice (without the scan's direction
changing), which isn't supposed to be possible following the SAOP
enhancements added by Postgres 17 commit 5bf748b8.  Also, we still
allowed a required row compare key to be used with forcenonrequired mode
when its header key happened to be beyond the pstate.ikey set by
_bt_set_startikey, which was complicated and brittle.

The underlying problem was that row compares had inconsistent rules
around how scans start (which keys can be used for initial positioning
purposes) and how scans end (which keys can set continuescan=false).
Quals with redundant keys that could not be eliminated by preprocessing
also had that same quality to them prior to today's bugfix f09816a0.  It
now seems prudent to bring row compare keys in line with the new charter
for required keys, by making the start and end rules symmetric.

This commit fixes two points of disagreement between _bt_first and
_bt_check_rowcompare.  Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after a required row compare
member.  _bt_first now has symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes.  It could use fewer or more keys than
_bt_check_rowcompare.  _bt_first now uses the same requiredness markings
as _bt_check_rowcompare for this.

Now that _bt_first and _bt_check_rowcompare agree on how to start and
end scans, we can get rid of the forcenonrequired special case, without
any risk of infinite cycling.  This approach also makes row compare keys
behave more like regular scalar keys, particularly within _bt_first.

Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares were marked required by preprocessing: we
didn't mark any lower-order row members required following 2016 bugfix
commit a298a1e0.  That approach was overly broad.  The bug in question was
actually an oversight in how _bt_check_rowcompare dealt with tuple NULL
values that failed to satisfy a scan key marked required in the opposite
scan direction (it was a bug in 2011 commits 6980f817 and 882368e8, not
a bug in 2006 commit 3a0a16cb).  Go back to marking row compare members
as required using the original 2006 rules, and fix the 2016 bug in a
more principled way: by limiting use of the "set continuescan=false with
a key required in the opposite scan direction upon encountering a NULL
tuple value" optimization to the first/most significant row member key.
While it isn't safe to use an implied IS NOT NULL qualifier to end the
scan when it comes from a required lower-order row compare member key,
it _is_ generally safe for such a required member key to end the scan --
provided the key is marked required in the _current_ scan direction.
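
The NULL-handling rule described above can be illustrated with a small
standalone sketch.  This is not the nbtree code: the TOY_* flags and
toy_null_ends_scan() are hypothetical stand-ins for SK_BT_REQFWD,
SK_BT_REQBKWD and the isNull branch of _bt_check_rowcompare, and the
forcenonrequired case is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SK_BT_REQFWD / SK_BT_REQBKWD */
#define TOY_REQFWD	0x1
#define TOY_REQBKWD	0x2

/*
 * Decide whether a NULL tuple value seen by one row compare member may set
 * continuescan=false.  A member required in the current scan direction can
 * always end the scan.  A member required only in the opposite direction
 * can end it only when it is the first (most significant) row member.
 */
static bool
toy_null_ends_scan(int member_flags, bool is_first_member,
				   bool nulls_first, bool scan_backward)
{
	int			reqflags;

	assert((member_flags & ~(TOY_REQFWD | TOY_REQBKWD)) == 0);

	if (nulls_first)
	{
		/* NULLs sort first: only a backwards scan can run out of matches */
		reqflags = TOY_REQBKWD;
		if (is_first_member)
			reqflags |= TOY_REQFWD;	/* safe, first row member */
		return (member_flags & reqflags) != 0 && scan_backward;
	}

	/* NULLs sort last: only a forwards scan can run out of matches */
	reqflags = TOY_REQFWD;
	if (is_first_member)
		reqflags |= TOY_REQBKWD;	/* safe, first row member */
	return (member_flags & reqflags) != 0 && !scan_backward;
}
```

With NULLS LAST and a forwards scan, a member carrying only TOY_REQBKWD
ends the scan when it is the first row member, but not when it is a
lower-order member.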

This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.  It is a direct follow-up to today's commit f09816a0.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
Backpatch-through: 18
Peter Geoghegan
2025-07-02 09:48:15 -04:00
parent f09816a0a7
commit bd3f59fdb7
5 changed files with 412 additions and 229 deletions


@@ -792,12 +792,25 @@ _bt_mark_scankey_required(ScanKey skey)
     if (skey->sk_flags & SK_ROW_HEADER)
     {
         ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
+        AttrNumber attno = skey->sk_attno;
         /* First subkey should be same column/operator as the header */
-        Assert(subkey->sk_flags & SK_ROW_MEMBER);
-        Assert(subkey->sk_attno == skey->sk_attno);
+        Assert(subkey->sk_attno == attno);
         Assert(subkey->sk_strategy == skey->sk_strategy);
-        subkey->sk_flags |= addflags;
+        for (;;)
+        {
+            Assert(subkey->sk_flags & SK_ROW_MEMBER);
+            if (subkey->sk_attno != attno)
+                break; /* non-adjacent key, so not required */
+            if (subkey->sk_strategy != skey->sk_strategy)
+                break; /* wrong direction, so not required */
+            subkey->sk_flags |= addflags;
+            if (subkey->sk_flags & SK_ROW_END)
+                break;
+            subkey++;
+            attno++;
+        }
     }
 }
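
The marking loop in this hunk can be tried in isolation with a toy model.
The sketch below is only an illustration under stated assumptions:
ToyMember and toy_mark_row_members_required() are hypothetical names, and
the struct keeps just the fields the loop reads, not the real ScanKeyData
layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a row compare member scan key */
typedef struct ToyMember
{
	int			attno;		/* index attribute number, 1-based */
	int			strategy;	/* comparison strategy of this member */
	bool		row_end;	/* last member of the row compare? */
	bool		required;	/* output: set when marked required */
} ToyMember;

/*
 * Mark the leading run of members that sit on adjacent index attributes and
 * share the header's strategy.  Returns the number of members marked.
 */
static int
toy_mark_row_members_required(ToyMember *members, int header_attno,
							  int header_strategy)
{
	int			attno = header_attno;
	int			nmarked = 0;

	/* first member must be the same column/operator as the header */
	assert(members[0].attno == header_attno);
	assert(members[0].strategy == header_strategy);

	for (ToyMember *m = members;; m++, attno++)
	{
		if (m->attno != attno)
			break;				/* non-adjacent key, so not required */
		if (m->strategy != header_strategy)
			break;				/* wrong direction, so not required */
		m->required = true;
		nmarked++;
		if (m->row_end)
			break;
	}
	return nmarked;
}
```

For a qual shaped like "(a, c) > (1, 42)" on an index over (a, b, c), only
the "a" member is marked; for "(a, b, c) > (...)" all three members are.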


@@ -1016,8 +1016,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
      * traversing a lot of null entries at the start of the scan.
      *
      * In this loop, row-comparison keys are treated the same as keys on their
-     * first (leftmost) columns. We'll add on lower-order columns of the row
-     * comparison below, if possible.
+     * first (leftmost) columns. We'll add all lower-order columns of the row
+     * comparison that were marked required during preprocessing below.
      *
      * _bt_advance_array_keys needs to know exactly how we'll reposition the
      * scan (should it opt to schedule another primitive index scan). It is
@@ -1261,16 +1261,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
     Assert(keysz <= INDEX_MAX_KEYS);
     for (int i = 0; i < keysz; i++)
     {
-        ScanKey cur = startKeys[i];
+        ScanKey bkey = startKeys[i];
-        Assert(cur->sk_attno == i + 1);
+        Assert(bkey->sk_attno == i + 1);
-        if (cur->sk_flags & SK_ROW_HEADER)
+        if (bkey->sk_flags & SK_ROW_HEADER)
         {
             /*
              * Row comparison header: look to the first row member instead
              */
-            ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+            ScanKey subkey = (ScanKey) DatumGetPointer(bkey->sk_argument);
+            bool loosen_strat = false,
+                 tighten_strat = false;
             /*
              * Cannot be a NULL in the first row member: _bt_preprocess_keys
@@ -1278,9 +1280,18 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
              * ever getting this far
              */
             Assert(subkey->sk_flags & SK_ROW_MEMBER);
-            Assert(subkey->sk_attno == cur->sk_attno);
+            Assert(subkey->sk_attno == bkey->sk_attno);
             Assert(!(subkey->sk_flags & SK_ISNULL));
+            /*
+             * This is either a > or >= key (during backwards scans it is
+             * either < or <=) that was marked required during preprocessing.
+             * Later so->keyData[] keys can't have been marked required, so
+             * our row compare header key must be the final startKeys[] entry.
+             */
+            Assert(subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD));
+            Assert(i == keysz - 1);
             /*
              * The member scankeys are already in insertion format (ie, they
              * have sk_func = 3-way-comparison function)
@@ -1288,112 +1299,141 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
/*
* If the row comparison is the last positioning key we accepted,
* try to add additional keys from the lower-order row members.
* (If we accepted independent conditions on additional index
* columns, we use those instead --- doesn't seem worth trying to
* determine which is more restrictive.) Note that this is OK
* even if the row comparison is of ">" or "<" type, because the
* condition applied to all but the last row member is effectively
* ">=" or "<=", and so the extra keys don't break the positioning
* scheme. But, by the same token, if we aren't able to use all
* the row members, then the part of the row comparison that we
* did use has to be treated as just a ">=" or "<=" condition, and
* so we'd better adjust strat_total accordingly.
* Now look to later row compare members.
*
* If there's an "index attribute gap" between two row compare
* members, the second member won't have been marked required, and
* so can't be used as a starting boundary key here. The part of
* the row comparison that we do still use has to be treated as a
* ">=" or "<=" condition. For example, a qual "(a, c) > (1, 42)"
* with an omitted intervening index attribute "b" will use an
* insertion scan key "a >= 1". Even the first "a = 1" tuple on
* the leaf level might satisfy the row compare qual.
*
* We're able to use a _more_ restrictive strategy when we reach a
* NULL row compare member, since they're always unsatisfiable.
* For example, a qual "(a, b, c) >= (1, NULL, 77)" will use an
* insertion scan key "a > 1". All tuples where "a = 1" cannot
* possibly satisfy the row compare qual, so this is safe.
*/
if (i == keysz - 1)
Assert(!(subkey->sk_flags & SK_ROW_END));
for (;;)
{
bool used_all_subkeys = false;
subkey++;
Assert(subkey->sk_flags & SK_ROW_MEMBER);
Assert(!(subkey->sk_flags & SK_ROW_END));
for (;;)
if (subkey->sk_flags & SK_ISNULL)
{
subkey++;
Assert(subkey->sk_flags & SK_ROW_MEMBER);
if (subkey->sk_attno != keysz + 1)
break; /* out-of-sequence, can't use it */
if (subkey->sk_strategy != cur->sk_strategy)
break; /* wrong direction, can't use it */
if (subkey->sk_flags & SK_ISNULL)
break; /* can't use null keys */
Assert(keysz < INDEX_MAX_KEYS);
memcpy(inskey.scankeys + keysz, subkey,
sizeof(ScanKeyData));
keysz++;
if (subkey->sk_flags & SK_ROW_END)
{
used_all_subkeys = true;
break;
}
/*
* NULL member key, can only use earlier keys.
*
* We deliberately avoid checking if this key is marked
* required. All earlier keys are required, and this key
* is unsatisfiable either way, so we can't miss anything.
*/
tighten_strat = true;
break;
}
if (!used_all_subkeys)
if (!(subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)))
{
switch (strat_total)
{
case BTLessStrategyNumber:
strat_total = BTLessEqualStrategyNumber;
break;
case BTGreaterStrategyNumber:
strat_total = BTGreaterEqualStrategyNumber;
break;
}
/* nonrequired member key, can only use earlier keys */
loosen_strat = true;
break;
}
break; /* done with outer loop */
Assert(subkey->sk_attno == keysz + 1);
Assert(subkey->sk_strategy == bkey->sk_strategy);
Assert(keysz < INDEX_MAX_KEYS);
memcpy(inskey.scankeys + keysz, subkey,
sizeof(ScanKeyData));
keysz++;
if (subkey->sk_flags & SK_ROW_END)
break;
}
Assert(!(loosen_strat && tighten_strat));
if (loosen_strat)
{
/* Use less restrictive strategy (and fewer member keys) */
switch (strat_total)
{
case BTLessStrategyNumber:
strat_total = BTLessEqualStrategyNumber;
break;
case BTGreaterStrategyNumber:
strat_total = BTGreaterEqualStrategyNumber;
break;
}
}
if (tighten_strat)
{
/* Use more restrictive strategy (and fewer member keys) */
switch (strat_total)
{
case BTLessEqualStrategyNumber:
strat_total = BTLessStrategyNumber;
break;
case BTGreaterEqualStrategyNumber:
strat_total = BTGreaterStrategyNumber;
break;
}
}
/* done adding to inskey (row comparison keys always come last) */
break;
}
/*
* Ordinary comparison key/search-style key.
*
* Transform the search-style scan key to an insertion scan key by
* replacing the sk_func with the appropriate btree 3-way-comparison
* function.
*
* If scankey operator is not a cross-type comparison, we can use the
* cached comparison function; otherwise gotta look it up in the
* catalogs. (That can't lead to infinite recursion, since no
* indexscan initiated by syscache lookup will use cross-data-type
* operators.)
*
* We support the convention that sk_subtype == InvalidOid means the
* opclass input type; this hack simplifies life for ScanKeyInit().
*/
if (bkey->sk_subtype == rel->rd_opcintype[i] ||
bkey->sk_subtype == InvalidOid)
{
FmgrInfo *procinfo;
procinfo = index_getprocinfo(rel, bkey->sk_attno, BTORDER_PROC);
ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
bkey->sk_flags,
bkey->sk_attno,
InvalidStrategy,
bkey->sk_subtype,
bkey->sk_collation,
procinfo,
bkey->sk_argument);
}
else
{
/*
* Ordinary comparison key. Transform the search-style scan key
* to an insertion scan key by replacing the sk_func with the
* appropriate btree comparison function.
*
* If scankey operator is not a cross-type comparison, we can use
* the cached comparison function; otherwise gotta look it up in
* the catalogs. (That can't lead to infinite recursion, since no
* indexscan initiated by syscache lookup will use cross-data-type
* operators.)
*
* We support the convention that sk_subtype == InvalidOid means
* the opclass input type; this is a hack to simplify life for
* ScanKeyInit().
*/
if (cur->sk_subtype == rel->rd_opcintype[i] ||
cur->sk_subtype == InvalidOid)
{
FmgrInfo *procinfo;
RegProcedure cmp_proc;
procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
cur->sk_flags,
cur->sk_attno,
InvalidStrategy,
cur->sk_subtype,
cur->sk_collation,
procinfo,
cur->sk_argument);
}
else
{
RegProcedure cmp_proc;
cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
rel->rd_opcintype[i],
cur->sk_subtype,
BTORDER_PROC);
if (!RegProcedureIsValid(cmp_proc))
elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
cur->sk_attno, RelationGetRelationName(rel));
ScanKeyEntryInitialize(inskey.scankeys + i,
cur->sk_flags,
cur->sk_attno,
InvalidStrategy,
cur->sk_subtype,
cur->sk_collation,
cmp_proc,
cur->sk_argument);
}
cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
rel->rd_opcintype[i],
bkey->sk_subtype, BTORDER_PROC);
if (!RegProcedureIsValid(cmp_proc))
elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
BTORDER_PROC, rel->rd_opcintype[i], bkey->sk_subtype,
bkey->sk_attno, RelationGetRelationName(rel));
ScanKeyEntryInitialize(inskey.scankeys + i,
bkey->sk_flags,
bkey->sk_attno,
InvalidStrategy,
bkey->sk_subtype,
bkey->sk_collation,
cmp_proc,
bkey->sk_argument);
}
}
@@ -1482,6 +1522,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
     if (!BufferIsValid(so->currPos.buf))
     {
+        Assert(!so->needPrimScan);
         /*
          * We only get here if the index is completely empty. Lock relation
          * because nothing finer to lock exists. Without a buffer lock, it's
@@ -1500,7 +1542,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
         if (!BufferIsValid(so->currPos.buf))
         {
-            Assert(!so->needPrimScan);
             _bt_parallel_done(scan);
             return false;
         }
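
The new _bt_first comment explains when the starting strategy is loosened
(a nonrequired member is reached) or tightened (a NULL member is reached).
A minimal sketch of that decision follows, assuming hypothetical names
(ToyStrategy, ToyRowMember, toy_row_start_keys) and a reduced member struct
rather than the real scan key types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical strategy codes for the sketch: <, <=, =, >=, > */
enum ToyStrategy
{
	TOY_LT = 1, TOY_LE, TOY_EQ, TOY_GE, TOY_GT
};

/* Hypothetical, reduced view of one row compare member */
typedef struct ToyRowMember
{
	bool		is_null;	/* member's argument is NULL */
	bool		required;	/* marked required by preprocessing */
	bool		row_end;	/* last member of the row compare? */
} ToyRowMember;

/*
 * Count how many leading members can serve as starting boundary keys, and
 * adjust *strat when not every member could be used.
 */
static int
toy_row_start_keys(const ToyRowMember *members, enum ToyStrategy *strat)
{
	int			nkeys = 1;	/* the first member is always used */
	bool		loosen = false,
				tighten = false;

	assert(!members[0].is_null && members[0].required);
	assert(!members[0].row_end);

	for (const ToyRowMember *m = members + 1;; m++)
	{
		if (m->is_null)
		{
			tighten = true;		/* unsatisfiable member, use earlier keys */
			break;
		}
		if (!m->required)
		{
			loosen = true;		/* nonrequired member, use earlier keys */
			break;
		}
		nkeys++;
		if (m->row_end)
			break;
	}

	assert(!(loosen && tighten));
	if (loosen)
	{
		if (*strat == TOY_LT)
			*strat = TOY_LE;
		else if (*strat == TOY_GT)
			*strat = TOY_GE;
	}
	if (tighten)
	{
		if (*strat == TOY_LE)
			*strat = TOY_LT;
		else if (*strat == TOY_GE)
			*strat = TOY_GT;
	}
	return nkeys;
}
```

Modelling "(a, c) > (1, 42)" as a required "a" member followed by a
nonrequired "c" member yields one key with strategy >=, matching the
"a >= 1" example in the comment.  Modelling "(a, b, c) >= (1, NULL, 77)"
yields one key with strategy >, matching "a > 1".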


@@ -2442,32 +2442,8 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
         }
         if (key->sk_flags & SK_ROW_HEADER)
         {
-            /*
-             * RowCompare inequality.
-             *
-             * Only the first subkey from a RowCompare can ever be marked
-             * required (that happens when the row header is marked required).
-             * There is no simple, general way for us to transitively deduce
-             * whether or not every tuple on the page satisfies a RowCompare
-             * key based only on firsttup and lasttup -- so we just give up.
-             */
-            if (!start_past_saop_eq && !so->skipScan)
-                break; /* unsafe to go further */
-            /*
-             * We have to be even more careful with RowCompares that come
-             * after an array: we assume it's unsafe to even bypass the array.
-             * Calling _bt_start_array_keys to recover the scan's arrays
-             * following use of forcenonrequired mode isn't compatible with
-             * _bt_check_rowcompare's continuescan=false behavior with NULL
-             * row compare members. _bt_advance_array_keys must not make a
-             * decision on the basis of a key not being satisfied in the
-             * opposite-to-scan direction until the scan reaches a leaf page
-             * where the same key begins to be satisfied in scan direction.
-             * The _bt_first !used_all_subkeys behavior makes this limitation
-             * hard to work around some other way.
-             */
-            return; /* completely unsafe to set pstate.startikey */
+            /* RowCompare inequalities currently aren't supported */
+            break; /* "unsafe" */
         }
         if (key->sk_strategy != BTEqualStrategyNumber)
         {
@@ -2964,76 +2940,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
Assert(subkey->sk_flags & SK_ROW_MEMBER);
if (subkey->sk_attno > tupnatts)
{
/*
* This attribute is truncated (must be high key). The value for
* this attribute in the first non-pivot tuple on the page to the
* right could be any possible value. Assume that truncated
* attribute passes the qual.
*/
Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
subkey++;
continue;
}
datum = index_getattr(tuple,
subkey->sk_attno,
tupdesc,
&isNull);
if (isNull)
{
if (forcenonrequired)
{
/* treating scan's keys as non-required */
}
else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
{
/*
* Since NULLs are sorted before non-NULLs, we know we have
* reached the lower limit of the range of values for this
* index attr. On a backward scan, we can stop if this qual
* is one of the "must match" subset. We can stop regardless
* of whether the qual is > or <, so long as it's required,
* because it's not possible for any future tuples to pass. On
* a forward scan, however, we must keep going, because we may
* have initially positioned to the start of the index.
* (_bt_advance_array_keys also relies on this behavior during
* forward scans.)
*/
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
}
else
{
/*
* Since NULLs are sorted after non-NULLs, we know we have
* reached the upper limit of the range of values for this
* index attr. On a forward scan, we can stop if this qual is
* one of the "must match" subset. We can stop regardless of
* whether the qual is > or <, so long as it's required,
* because it's not possible for any future tuples to pass. On
* a backward scan, however, we must keep going, because we
* may have initially positioned to the end of the index.
* (_bt_advance_array_keys also relies on this behavior during
* backward scans.)
*/
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsForward(dir))
*continuescan = false;
}
/*
* In any case, this indextuple doesn't match the qual.
*/
return false;
}
/* When a NULL row member is compared, the row never matches */
if (subkey->sk_flags & SK_ISNULL)
{
/*
@@ -3058,6 +2965,114 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
return false;
}
if (subkey->sk_attno > tupnatts)
{
/*
* This attribute is truncated (must be high key). The value for
* this attribute in the first non-pivot tuple on the page to the
* right could be any possible value. Assume that truncated
* attribute passes the qual.
*/
Assert(BTreeTupleIsPivot(tuple));
return true;
}
datum = index_getattr(tuple,
subkey->sk_attno,
tupdesc,
&isNull);
if (isNull)
{
int reqflags;
if (forcenonrequired)
{
/* treating scan's keys as non-required */
}
else if (subkey->sk_flags & SK_BT_NULLS_FIRST)
{
/*
* Since NULLs are sorted before non-NULLs, we know we have
* reached the lower limit of the range of values for this
* index attr. On a backward scan, we can stop if this qual
* is one of the "must match" subset. However, on a forwards
* scan, we must keep going, because we may have initially
* positioned to the start of the index.
*
* All required NULLS FIRST > row members can use NULL tuple
* values to end backwards scans, just like with other values.
* A qual "WHERE (a, b, c) > (9, 42, 'foo')" can terminate a
* backwards scan upon reaching the index's rightmost "a = 9"
* tuple whose "b" column contains a NULL (if not sooner).
* Since "b" is NULLS FIRST, we can treat its NULLs as "<" 42.
*/
reqflags = SK_BT_REQBKWD;
/*
* When a most significant required NULLS FIRST < row compare
* member sees NULL tuple values during a backwards scan, it
* signals the end of matches for the whole row compare/scan.
* A qual "WHERE (a, b, c) < (9, 42, 'foo')" will terminate a
* backwards scan upon reaching the rightmost tuple whose "a"
* column has a NULL. The "a" NULL value is "<" 9, and yet
* our < row compare will still end the scan. (This isn't
* safe with later/lower-order row members. Notice that it
* can only happen with an "a" NULL some time after the scan
* completely stops needing to use its "b" and "c" members.)
*/
if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
reqflags |= SK_BT_REQFWD; /* safe, first row member */
if ((subkey->sk_flags & reqflags) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
}
else
{
/*
* Since NULLs are sorted after non-NULLs, we know we have
* reached the upper limit of the range of values for this
* index attr. On a forward scan, we can stop if this qual is
* one of the "must match" subset. However, on a backward
* scan, we must keep going, because we may have initially
* positioned to the end of the index.
*
* All required NULLS LAST < row members can use NULL tuple
* values to end forwards scans, just like with other values.
* A qual "WHERE (a, b, c) < (9, 42, 'foo')" can terminate a
* forwards scan upon reaching the index's leftmost "a = 9"
* tuple whose "b" column contains a NULL (if not sooner).
* Since "b" is NULLS LAST, we can treat its NULLs as ">" 42.
*/
reqflags = SK_BT_REQFWD;
/*
* When a most significant required NULLS LAST > row compare
* member sees NULL tuple values during a forwards scan, it
* signals the end of matches for the whole row compare/scan.
* A qual "WHERE (a, b, c) > (9, 42, 'foo')" will terminate a
* forwards scan upon reaching the leftmost tuple whose "a"
* column has a NULL. The "a" NULL value is ">" 9, and yet
* our > row compare will end the scan. (This isn't safe with
* later/lower-order row members. Notice that it can only
* happen with an "a" NULL some time after the scan completely
* stops needing to use its "b" and "c" members.)
*/
if (subkey == (ScanKey) DatumGetPointer(skey->sk_argument))
reqflags |= SK_BT_REQBKWD; /* safe, first row member */
if ((subkey->sk_flags & reqflags) &&
ScanDirectionIsForward(dir))
*continuescan = false;
}
/*
* In any case, this indextuple doesn't match the qual.
*/
return false;
}
/* Perform the test --- three-way comparison not bool operator */
cmpresult = DatumGetInt32(FunctionCall2Coll(&subkey->sk_func,
subkey->sk_collation,