postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-09 06:21:09 +03:00

Author	SHA1	Message	Date
Peter Geoghegan	454c046094	nbtree: Always set skipScan flag on rescan. The TimescaleDB extension expects to be able to change an nbtree scan's keys across rescans. The issue arises in the extension's implementation of loose index scan. This is arguably a misuse of the index AM API, though apparently it worked until recently. It stopped working when the skipScan flag was added to BTScanOpaqueData by commit `8a510275`, though. The flag wouldn't reliably track whether the scan (actually, the current rescan) has any skip arrays, leading to confusion in _bt_set_startikey. nbtree preprocessing will now defensively initialize the scan's skipScan flag in all cases, including the case where _bt_preprocess_array_keys returns early due to the (re)scan not using arrays. While nbtree isn't obligated to support this use case (at least not according to my reading of the index AM API), it still seems like a good idea to be consistent here, on general robustness grounds. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Natalya Aksman <natalya@timescale.com> Discussion: https://postgr.es/m/CAJumhcirfMojbk20+W0YimbNDkwdECvJprQGQ-XqK--ph09nQw@mail.gmail.com Backpatch-through: 18	2025-09-13 21:01:33 -04:00
Nathan Bossart	7e9c216b52	Re-pgindent nbtpreprocesskeys.c after commit `796962922e`. Backpatch-through: 18	2025-09-13 14:50:02 -05:00
Peter Geoghegan	796962922e	Always commute strategy when preprocessing DESC keys. A recently added nbtree preprocessing step failed to account for the fact that DESC columns already had their B-Tree strategy number commuted at this point in preprocessing. As a result, preprocessing could output a set of scan keys where one or more keys had the correct strategy number, but used the wrong comparison routine. To fix, make the faulty code path that looks up a more restrictive replacement operator/comparison routine commute its requested inequality strategy (while outputting the transformed strategy number as before). This makes the final transformed scan key comport with the approach preprocessing has always used to deal with DESC columns (which is described by comments above _bt_fix_scankey_strategy). Oversight in commit commit `b3f1a13f`, which made nbtree preprocessing perform transformations on skip array inequalities that can reduce the total number of index searches. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Natalya Aksman <natalya@timescale.com> Discussion: https://postgr.es/m/19049-b7df801e71de41b2@postgresql.org Backpatch-through: 18	2025-09-12 13:23:00 -04:00
Álvaro Herrera	d0e7e04ede	Avoid including tableam.h and xlogreader.h in nbtree.h Doing that seems rather random and unnecessary. This commit removes those and fixes fallout, which is pretty minimal. We do need to add a forward declaration of struct TM_IndexDeleteOp (whose full definition appears in tableam.h) so that _bt_delitems_delete_check()'s declaration can use it. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql	2025-08-14 17:48:46 +02:00
Peter Geoghegan	9ca30a0b04	Update obsolete row compare preprocessing comments. Restore nbtree preprocessing comments describing how we mark nbtree row compare members required to how they were prior to 2016 bugfix commit `a298a1e0`. Oversight in commit `bd3f59fd`, which made nbtree preprocessing revert to the original 2006 rules, but neglected to revert these comments. Backpatch-through: 18	2025-07-02 12:36:35 -04:00
Peter Geoghegan	bd3f59fdb7	Make row compares robust during nbtree array scans. Recent nbtree bugfix commit `5f4d98d4` added a special case to the code that sets up a page-level prefix of keys that are definitely satisfied by every tuple on the page: whenever _bt_set_startikey reached a row compare key, we'd refuse to apply the pstate.forcenonrequired behavior in scans where that usually happens (scans with a higher-order array key). That hack made the scan avoid essentially the same infinite cycling behavior that also affected nbtree scans with redundant keys (keys that preprocessing could not eliminate) prior to commit `f09816a0`. There are now serious doubts about this row compare workaround. Testing has shown that a scan with a row compare key and an array key could still read the same leaf page twice (without the scan's direction changing), which isn't supposed to be possible following the SAOP enhancements added by Postgres 17 commit `5bf748b8`. Also, we still allowed a required row compare key to be used with forcenonrequired mode when its header key happened to be beyond the pstate.ikey set by _bt_set_startikey, which was complicated and brittle. The underlying problem was that row compares had inconsistent rules around how scans start (which keys can be used for initial positioning purposes) and how scans end (which keys can set continuescan=false). Quals with redundant keys that could not be eliminated by preprocessing also had that same quality to them prior to today's bugfix `f09816a0`. It now seems prudent to bring row compare keys in line with the new charter for required keys, by making the start and end rules symmetric. This commit fixes two points of disagreement between _bt_first and _bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of ending the scan at the point where it needed to compare an ISNULL-marked row compare member that came immediately after a required row compare member. _bt_first now has symmetric handling for NULL row compares. Secondly, _bt_first had its own ideas about which keys were safe to use for initial positioning purposes. It could use fewer or more keys than _bt_check_rowcompare. _bt_first now uses the same requiredness markings as _bt_check_rowcompare for this. Now that _bt_first and _bt_check_rowcompare agree on how to start and end scans, we can get rid of the forcenonrequired special case, without any risk of infinite cycling. This approach also makes row compare keys behave more like regular scalar keys, particularly within _bt_first. Fixing these inconsistencies necessitates dealing with a related issue with the way that row compares were marked required by preprocessing: we didn't mark any lower-order row members required following 2016 bugfix commit `a298a1e0`. That approach was over broad. The bug in question was actually an oversight in how _bt_check_rowcompare dealt with tuple NULL values that failed to satisfy a scan key marked required in the opposite scan direction (it was a bug in 2011 commits `6980f817` and `882368e8`, not a bug in 2006 commit `3a0a16cb`). Go back to marking row compare members as required using the original 2006 rules, and fix the 2016 bug in a more principled way: by limiting use of the "set continuescan=false with a key required in the opposite scan direction upon encountering a NULL tuple value" optimization to the first/most significant row member key. While it isn't safe to use an implied IS NOT NULL qualifier to end the scan when it comes from a required lower-order row compare member key, it _is_ generally safe for such a required member key to end the scan -- provided the key is marked required in the _current_ scan direction. This fixes what was arguably an oversight in either commit `5f4d98d4` or commit `8a510275`. It is a direct follow-up to today's commit `f09816a0`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com Backpatch-through: 18	2025-07-02 09:48:15 -04:00
Peter Geoghegan	f09816a0a7	Make handling of redundant nbtree keys more robust. nbtree preprocessing's handling of redundant (and contradictory) keys created problems for scans with = arrays. It was just about possible for a scan with an = array key and one or more redundant keys (keys that preprocessing could not eliminate due an incomplete opfamily and a cross-type key) to get stuck. Testing has shown that infinite cycling where the scan never manages to make forward progress was possible. This could happen when the scan's arrays were reset in _bt_readpage's forcenonrequired=true path (added by bugfix commit `5f4d98d4`) when the arrays weren't at least advanced up to the same point that they were in at the start of the _bt_readpage call. Earlier redundant keys prevented the finaltup call to _bt_advance_array_keys from reaching lower-order keys that needed to be used to sufficiently advance the scan's arrays. To fix, make preprocessing leave the scan's keys in a state that is as close as possible to how it'll usually leave them (in the common case where there's no redundant keys that preprocessing failed to eliminate). Now nbtree preprocessing _reliably_ leaves behind at most one required >/>= key per index column, and at most one required </<= key per index column. Columns that have one or more = keys that are eligible to be marked required (based on the traditional rules) prioritize the = keys over redundant inequality keys; they'll _reliably_ be left with only one of the = keys as the index column's only required key. Keys that are not marked required (whether due to the new preprocessing step running or for some other reason) are relocated to the end of the so->keyData[] array as needed. That way they'll always be evaluated after the scan's required keys, and so cannot prevent code in places like _bt_advance_array_keys and _bt_first from reaching a required key. Also teach _bt_first to decide which initial positioning keys to use based on the same requiredness markings that have long been used by _bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for reliably avoiding infinite cycling. _bt_advance_array_keys expects to be able to reason about what'll happen in the next _bt_first call should it start another primitive index scan, by evaluating inequality keys that were marked required in the opposite-to-scan scan direction only. Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys) will always agree on which exact key will be used on each index column to start and/or end the scan (except when row compare keys are involved, which have similar problems not addressed by this commit). An upcoming commit will finish off the work started by this commit by harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys apply row compare keys to start and end scans. This fixes what was arguably an oversight in either commit `5f4d98d4` or commit `8a510275`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com Backpatch-through: 18	2025-07-02 09:40:49 -04:00
Peter Geoghegan	9d924dbb37	Adjust overstrong nbtree skip array assertion. Make an nbtree array preprocessing assertion account for scans that add fewer skip arrays than initially expected due to preprocessing finding an unsatisfiable array qual. Oversight in commit `92fe23d9`. Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/CAHgHdKtQMhHy5qcB3KqCcGiW-Rp8P7KzUFRa9ZMKUiv6zen7LQ@mail.gmail.com	2025-04-30 23:15:51 -04:00
David Rowley	84fd3bc141	Fix a few duplicate words in comments These are all new to v18 Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com	2025-04-21 10:41:18 +12:00
Peter Geoghegan	b3f1a13f22	Avoid extra index searches through preprocessing. Transform low_compare and high_compare nbtree skip array inequalities (with opclasses that offer skip support) in such a way as to allow _bt_first to consistently apply later keys when it descends the tree. This can lower the number of index searches for multi-column scans that use a ">" key on one of the index's prefix columns (or use a "<" key, when scanning backwards) when it precedes some later lower-order key. For example, an index qual "WHERE a > 5 AND b = 2" will now be converted to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place after low_compare and high_compare have been finalized. That way, the initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an initial position, rather than just using "WHERE a > 5" -- "b = 2" can be applied during every _bt_first call. There's a decent chance that this will allow such a scan to avoid the extra search that might otherwise be needed to determine the lowest "a" value still satisfying "WHERE a > 5". The transformation process can only lower the total number of index pages read when the use of a more restrictive set of initial positioning keys in _bt_first actually allows the scan to land on some later leaf page directly, relative to the unoptimized case (or on an earlier leaf page directly, when scanning backwards). But the savings can really add up in cases where an affected skip array comes after some other array. For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can save as many as 3 _bt_first calls by applying the new transformation to its "y" array (up to 1 extra search can be avoided per "x" element). Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com	2025-04-04 14:14:08 -04:00
Peter Geoghegan	8a510275dd	Further optimize nbtree search scan key comparisons. Postgres 17 commit `e0b1ee17` added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only. The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys). The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization. Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page. Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages"). Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant. Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value. However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance. Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com	2025-04-04 12:27:52 -04:00
Peter Geoghegan	92fe23d93a	Add nbtree skip scan optimization. Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit `5bf748b8`. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com	2025-04-04 12:27:04 -04:00
Peter Geoghegan	67fc4c9fd7	Make parallel nbtree index scans use an LWLock. Teach parallel nbtree index scans to use an LWLock (not a spinlock) to protect the scan's shared descriptor state. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That patch will create the need to occasionally allocate memory while the scan descriptor is locked, while copying datums that were serialized by another backend. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-08 11:10:14 -05:00
Peter Geoghegan	901bd4a65a	Fix nbtree contradictory array element comment. Oversight in commit `5bf748b8`, which enhanced nbtree ScalarArrayOp execution.	2025-01-16 11:26:41 -05:00
Peter Geoghegan	597b1ffbf1	Move nbtree preprocessing into new .c file. Quite a bit of code within nbtutils.c is only called during nbtree preprocessing. Move that code into a new .c file, nbtpreprocesskeys.c. Also reorder some of the functions within the new file for clarity. This commit has no functional impact. It is strictly mechanical. Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAH2-WznwNn1BDOpWxHBUK1f3Rdw8pO9UCenWXnvT=n9GO8GnLA@mail.gmail.com Discussion: https://postgr.es/m/86930045-5df5-494a-b4f1-815bc3fbcce0%40iki.fi	2025-01-13 12:15:00 -05:00

15 Commits