postgres

mirror of https://github.com/postgres/postgres.git synced 2025-08-31 17:02:12 +03:00

Author	SHA1	Message	Date
Peter Geoghegan	7505da2f45	Reverse order of newitem nbtree candidate splits. Commit `fab25024`, which taught nbtree to choose candidate split points more carefully, had _bt_findsplitloc() record all possible split points in an initial pass over a page that is about to be split. The order that candidate split points were processed and stored in was assumed to match the offset number order of split points on an imaginary version of the page that contains the same items as the original, but also fits newitem (the item that provoked the split precisely because it didn't fit). However, the order of split points in the final array was not quite what was expected: the split point that makes newitem the firstright item came after the split point that makes newitem the lastleft item -- not before. As a result, _bt_findsplitloc() could get confused about the leftmost and rightmost tuples among all possible split points recorded for the page. This seems to have no appreciable impact on the quality of the final split point chosen by _bt_findsplitloc(), but it's still wrong. To fix, switch the order in which newitem candidate splits are recorded in. This also makes it possible to describe candidate split points in terms of which pair of adjoining tuples enclose the split point within _bt_findsplitloc(), making it clearer why it's generally safe for _bt_split() to expect lastleft and firstright tuples.	2019-05-15 12:22:07 -07:00
Peter Geoghegan	ae7291acbc	Standardize ItemIdData terminology. The term "item pointer" should not be used to refer to ItemIdData variables, since that is needlessly ambiguous. Only ItemPointerData/ItemPointer variables should be called item pointers. To fix, establish the convention that ItemIdData variables should always be referred to either as "item identifiers" or "line pointers". The term "item identifier" already predominates in docs and translatable messages, and so should be the preferred alternative there. Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com	2019-05-13 15:53:39 -07:00
Peter Geoghegan	9b42e71376	Don't leave behind junk nbtree pages during split. Commit `8fa30f906b` reduced the elevel of a number of "can't happen" _bt_split() errors from PANIC to ERROR. At the same time, the new right page buffer for the split could continue to be acquired well before the critical section. This was possible because it was relatively straightforward to make sure that _bt_split() could not throw an error, with a few specific exceptions. The exceptional cases were safe because they involved specific, well understood errors, making it possible to consistently zero the right page before actually raising an error using elog(). There was no danger of leaving around a junk page, provided _bt_split() stuck to this coding rule. Commit `8224de4f`, which introduced INCLUDE indexes, added code to make _bt_split() truncate away non-key attributes. This happened at a point that broke the rule around zeroing the right page in _bt_split(). If truncation failed (perhaps due to palloc() failure), that would result in an errant right page buffer with junk contents. This could confuse VACUUM when it attempted to delete the page, and should be avoided on general principle. To fix, reorganize _bt_split() so that truncation occurs before the new right page buffer is even acquired. A junk page/buffer will not be left behind if _bt_nonkey_truncate()/_bt_truncate() raise an error. Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com Backpatch: 11-, where INCLUDE indexes were introduced.	2019-05-13 10:27:59 -07:00
Peter Geoghegan	d95e36dc38	Remove obsolete nbtree split REDO routine comment. Commit `dd299df818`, which added suffix truncation to nbtree, simplified the WAL record format used by page splits. It became necessary to explicitly WAL-log the new high key for the left half of a split in all cases, which relieved the REDO routine from having to reconstruct a new high key for the left page by copying the first item from the right page. Remove a comment that referred to the previous practice.	2019-05-08 12:47:20 -07:00
Peter Geoghegan	d65b5ccad6	Correct obsolete nbtsort.c minimum key comment. It is no longer possible under any circumstances for nbtree code to reconstruct a strict lower bound key (parent page's pivot tuple key) for a right sibling page by retrieving the first item in the right sibling page.	2019-05-07 21:42:12 -07:00
Peter Geoghegan	7b37f4b02e	Correct more obsolete nbtree page split comments. Commit `3f342839` corrected obsolete comments about buffer locks at the main _bt_insert_parent() call site, but missed similar obsolete comments above _bt_insert_parent() itself. Both sets of comments were rendered obsolete by commit `40dae7ec53`, which made the nbtree page split algorithm more robust. Fix the comments that were missed the first time around now. In passing, refine a related _bt_insert_parent() comment about re-finding the parent page to insert new downlink.	2019-05-03 13:34:45 -07:00
Peter Geoghegan	6dd86c269d	Fix nbtsort.c's page space accounting. Commit `dd299df818`, which made heap TID a tiebreaker nbtree index column, introduced new rules on page space management to make suffix truncation safe. In general, suffix truncation needs to have a small amount of extra space available on the new left page when splitting a leaf page. This is needed in case it turns out that truncation cannot even "truncate away the heap TID column", resulting in a larger-than-firstright leaf high key with an explicit heap TID representation. Despite all this, CREATE INDEX/nbtsort.c did not account for the possible need for extra heap TID space on leaf pages when deciding whether or not a new item could fit on current page. This could lead to "failed to add item to the index page" errors when CREATE INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a larger-than-firstright leaf high key (it only had space for firstright tuple, which was just short of what was needed following "truncation"). Several conditions needed to be met all at once for CREATE INDEX to fail. The problem was in the hard limit on what will fit on a page, which tends to be masked by the soft fillfactor-wise limit. The easiest way to recreate the problem seems to be a CREATE INDEX on a low cardinality text column, with tuples that are of non-uniform width, using a fillfactor of 100. To fix, bring nbtsort.c in line with nbtsplitloc.c, which already pessimistically assumes that all leaf page splits will have high keys that have a heap TID appended. Reported-By: Andreas Joseph Krogh Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena	2019-05-02 12:33:35 -07:00
Alvaro Herrera	9a83afecb7	Widen tuple counter variables from long to int64 Mistake in ab0dfc961b6a; progress reporting would have wrapped around for indexes created with more than 2^31 tuples. Reported-by: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com	2019-04-30 10:27:38 -04:00
Peter Geoghegan	9ee7414ed0	Remove obsolete _bt_insert_parent() comment. Remove a comment that refers to a coding practice that was fully removed by commit `a8b8f4db`, which introduced MarkBufferDirty(). It looks like the comment was even obsolete before then, since it concerns write-ordering dependencies with synchronous buffer writes.	2019-04-29 14:14:38 -07:00
Peter Geoghegan	9b10926263	Prevent O(N^2) unique index insertion edge case. Commit `dd299df8` made nbtree treat heap TID as a tiebreaker column, establishing the principle that there is only one correct location (page and page offset number) for every index tuple, no matter what. Insertions of tuples into non-unique indexes proceed as if heap TID (scan key's scantid) is just another user-attribute value, but insertions into unique indexes are more delicate. The TID value in scantid must initially be omitted to ensure that the unique index insertion visits every leaf page that duplicates could be on. The scantid is set once again after unique checking finishes successfully, which can force _bt_findinsertloc() to step right one or more times, to locate the leaf page that the new tuple must be inserted on. Stepping right within _bt_findinsertloc() was assumed to occur no more frequently than stepping right within _bt_check_unique(), but there was one important case where that assumption was incorrect: inserting a "duplicate" with NULL values. Since _bt_check_unique() didn't do any real work in this case, it wasn't appropriate for _bt_findinsertloc() to behave as if it was finishing off a conventional unique insertion, where any existing physical duplicate must be dead or recently dead. _bt_findinsertloc() might have to grovel through a substantial portion of all of the leaf pages in the index to insert a single tuple, even when there were no dead tuples. To fix, treat insertions of tuples with NULLs into a unique index as if they were insertions into a non-unique index: never unset scantid before calling _bt_search() to descend the tree, and bypass _bt_check_unique() entirely. _bt_check_unique() is no longer responsible for incoming tuples with NULL values. Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com	2019-04-23 10:33:57 -07:00
Alexander Korotkov	1e87198182	Fix division by zero in _bt_vacuum_needs_cleanup() Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when metad->btm_last_cleanup_num_heap_tuples == 0. This commit adjusts the expression so that no division by zero might happen. Reported-by: Piotr Stefaniak Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com Reviewed-by: Masahiko Sawada Backpatch-through: 11	2019-04-15 20:20:43 +03:00
Peter Geoghegan	74eb2176bf	Invalidate binary search bounds consistently. _bt_check_unique() failed to invalidate binary search bounds in the event of a live conflict following commit `e5adcb78`. This resulted in problems after waiting for the conflicting xact to commit or abort. The subsequent call to _bt_check_unique() would restore the initial binary search bounds, rather than starting a new search. Fix by explicitly invalidating bounds when it becomes clear that there is a live conflict that insertion will have to wait to resolve. Ashutosh Sharma, with a few additional tweaks by me. Author: Ashutosh Sharma Reported-By: Ashutosh Sharma Diagnosed-By: Ashutosh Sharma Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com	2019-04-04 09:38:08 -07:00
Alvaro Herrera	ab0dfc961b	Report progress of CREATE INDEX operations This uses the progress reporting infrastructure added by `c16dc1aca5`, adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY. There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate for btrees, including reportage for parallel index builds and the separate phases that btree index creation uses; other index AMs, which are much simpler in their building procedures, have simplistic reporting only, but that seems sufficient, at least for non-concurrent builds. The index-AM-agnostic part is fairly complete, providing insight into the CONCURRENTLY wait phases as well as block-based progress during the index validation table scan. (The index validation index scan requires patching each AM, which has not been included here.) Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql	2019-04-02 15:18:08 -03:00
Andres Freund	4bb50236eb	tableam: Formatting and other minor cleanups. The superflous heapam_xlog.h includes were reported by Peter Geoghegan.	2019-03-31 18:16:53 -07:00
Peter Geoghegan	76a39f2295	Fix nbtree high key "continuescan" row compare bug. Commit `29b64d1d` mishandled skipping over truncated high key attributes during row comparisons. The row comparison key matching loop would loop forever when a truncated attribute was encountered for a row compare subkey. Fix by following the example of other code in the loop: advance the current subkey, or break out of the loop when the last subkey is reached. Add test coverage for the relevant _bt_check_rowcompare() code path. The new test case is somewhat tied to nbtree implementation details, which isn't ideal, but seems unavoidable.	2019-03-31 17:24:04 -07:00
Peter Geoghegan	9c7fb7e6d8	Tweak some nbtree-related code comments.	2019-03-29 12:29:05 -07:00
Andres Freund	2a96909a4a	tableam: Support for an index build's initial table scan(s). To support building indexes over tables of different AMs, the scans to do so need to be routed through the table AM. While moving a fair amount of code, nearly all the changes are just moving code to below a callback. Currently the range based interface wouldn't make much sense for non block based table AMs. But that seems aceptable for now. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-03-27 19:59:06 -07:00
Andres Freund	558a9165e0	Compute XID horizon for page level index vacuum on primary. Previously the xid horizon was only computed during WAL replay. That had two major problems: 1) It relied on knowing what the table pointed to looks like. That was easy enough before the introducing of tableam (we knew it had to be heap, although some trickery around logging the heap relfilenodes was required). But to properly handle table AMs we need per-database catalog access to look up the AM handler, which recovery doesn't allow. 2) Not knowing the xid horizon also makes it hard to support logical decoding on standbys. When on a catalog table, we need to be able to conflict with slots that have an xid horizon that's too old. But computing the horizon by visiting the heap only works once consistency is reached, but we always need to be able to detect conflicts. There's also a secondary problem, in that the current method performs redundant work on every standby. But that's counterbalanced by potentially computing the value when not necessary (either because there's no standby, or because there's no connected backends). Solve 1) and 2) by moving computation of the xid horizon to the primary and by involving tableam in the computation of the horizon. To address the potentially increased overhead, increase the efficiency of the xid horizon computation for heap by sorting the tids, and eliminating redundant buffer accesses. When prefetching is available, additionally perform prefetching of buffers. As this is more of a maintenance task, rather than something routinely done in every read only query, we add an arbitrary 10 to the effective concurrency - thereby using IO concurrency, when not globally enabled. That's possibly not the perfect formula, but seems good enough for now. Bumps WAL format, as latestRemovedXid is now part of the records, and the heap's relfilenode isn't anymore. Author: Andres Freund, Amit Khandekar, Robert Haas Reviewed-By: Robert Haas Discussion: https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-03-26 16:52:54 -07:00
Andres Freund	71bdc99d0d	tableam: Add helper for indexes to check if a corresponding table tuples exist. This is, likely exclusively, useful to verify that conflicts detected in a unique index are with live tuples, rather than dead ones. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-03-25 16:52:55 -07:00
Peter Geoghegan	f21668f328	Add "split after new tuple" nbtree optimization. Add additional heuristics to the algorithm for locating an optimal split location. New logic identifies localized monotonically increasing values in indexes with multiple columns. When this insertion pattern is detected, page splits split just after the new item that provoked a page split (or apply leaf fillfactor in the style of a rightmost page split). This optimization is a variation of the long established leaf fillfactor optimization used during rightmost page splits. 50/50 page splits are only appropriate with a pattern of truly random insertions, where the average space utilization ends up at 65% - 70%. Without this patch, affected cases have leaf pages that are no more than about 50% full on average. Future insertions can never make use of the free space left behind. With this patch, affected cases have leaf pages that are about 90% full on average (assuming a fillfactor of 90). Localized monotonically increasing insertion patterns are presumed to be fairly common in real-world applications. There is a fair amount of anecdotal evidence for this. Both pg_depend system catalog indexes (pg_depend_depender_index and pg_depend_reference_index) are at least 20% smaller after the regression tests are run when the optimization is available. Furthermore, many of the indexes created by a fair use implementation of TPC-C for Postgres are consistently about 40% smaller when the optimization is available. Note that even pg_upgrade'd v3 indexes make use of this optimization. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com	2019-03-25 09:44:25 -07:00
Peter Geoghegan	59ab3be9e4	Remove dead code from nbtsplitloc.c. It doesn't make sense to consider the possibility that there will only be one candidate split point when choosing among split points to find the split with the lowest penalty. This is a vestige of an earlier version of the patch that became commit `fab25024`. Issue spotted while rereviewing coverage of the nbtree patch series using gcov.	2019-03-24 12:28:58 -07:00
Peter Geoghegan	29b64d1de7	Add nbtree high key "continuescan" optimization. Teach nbtree forward index scans to check the high key before moving to the right sibling page in the hope of finding that it isn't actually necessary to do so. The new check may indicate that the scan definitely cannot find matching tuples to the right, ending the scan immediately. We already opportunistically force a similar "continuescan orientated" key check of the final non-pivot tuple when it's clear that it cannot be returned to the scan due to being dead-to-all. The new high key check is complementary. The new approach for forward scans is more effective than checking the final non-pivot tuple, especially with composite indexes and non-unique indexes. The improvements to the logic for picking a split point added by commit `fab25024` make it likely that relatively dissimilar high keys will appear on a page. A distinguishing key value that can only appear on non-pivot tuples on the right sibling page will often be present in leaf page high keys. Since forcing the final item to be key checked no longer makes any difference in the case of forward scans, the existing extra key check is now only used for backwards scans. Backward scans continue to opportunistically check the final non-pivot tuple, which is actually the first non-pivot tuple on the page (not the last). Note that even pg_upgrade'd v3 indexes make use of this optimization. Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com	2019-03-23 11:01:53 -07:00
Peter Geoghegan	3d0dcc5c7f	Fix spurious compiler warning in nbtxlog.c. Cleanup from commit `dd299df8`. Per complaint from Tom Lane.	2019-03-20 14:04:35 -07:00
Peter Geoghegan	fab2502433	Consider secondary factors during nbtree splits. Teach nbtree to give some consideration to how "distinguishing" candidate leaf page split points are. This should not noticeably affect the balance of free space within each half of the split, while still making suffix truncation truncate away significantly more attributes on average. The logic for choosing a leaf split point now uses a fallback mode in the case where the page is full of duplicates and it isn't possible to find even a minimally distinguishing split point. When the page is full of duplicates, the split should pack the left half very tightly, while leaving the right half mostly empty. Our assumption is that logical duplicates will almost always be inserted in ascending heap TID order with v4 indexes. This strategy leaves most of the free space on the half of the split that will likely be where future logical duplicates of the same value need to be placed. The number of cycles added is not very noticeable. This is important because deciding on a split point takes place while at least one exclusive buffer lock is held. We avoid using authoritative insertion scankey comparisons to save cycles, unlike suffix truncation proper. We use a faster binary comparison instead. Note that even pg_upgrade'd v3 indexes make use of these optimizations. Benchmarking has shown that even v3 indexes benefit, despite the fact that suffix truncation will only truncate non-key attributes in INCLUDE indexes. Grouping relatively similar tuples together is beneficial in and of itself, since it reduces the number of leaf pages that must be accessed by subsequent index scans. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com	2019-03-20 10:12:19 -07:00
Peter Geoghegan	dd299df818	Make heap TID a tiebreaker nbtree index column. Make nbtree treat all index tuples as having a heap TID attribute. Index searches can distinguish duplicates by heap TID, since heap TID is always guaranteed to be unique. This general approach has numerous benefits for performance, and is prerequisite to teaching VACUUM to perform "retail index tuple deletion". Naively adding a new attribute to every pivot tuple has unacceptable overhead (it bloats internal pages), so suffix truncation of pivot tuples is added. This will usually truncate away the "extra" heap TID attribute from pivot tuples during a leaf page split, and may also truncate away additional user attributes. This can increase fan-out, especially in a multi-column index. Truncation can only occur at the attribute granularity, which isn't particularly effective, but works well enough for now. A future patch may add support for truncating "within" text attributes by generating truncated key values using new opclass infrastructure. Only new indexes (BTREE_VERSION 4 indexes) will have insertions that treat heap TID as a tiebreaker attribute, or will have pivot tuples undergo suffix truncation during a leaf page split (on-disk compatibility with versions 2 and 3 is preserved). Upgrades to version 4 cannot be performed on-the-fly, unlike upgrades from version 2 to version 3. contrib/amcheck continues to work with version 2 and 3 indexes, while also enforcing stricter invariants when verifying version 4 indexes. These stricter invariants are the same invariants described by "3.1.12 Sequencing" from the Lehman and Yao paper. A later patch will enhance the logic used by nbtree to pick a split point. This patch is likely to negatively impact performance without smarter choices around the precise point to split leaf pages at. Making these two mostly-distinct sets of enhancements into distinct commits seems like it might clarify their design, even though neither commit is particularly useful on its own. The maximum allowed size of new tuples is reduced by an amount equal to the space required to store an extra MAXALIGN()'d TID in a new high key during leaf page splits. The user-facing definition of the "1/3 of a page" restriction is already imprecise, and so does not need to be revised. However, there should be a compatibility note in the v12 release notes. Author: Peter Geoghegan Reviewed-By: Heikki Linnakangas, Alexander Korotkov Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com	2019-03-20 10:04:01 -07:00
Peter Geoghegan	e5adcb789d	Refactor nbtree insertion scankeys. Use dedicated struct to represent nbtree insertion scan keys. Having a dedicated struct makes the difference between search type scankeys and insertion scankeys a lot clearer, and simplifies the signature of several related functions. This is based on a suggestion by Andrey Lepikhov. Streamline how unique index insertions cache binary search progress. Cache the state of in-progress binary searches within _bt_check_unique() for later instead of having callers avoid repeating the binary search in an ad-hoc manner. This makes it easy to add a new optimization: _bt_check_unique() now falls out of its loop immediately in the common case where it's already clear that there couldn't possibly be a duplicate. The new _bt_check_unique() scheme makes it a lot easier to manage cached binary search effort afterwards, from within _bt_findinsertloc(). This is needed for the upcoming patch to make nbtree tuples unique by treating heap TID as a final tiebreaker column. Unique key binary searches need to restore lower and upper bounds. They cannot simply continue to use the >= lower bound as the offset to insert at, because the heap TID tiebreaker column must be used in comparisons for the restored binary search (unlike the original _bt_check_unique() binary search, where scankey's heap TID column must be omitted). Author: Peter Geoghegan, Heikki Linnakangas Reviewed-By: Heikki Linnakangas, Andrey Lepikhov Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com	2019-03-20 09:30:57 -07:00
Peter Geoghegan	1009920aaa	Tweak nbtsearch.c function prototype order. nbtsearch.c's static function prototypes were slightly out of order. Make the order consistent with static function definition order.	2019-03-19 09:59:05 -07:00
Thomas Munro	bb16aba50c	Enable parallel query with SERIALIZABLE isolation. Previously, the SERIALIZABLE isolation level prevented parallel query from being used. Allow the two features to be used together by sharing the leader's SERIALIZABLEXACT with parallel workers. An extra per-SERIALIZABLEXACT LWLock is introduced to make it safe to share, and new logic is introduced to coordinate the early release of the SERIALIZABLEXACT required for the SXACT_FLAG_RO_SAFE optimization, as follows: The first backend to observe the SXACT_FLAG_RO_SAFE flag (set by some other transaction) will 'partially release' the SERIALIZABLEXACT, meaning that the conflicts and locks it holds are released, but the SERIALIZABLEXACT itself will remain active because other backends might still have a pointer to it. Whenever any backend notices the SXACT_FLAG_RO_SAFE flag, it clears its own MySerializableXact variable and frees local resources so that it can skip SSI checks for the rest of the transaction. In the special case of the leader process, it transfers the SERIALIZABLEXACT to a new variable SavedSerializableXact, so that it can be completely released at the end of the transaction after all workers have exited. Remove the serializable_okay flag added to CreateParallelContext() by commit `9da0cc35`, because it's now redundant. Author: Thomas Munro Reviewed-by: Haribabu Kommi, Robert Haas, Masahiko Sawada, Kevin Grittner Discussion: https://postgr.es/m/CAEepm=0gXGYhtrVDWOTHS8SQQy_=S9xo+8oCxGLWZAOoeJ=yzQ@mail.gmail.com	2019-03-15 17:47:04 +13:00
Peter Geoghegan	3f34283973	Correct obsolete nbtree page split comment. Commit `40dae7ec53`, which made the nbtree page split algorithm more robust, made _bt_insert_parent() only unlock the right child of the parent page before inserting a new downlink into the parent. Update a comment from the Berkeley days claiming that both left and right child pages are unlocked before the new downlink actually gets inserted. The claim that it is okay to release both locks early based on Lehman and Yao's say-so never made much sense. Lehman and Yao must sometimes "couple" buffer locks across a pair of internal pages when relocating a downlink, unlike the corresponding code within _bt_getstack().	2019-03-12 16:40:05 -07:00
Andres Freund	8cacea7a72	Ensure sufficient alignment for ParallelTableScanDescData in BTShared. Previously ParallelTableScanDescData was just a member in BTShared, but after `c2fe139c2` that doesn't guarantee sufficient alignment as specific AMs might (are likely to) need atomic variables in the struct. One might think that MAXALIGNing would be sufficient, but as a comment in shm_toc_allocate() explains, that's not enough. For now, copy the hack described there. For parallel sequential scans no such change is needed, as its allocations go through shm_toc_allocate(). An alternative approach would have been to allocate the parallel scan descriptor in a separate TOC entry, but there seems little benefit in doing so. Per buildfarm member dromedary. Author: Andres Freund Discussion: https://postgr.es/m/20190311203126.ty5gbfz42gjbm6i6@alap3.anarazel.de	2019-03-11 14:26:43 -07:00
Andres Freund	c2fe139c20	tableam: Add and use scan APIs. Too allow table accesses to be not directly dependent on heap, several new abstractions are needed. Specifically: 1) Heap scans need to be generalized into table scans. Do this by introducing TableScanDesc, which will be the "base class" for individual AMs. This contains the AM independent fields from HeapScanDesc. The previous heap_{beginscan,rescan,endscan} et al. have been replaced with a table_ version. There's no direct replacement for heap_getnext(), as that returned a HeapTuple, which is undesirable for a other AMs. Instead there's table_scan_getnextslot(). But note that heap_getnext() lives on, it's still used widely to access catalog tables. This is achieved by new scan_begin, scan_end, scan_rescan, scan_getnextslot callbacks. 2) The portion of parallel scans that's shared between backends need to be able to do so without the user doing per-AM work. To achieve that new parallelscan_{estimate, initialize, reinitialize} callbacks are introduced, which operate on a new ParallelTableScanDesc, which again can be subclassed by AMs. As it is likely that several AMs are going to be block oriented, block oriented callbacks that can be shared between such AMs are provided and used by heap. table_block_parallelscan_{estimate, intiialize, reinitialize} as callbacks, and table_block_parallelscan_{nextpage, init} for use in AMs. These operate on a ParallelBlockTableScanDesc. 3) Index scans need to be able to access tables to return a tuple, and there needs to be state across individual accesses to the heap to store state like buffers. That's now handled by introducing a sort-of-scan IndexFetchTable, which again is intended to be subclassed by individual AMs (for heap IndexFetchHeap). The relevant callbacks for an AM are index_fetch_{end, begin, reset} to create the necessary state, and index_fetch_tuple to retrieve an indexed tuple. Note that index_fetch_tuple implementations need to be smarter than just blindly fetching the tuples for AMs that have optimizations similar to heap's HOT - the currently alive tuple in the update chain needs to be fetched if appropriate. Similar to table_scan_getnextslot(), it's undesirable to continue to return HeapTuples. Thus index_fetch_heap (might want to rename that later) now accepts a slot as an argument. Core code doesn't have a lot of call sites performing index scans without going through the systable_* API (in contrast to loads of heap_getnext calls and working directly with HeapTuples). Index scans now store the result of a search in IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the target is not generally a HeapTuple anymore that seems cleaner. To be able to sensible adapt code to use the above, two further callbacks have been introduced: a) slot_callbacks returns a TupleTableSlotOps* suitable for creating slots capable of holding a tuple of the AMs type. table_slot_callbacks() and table_slot_create() are based upon that, but have additional logic to deal with views, foreign tables, etc. While this change could have been done separately, nearly all the call sites that needed to be adapted for the rest of this commit also would have been needed to be adapted for table_slot_callbacks(), making separation not worthwhile. b) tuple_satisfies_snapshot checks whether the tuple in a slot is currently visible according to a snapshot. That's required as a few places now don't have a buffer + HeapTuple around, but a slot (which in heap's case internally has that information). Additionally a few infrastructure changes were needed: I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now internally uses a slot to keep track of tuples. While systable_getnext() still returns HeapTuples, and will so for the foreseeable future, the index API (see 1) above) now only deals with slots. The remainder, and largest part, of this commit is then adjusting all scans in postgres to use the new APIs. Author: Andres Freund, Haribabu Kommi, Alvaro Herrera Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql	2019-03-11 12:46:41 -07:00
Peter Geoghegan	35bc0ec7c8	Note case where nbtree VACUUM finishes splits. The nbtree README claims that VACUUM can never finish interrupted page splits by design. That isn't entirely accurate, though. Note an exception to the general rule. Discussion: https://postgr.es/m/CAH2-Wz=_Xvv8byzK_LvY4ci76OgsHCQzoKF7We8yG9waO7j6rA@mail.gmail.com	2019-03-04 17:57:36 -08:00
Peter Geoghegan	72c7c4e386	Correct obsolete nbtree page split WAL comment. Commit `2c03216d83`, which revamped the WAL record format, failed to update a comment referencing the old API. Update the comment.	2019-03-04 12:32:40 -08:00
Peter Geoghegan	2ab23445bc	Remove unneeded argument from _bt_getstackbuf(). _bt_getstackbuf() is called at exactly two points following commit `efada2b8e9` (one call site is concerned with page splits, while the other is concerned with page deletion). The parent buffer returned by _bt_getstackbuf() is write-locked in both cases. Remove the 'access' argument and make _bt_getstackbuf() assume that callers require a write-lock.	2019-02-25 17:47:43 -08:00
Peter Geoghegan	067786cea0	Correct obsolete nbtree page deletion comment. Commit `efada2b8e9`, which made the nbtree page deletion algorithm more robust, removed _bt_getstackbuf() calls from _bt_pagedel(). It failed to update a comment that referenced the earlier approach. Update the comment to explain that the _bt_getstackbuf() page deletion call site mirrors the only other remaining _bt_getstackbuf() call site, which is reached during page splits.	2019-02-25 16:54:18 -08:00
Andres Freund	b7eda3e0e3	Move generic snapshot related code from tqual.h to snapmgr.h. The code in tqual.c is largely heap specific. Due to the upcoming pluggable storage work, it therefore makes sense to move it into access/heap/ (as the file's header notes, the tqual name isn't very good). But the various statically allocated snapshot and snapshot initialization functions are now (see previous commit) generic and do not depend on functions declared in tqual.h anymore. Therefore move. Also move XidInMVCCSnapshot as that's useful for future AMs, and already used outside of tqual.c. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-01-21 17:06:41 -08:00
Andres Freund	63746189b2	Change snapshot type to be determined by enum rather than callback. This is in preparation for allowing the same snapshot be used for different table AMs. With the current callback based approach we would need one callback for each supported AM, which clearly would not be extensible. Thus add a new Snapshot->snapshot_type field, and move the dispatch into HeapTupleSatisfiesVisibility() (which is now a function). Later work will then dispatch calls to HeapTupleSatisfiesVisibility() and other AMs visibility functions depending on the type of the table. The central SnapshotType enum also seems like a good location to centralize documentation about the intended behaviour of various types of snapshots. As tqual.h isn't included by bufmgr.h any more (as HeapTupleSatisfies* isn't referenced by TestForOldSnapshot() anymore) a few files now need to include it directly. Author: Andres Freund, loosely based on earlier work by Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql	2019-01-21 17:03:15 -08:00
Andres Freund	e7cc78ad43	Remove superfluous tqual.h includes. Most of these had been obsoleted by `568d4138c` / the SnapshotNow removal. This is is preparation for moving most of tqual.[ch] into either snapmgr.h or heapam.h, which in turn is in preparation for pluggable table AMs. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-01-21 12:15:02 -08:00
Andres Freund	e0c4ec0728	Replace uses of heap_open et al with the corresponding table_* function. Author: Andres Freund Discussion: https://postgr.es/m/20190111000539.xbv7s6w7ilcvm7dp@alap3.anarazel.de	2019-01-21 10:51:37 -08:00
Andres Freund	90525d7b4e	Don't duplicate parallel seqscan shmem sizing logic in nbtree. This is architecturally mildly problematic, which becomes more pronounced with the upcoming introduction of pluggable storage. To fix, teach heap_parallelscan_estimate() to deal with SnapshotAny snapshots, and then use it from _bt_parallel_estimate_shared(). Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de	2019-01-15 12:19:21 -08:00
Andres Freund	4c850ecec6	Don't include heapam.h from others headers. heapam.h previously was included in a number of widely used headers (e.g. execnodes.h, indirectly in executor.h, ...). That's problematic on its own, as heapam.h contains a lot of low-level details that don't need to be exposed that widely, but becomes more problematic with the upcoming introduction of pluggable table storage - it seems inappropriate for heapam.h to be included that widely afterwards. heapam.h was largely only included in other headers to get the HeapScanDesc typedef (which was defined in heapam.h, even though HeapScanDescData is defined in relscan.h). The better solution here seems to be to just use the underlying struct (forward declared where necessary). Similar for BulkInsertState. Another problem was that LockTupleMode was used in executor.h - parts of the file tried to cope without heapam.h, but due to the fact that it indirectly included it, several subsequent violations of that goal were not not noticed. We could just reuse the approach of declaring parameters as int, but it seems nicer to move LockTupleMode to lockoptions.h - that's not a perfect location, but also doesn't seem bad. As a number of files relied on implicitly included heapam.h, a significant number of files grew an explicit include. It's quite probably that a few external projects will need to do the same. Author: Andres Freund Reviewed-By: Alvaro Herrera Discussion: https://postgr.es/m/20190114000701.y4ttcb74jpskkcfb@alap3.anarazel.de	2019-01-14 16:24:41 -08:00
Bruce Momjian	97c39498e5	Update copyright for 2019 Backpatch-through: certain files through 9.4	2019-01-02 12:44:25 -05:00
Tom Lane	586b98fdf1	Make type "name" collation-aware. The "name" comparison operators now all support collations, making them functionally equivalent to "text" comparisons, except for the different physical representation of the datatype. They do, in fact, mostly share the varstr_cmp and varstr_sortsupport infrastructure, which has been slightly enlarged to handle the case. To avoid changes in the default behavior of the datatype, set name's typcollation to C_COLLATION_OID not DEFAULT_COLLATION_OID, so that by default comparisons to a name value will continue to use strcmp semantics. (This would have been the case for system catalog columns anyway, because of commit `6b0faf723`, but doing this makes it true for user-created name columns as well. In particular, this avoids locale-dependent changes in our regression test results.) In consequence, tweak a couple of places that made assumptions about collatable base types always having typcollation DEFAULT_COLLATION_OID. I have not, however, attempted to relax the restriction that user- defined collatable types must have that. Hence, "name" doesn't behave quite like a user-defined type; it acts more like a domain with COLLATE "C". (Conceivably, if we ever get rid of the need for catalog name columns to be fixed-length, "name" could actually become such a domain over text. But that'd be a pretty massive undertaking, and I'm not volunteering.) Discussion: https://postgr.es/m/15938.1544377821@sss.pgh.pa.us	2018-12-19 17:46:25 -05:00
Peter Geoghegan	61a4480a68	Remove obsolete nbtree duplicate entries comment. Remove a comment from the Berkeley days claiming that nbtree must disambiguate duplicate keys within _bt_moveright(). There is no special care taken around duplicates within _bt_moveright(), at least since commit `9e85183bfc` removed inscrutable _bt_moveright() code to handle pages full of duplicates.	2018-12-18 21:40:38 -08:00
Peter Geoghegan	60f3cc9553	Correct obsolete nbtree recovery comments. Commit `40dae7ec53`, which made the handling of interrupted nbtree page splits more robust, removed an nbtree-specific end-of-recovery cleanup step. This meant that it was no longer possible to complete an interrupted page split during recovery. However, a reference to recovery as a reason for using a NULL stack while inserting into a parent page was missed. Remove the reference. Remove a similar obsolete reference to recovery that was introduced much more recently, as part of the btree fastpath optimization enhancement that made it into Postgres 11 (commit `2b272734`, and follow-up commits). Backpatch: 11-, where the fastpath optimization was introduced.	2018-12-18 16:59:50 -08:00
Magnus Hagander	fbec7459aa	Fix spelling errors and typos in comments Author: Daniel Gustafsson <daniel@yesql.se>	2018-11-02 13:56:52 +01:00
Tom Lane	c87cb5f7a6	Allow btree comparison functions to return INT_MIN. Historically we forbade datatype-specific comparison functions from returning INT_MIN, so that it would be safe to invert the sort order just by negating the comparison result. However, this was never really safe for comparison functions that directly return the result of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction on those library functions. Buildfarm results show that at least on recent Linux on s390x, memcmp() actually does return INT_MIN sometimes, causing sort failures. The agreed-on answer is to remove this restriction and fix relevant call sites to not make such an assumption; code such as "res = -res" should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed in a few places that just directly negated the result of memcmp or strcmp. To help find places having this problem, I've also added a compile option to nbtcompare.c that causes some of the commonly used comparators to return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be a good idea to have at least one buildfarm member running with "-DSTRESS_SORT_INT_MIN". That's far from a complete test of course, but it should help to prevent fresh introductions of such bugs. This is a longstanding portability hazard, so back-patch to all supported branches. Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de	2018-10-05 16:01:29 -04:00
Alexander Korotkov	d2086b08b0	Reduce path length for locking leaf B-tree pages during insertion In our B-tree implementation appropriate leaf page for new tuple insertion is acquired using _bt_search() function. This function always returns leaf page locked in shared mode. In order to obtain exclusive lock, caller have to relock the page. This commit makes _bt_search() function lock leaf page immediately in exclusive mode when needed. That removes unnecessary relock and, in turn reduces lock contention for B-tree leaf pages. Our experiments on multi-core systems showed acceleration up to 4.5 times in corner case. Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan	2018-07-28 00:31:40 +03:00
Amit Kapila	8ce29bb4f0	Fix the buffer release order for parallel index scans. During parallel index scans, if the current page to be read is deleted, we skip it and try to get the next page for a scan without releasing the buffer lock on the current page. To get the next page, sometimes it needs to wait for another process to complete its scan and advance it to the next page. Now, it is quite possible that the master backend has errored out before advancing the scan and issued a termination signal for all workers. The workers failed to notice the termination request during wait because the interrupts are held due to buffer lock on the previous page. This lead to all workers being stuck. The fix is to release the buffer lock on current page before trying to get the next page. We are already doing same in backward scans, but missed it for forward scans. Reported-by: Victor Yegorov Bug: 15290 Diagnosed-by: Thomas Munro and Amit Kapila Author: Amit Kapila Reviewed-by: Thomas Munro Tested-By: Thomas Munro and Victor Yegorov Backpatch-through: 10 where parallel index scans were introduced Discussion: https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org	2018-07-27 10:53:00 +05:30
Tom Lane	0905fe8911	Avoid emitting a bogus WAL record when recycling an all-zero btree page. Commit `fafa374f2` caused _bt_getbuf() to possibly emit a WAL record for a page that it was about to recycle. However, it failed to distinguish all-zero pages from dead pages, which is important because only the latter have valid btpo.xact values, or indeed any special space at all. Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore led to an Assert failure, or to emission of a WAL record containing a bogus cutoff XID, which might lead to unnecessary query cancellations on hot standby servers. Per reports from Antonin Houska and 自己. Amit Kapila was first to propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi reviewed it at various times. This is an old bug, so back-patch to all supported branches. Discussion: https://postgr.es/m/2628.1474272158@localhost Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com	2018-07-09 19:26:19 -04:00

1 2 3 4 5 ...

671 Commits