Teach nbtree forward index scans to check the high key before moving to
the right sibling page in the hope of finding that it isn't actually
necessary to do so. The new check may indicate that the scan definitely
cannot find matching tuples to the right, ending the scan immediately.
We already opportunistically force a similar "continuescan orientated"
key check of the final non-pivot tuple when it's clear that it cannot be
returned to the scan due to being dead-to-all. The new high key check
is complementary.
The new approach for forward scans is more effective than checking the
final non-pivot tuple, especially with composite indexes and non-unique
indexes. The improvements to the logic for picking a split point added
by commit fab25024 make it likely that relatively dissimilar high keys
will appear on a page. A distinguishing key value that can only appear
on non-pivot tuples on the right sibling page will often be present in
leaf page high keys.
Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans. Backward scans continue to
opportunistically check the final non-pivot tuple, which is actually the
first non-pivot tuple on the page (not the last).
Note that even pg_upgrade'd v3 indexes make use of this optimization.
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are. This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.
The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point. When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty. Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes. This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.
The number of cycles added is not very noticeable. This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held. We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper. We
use a faster binary comparison instead.
Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes. Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique. This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".
Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially in a multi-column index. Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now. A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved). Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3. contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes. These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.
A later patch will enhance the logic used by nbtree to pick a split
point. This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at. Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.
The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits. The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised. However, there should be a compatibility note in the v12
release notes.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
Use dedicated struct to represent nbtree insertion scan keys. Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions. This is based on a suggestion by Andrey
Lepikhov.
Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner. This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.
The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc(). This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column. Unique key binary
searches need to restore lower and upper bounds. They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
Historically we forbade datatype-specific comparison functions from
returning INT_MIN, so that it would be safe to invert the sort order
just by negating the comparison result. However, this was never
really safe for comparison functions that directly return the result
of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction
on those library functions. Buildfarm results show that at least on
recent Linux on s390x, memcmp() actually does return INT_MIN sometimes,
causing sort failures.
The agreed-on answer is to remove this restriction and fix relevant
call sites to not make such an assumption; code such as "res = -res"
should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed
in a few places that just directly negated the result of memcmp or
strcmp.
To help find places having this problem, I've also added a compile option
to nbtcompare.c that causes some of the commonly used comparators to
return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be
a good idea to have at least one buildfarm member running with
"-DSTRESS_SORT_INT_MIN". That's far from a complete test of course,
but it should help to prevent fresh introductions of such bugs.
This is a longstanding portability hazard, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
Add several assertions that ensure that we're dealing with a pivot tuple
without non-key attributes where that's expected. Also, remove the
assertion within _bt_isequal(), restoring the v10 function signature. A
similar check will be performed for the page highkey within
_bt_moveright() in most cases. Also avoid dropping all objects within
regression tests, to increase pg_dump test coverage for INCLUDE indexes.
Rather than using infrastructure that's generally intended to be used
with reference counted heap tuple descriptors during truncation, use the
same function that was introduced to store flat TupleDescs in shared
memory (we use a temp palloc'd buffer). This isn't strictly necessary,
but seems more future-proof than the old approach. It also lets us
avoid including rel.h within indextuple.c, which was arguably a
modularity violation. Also, we now call index_deform_tuple() with the
truncated TupleDesc, not the source TupleDesc, since that's more robust,
and saves a few cycles.
In passing, fix a memory leak by pfree'ing truncated pivot tuple memory
during CREATE INDEX. Also pfree during a page split, just to be
consistent.
Refactor _bt_check_natts() to be more readable.
Author: Peter Geoghegan with some editorization by me
Reviewed by: Alexander Korotkov, Teodor Sigaev
Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
This patch introduces INCLUDE clause to index definition. This clause
specifies a list of columns which will be included as a non-key part in
the index. The INCLUDE columns exist solely to allow more queries to
benefit from index-only scans. Also, such columns don't need to have
appropriate operator classes. Expressions are not supported as INCLUDE
columns since they cannot be used in index-only scans.
Index access methods supporting INCLUDE are indicated by amcaninclude flag
in IndexAmRoutine. For now, only B-tree indexes support INCLUDE clause.
In B-tree indexes INCLUDE columns are truncated from pivot index tuples
(tuples located in non-leaf pages and high keys). Therefore, B-tree indexes
now might have variable number of attributes. This patch also provides
generic facility to support that: pivot tuples contain number of their
attributes in t_tid.ip_posid. Free 13th bit of t_info is used for indicating
that. This facility will simplify further support of index suffix truncation.
The changes of above are backward-compatible, pg_upgrade doesn't need special
handling of B-tree indexes for that.
Bump catalog version
Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me
Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes,
David Rowley, Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
As src/backend/access/transam/README says, PageGetLSN may only be called
by processes holding either exclusive lock on buffer, or a shared lock
on buffer plus buffer header lock. Therefore any place that only holds
a shared buffer lock must use BufferGetLSNAtomic instead of PageGetLSN,
which internally obtains buffer header lock prior to reading the LSN.
A few callsites failed to comply with this rule. This was detected by
running all tests under a new (not committed) assertion that verifies
PageGetLSN locking contract. All but one of the callsites that failed
the assertion are fixed by this patch. Remaining callsites were
inspected manually and determined not to need any change.
The exception (unfixed callsite) is in TestForOldSnapshot, which only
has a Page argument, making it impossible to access the corresponding
Buffer from it. Fixing that seems a much larger patch that will have to
be done separately; and that's just as well, since it was only
introduced in 9.6 and other bugs are much older.
Some of these bugs are ancient; backpatch all the way back to 9.3.
Authors: Jacob Champion, Asim Praveen, Ashwin Agrawal
Reviewed-by: Michaël Paquier
Discussion: https://postgr.es/m/CABAq_6GXgQDVu3u12mK9O5Xt5abBZWQ0V40LZCE+oUf95XyNFg@mail.gmail.com
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
This isn't exposed to the optimizer or the executor yet; we'll add
support for those things in a separate patch. But this puts the
basic mechanism in place: several processes can attach to a parallel
btree index scan, and each one will get a subset of the tuples that
would have been produced by a non-parallel scan. Each index page
becomes the responsibility of a single worker, which then returns
all of the TIDs on that page.
Rahila Syed, Amit Kapila, Robert Haas, reviewed and tested by
Anastasia Lubennikova, Tushar Ahuja, and Haribabu Kommi.
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35). The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.
As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column. Also provide the ability for an index
AM to override the behavior.
In passing, document pg_am.amtype, overlooked in commit 473b93287.
Andrew Gierth, with kibitzing by me and others
Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
An index search using a row comparison such as ROW(a, b) > ROW('x', 'y')
would stop upon reaching a NULL entry in the "b" column, ignoring the
fact that there might be non-NULL "b" values associated with later values
of "a". This happens because _bt_mark_scankey_required() marks the
subsidiary scankey for "b" as required, which is just wrong: it's for
a column after the one with the first inequality key (namely "a"), and
thus can't be considered a required match.
This bit of brain fade dates back to the very beginnings of our support
for indexed ROW() comparisons, in 2006. Kind of astonishing that no one
came across it before Glen Takahashi, in bug #14010.
Back-patch to all supported versions.
Note: the given test case doesn't actually fail in unpatched 9.1, evidently
because the fix for bug #6278 (i.e., stopping at nulls in either scan
direction) is required to make it fail. I'm sure I could devise a case
that fails in 9.1 as well, perhaps with something involving making a cursor
back up; but it doesn't seem worth the trouble.
606c0123d627 attempted to reduce cost of index scans using > and <
strategies, though got that completely wrong in a few complex cases.
Revert whole patch until we find a safe optimization.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one. The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.
The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.
This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.
Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If index tuple contains no nulls we can skip the first
re-check for each tuple.
Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.
WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.
Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.
Checksum used is WAL CRC-32 truncated to 16-bits.
Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
This fixes another error in commit 9e8da0f75731aaa7605cf4656c21ea09e84d2eb1.
I neglected to make the mark/restore functionality save and restore the
current set of array key values, which led to strange behavior if an
IndexScan with ScalarArrayOpExpr quals was used as the inner side of a
mergejoin. Per bug #7570 from Melese Tesfaye.
In commit e2c2c2e8b1df7dfdb01e7e6f6191a569ce3c3195 I made use of nested
list structures to show which clauses went with which index columns, but
on reflection that's a data structure that only an old-line Lisp hacker
could love. Worse, it adds unnecessary complication to the many places
that don't much care which clauses go with which index columns. Revert
to the previous arrangement of flat lists of clauses, and instead add a
parallel integer list of column numbers. The places that care about the
pairing can chase both lists with forboth(), while the places that don't
care just examine one list the same as before.
The only real downside to this is that there are now two more lists that
need to be passed to amcostestimate functions in case they care about
column matching (which btcostestimate does, so not passing the info is not
an option). Rather than deal with 11-argument amcostestimate functions,
pass just the IndexPath and expect the functions to extract fields from it.
That gets us down to 7 arguments which is better than 11, and it seems
more future-proof against likely additions to the information we keep
about an index path.
As pointed out by Naoya Anzai, my previous try at this was a few bricks
shy of a load, because I had forgotten that the initial-positioning logic
might not try to skip over nulls at the end of the index the scan will
start from. We ought to fix that, because it represents an unnecessary
inefficiency, but first let's get the scan-stop logic back to a safe
state. With this patch, we preserve the performance benefit requested
in bug #6278 for the case of scanning forward into NULLs (in a NULLS
LAST index), but the reverse case of scanning backward across NULLs
when there's no suitable initial-positioning qual is still inefficient.
The existing scan-direction-sensitive tests were overly complex, and
failed to stop the scan in cases where it's perfectly legitimate to do so.
Per bug #6278 from Maksym Boguk.
Back-patch to 8.3, which is as far back as the patch applies easily.
Doesn't seem worth sweating over a relatively minor performance issue in
8.2 at this late date. (But note that this was a performance regression
from 8.1 and before, so 8.2 is being left as an outlier.)
We copy all the matched tuples off the page during _bt_readpage, instead of
expensively re-locking the page during each subsequent tuple fetch. This
costs a bit more local storage, but not more than 2*BLCKSZ worth, and the
reduction in LWLock traffic is certainly worth that. What's more, this
lets us get rid of the API wart in the original patch that said an index AM
could randomly decline to supply an index tuple despite having asserted
pg_am.amcanreturn. That will be important for future improvements in the
index-only-scan feature, since the executor will now be able to rely on
having the index data available.
Such a condition is unsatisfiable in combination with any other type of
btree-indexable condition (since we assume btree operators are always
strict). 8.3 and 8.4 had an explicit test for this, which I removed in
commit 29c4ad98293e3c5cb3fcdd413a3f4904efff8762, mistakenly thinking that
the case would be subsumed by the more general handling of IS (NOT) NULL
added in that patch. Put it back, and improve the comments about it, and
add a regression test case.
Per bug #6079 from Renat Nasyrov, and analysis by Dean Rasheed.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...