This was not changed in HEAD, but will be done later as part of a
pgindent run. Future pgindent runs will also do this.
Report by Tom Lane
Backpatch through all supported branches, but not HEAD
Don't reset the rightlink of a page when replaying a page update record.
This was a leftover from pre-hot standby days, when it was not possible to
have scans concurrent with WAL replay. Resetting the right-link was not
necessary back then either, but it was done for the sake of tidiness. But
with hot standby, it's wrong, because a concurrent scan might still need it.
Backpatch all versions with hot standby, 9.0 and above.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.
There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.
The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
Commit d22a09dc70 introduced official support
for GiST consistentFns that want to cache data using the FmgrInfo fn_extra
pointer: the idea was to preserve the cached values across gistrescan(),
whereas formerly they'd been leaked. However, there was an oversight in
that, namely that multiple scan keys might reference the same column's
consistentFn; the code would result in propagating the same cache value
into multiple scan keys, resulting in crashes or wrong answers. Use a
separate array instead to ensure that each scan key keeps its own state.
Per bug #8143 from Joel Roller. Back-patch to 9.2 where the bug was
introduced.
After further reflection I was unconvinced that the existing coding is
guaranteed to return valid union datums in every code path for multi-column
indexes. Fix that by forcing a gistunionsubkey() call at the end of the
recursion. Having done that, we can remove some clearly-redundant calls
elsewhere. This should be a little faster for multi-column indexes (since
the previous coding would uselessly do such a call for each column while
unwinding the recursion), as well as much harder to break.
Also, simplify the handling of cases where one side or the other of a
primary split contains only don't-care tuples. The previous coding used a
very ugly hack in removeDontCares() that essentially forced one random
tuple to be treated as non-don't-care, providing a random initial choice of
seed datum for the secondary split. It seems unlikely that that method
will give better-than-random splits. Instead, treat such a split as
degenerate and just let the next column determine the split, the same way
that we handle fully degenerate cases where the two sides produce identical
union datums.
This LOG message was put in over five years ago with the evident
expectation that we'd make all GiST opclasses support secondary split
directly. However, no such thing ever happened, and indeed the number of
opclasses supporting it decreased to zero in 9.2. The reason is that
improving on the default implementation isn't that easy --- the
opclass-specific code that did exist, before 9.2, doesn't appear to have
been any improvement over the default.
Hence, remove the message altogether. There's certainly no point in
nagging users about this in released branches, but I doubt that we'll
ever implement complete opclass-specific support anyway.
Not only is this implementation of secondary-split not better than the
default implementation in gistsplit.c, it's actually worse. The gistsplit.c
code at least looks to see if switching the left and right sides would make
a better merge with the previously-split tuples, while this doesn't.
In any case it's rather useless to support secondary split only in an edge
case. There used to be more complete support for it here (in chooseLR()),
but that was removed in commit 7f3bd86843.
It appears to me though that the chooseLR() code was really isomorphic to
the default implementation, since it was still based on choosing the cheaper
way of adding two sub-split vectors that had been chosen without regard to
the primary split initially. I think an implementation of secondary split
that could beat the default implementation would have to be pretty fully
integrated into the split algorithm, not plastered on at the end.
Back-patch to 9.2, but not further; previous branches have the chooseLR()
code which I don't feel a great need to mess with. This is mainly so we
just have two behaviors and not three among the various branches (IOW, this
patch is cleanup for commit 7f3bd86843e5aad84585a57d3f6b80db3c609916's
incomplete removal of secondary-split support).
Improve comments, rename some variables and functions, slightly simplify
a couple of APIs, in an attempt to make this code readable by people other
than its original author.
Even though this is essentially just cosmetic, back-patch to all active
branches, because otherwise it's going to make back-patching future fixes
in this file very painful.
While there's considerable doubt that we want fuzzy behavior in the
geometric operators at all (let alone as currently implemented), nobody is
stepping forward to redesign that stuff. In the meantime it behooves us
to make sure that index searches agree with the behavior of the underlying
operators. This patch fixes two problems in this area.
First, gist_box_same was using fuzzy equality, but it really needs to use
exact equality to prevent not-quite-identical upper index keys from being
treated as identical, which for example would prevent an existing upper
key from being extended by an amount less than epsilon. This would result
in inconsistent indexes. (The next release notes will need to recommend
that users reindex GiST indexes on boxes, polygons, circles, and points,
since all four opclasses use gist_box_same.)
Second, gist_point_consistent used exact comparisons for upper-page
comparisons in ~= searches, when it needs to use fuzzy comparisons to
ensure it finds all matches; and it used fuzzy comparisons for point <@ box
searches, when it needs to use exact comparisons because that's what the
<@ operator (rather inconsistently) does.
The added regression test cases illustrate all three misbehaviors.
Back-patch to all active branches. (8.4 did not have GiST point_ops,
but it still seems prudent to apply the gist_box_same patch to it.)
Alexander Korotkov, reviewed by Noah Misch
When considering a non-last column in a multi-column GiST index,
gistsplit.c tries to improve on the split chosen by the opclass-specific
pickSplit function by considering penalties for the next column. However,
there were two bugs in this code: it failed to recompute the union keys for
the leftmost index columns, even though these might well change after
reassigning tuples; and it included the old union keys in the recomputation
for the columns it did recompute, so that those keys couldn't get smaller
even if they should. The first problem could result in an invalid index
in which searches wouldn't find index entries that are in fact present;
the second would make the index less efficient to search.
Both of these errors were caused by misuse of gistMakeUnionItVec, whose
API was designed in a way that just begged such errors to be made. There
is no situation in which it's safe or useful to compute the union keys for
a subset of the index columns, and there is no caller that wants any
previous union keys to be included in the computation; so the undocumented
choice to treat the union keys as in/out rather than pure output parameters
is a waste of code as well as being dangerous.
Hence, rather than just making a minimal patch, I've changed the API of
gistMakeUnionItVec to remove the "startkey" parameter (it now always
processes all index columns) and treat the attr/isnull arrays as purely
output parameters.
In passing, also get rid of a couple of unnecessary and dangerous uses
of static variables in gistutil.c. It's remarkable that the one in
gistMakeUnionKey hasn't given us portability troubles before now, because
in addition to posing a re-entrancy hazard, it was unsafely assuming that
a static char[] array would have at least Datum alignment.
Per investigation of a trouble report from Tomas Vondra. (There are also
some bugs in contrib/btree_gist to be fixed, but that seems like material
for a separate patch.) Back-patch to all supported branches.
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
This back-ports commits c8ba697a4b and
e5db11c558, which fix one definite and one
speculative bug in gistchoose, and make the code a lot more intelligible as
well. In 9.2 only, this also affects the largely-copied-and-pasted logic
in gistRelocateBuildBuffersOnSplit.
The impact of the bugs was that the functions might make poor decisions
as to which index tree branch to push a new entry down into, resulting in
GiST index bloat and poor performance. The fixes rectify these decisions
for future insertions, but a REINDEX would be needed to clean up any
existing index bloat.
Alexander Korotkov, Robert Haas, Tom Lane
We use a hash table to track the parents of inner pages, but when inserting
to a leaf page, the caller of gistbufferinginserttuples() must pass a
correct block number of the leaf's parent page. Before gistProcessItup()
descends to a child page, it checks if the downlink needs to be adjusted to
accommodate the new tuple, and updates the downlink if necessary. However,
updating the downlink might require splitting the page, which might move the
downlink to a page to the right. gistProcessItup() doesn't realize that, so
when it descends to the leaf page, it might pass an out-of-date parent block
number as a result. Fix that by returning the block a tuple was inserted to
from gistbufferinginserttuples().
This fixes the bug reported by Zdeněk Jílovec.
We used to mimic the way a stack is constructed when descending the tree
during normal GiST inserts, but that was quite complicated during a buffered
build. It was also wrong: in GiST, the left-to-right relationships on
different levels might not match each other, so that when you know the
parent of a child page, you won't necessarily find the parent of the page to
the right of the child page by following the rightlinks at the parent level.
This sometimes led to "could not re-find parent" errors while building a
GiST index.
We now use a simple hash table to track the parent of every internal page.
Whenever a page is split, and downlinks are moved from one page to another,
we update the hash table accordingly. This is also better for performance
than the old method, as we never need to move right to re-find the parent
page, which could take a significant amount of time for buffers that were
created much earlier in the index build.
There were two bugs here: We forgot to call gistFreeBuildBuffers() function
at the end of build, and we passed interXact == true to BufFileCreateTemp,
so the file wasn't automatically cleaned up at end-of-transaction either.
The result of (maintenance_work_mem * 1024) / BLCKSZ doesn't fit in a signed
32-bit integer, if maintenance_work_mem >= 2GB. Use double instead. And
while we're at it, write the calculations in an easier to understand form,
with the intermediary steps written out and commented.
When we create a temporary copy of the old node buffer, in stack, we mustn't
leak that into any of the long-lived data structures. Before this patch,
when we called gistPopItupFromNodeBuffer(), it got added to the array of
"loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the
pointer added to the loaded buffers array points to garbage. Often that goes
unnotied, because when we go through the array of loaded buffers to unload
them, buffers with a NULL pageBuffer are ignored, which can often happen by
accident even if the pointer points to garbage.
This patch fixes that by marking the temporary copy in stack explicitly as
temporary, and refrain from adding buffers marked as temporary to the array
of loaded buffers.
While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber
and improve comments a bit. This isn't strictly necessary, but makes
debugging easier.
When inserting the downlinks for a split gist page, we used hold the locks
on the child pages until the insertion into the parent - and recursively its
parent if it had to be split too - were all completed. Change that so that
the locks on child pages are released after the insertion in the immediate
parent is done, before recursing further up the tree.
This reduces the number of lwlocks that are held simultaneously. Holding
many locks is bad for concurrency, and in extreme cases you can even hit
the limit of 100 simultaneously held lwlocks in a backend. If you're really
unlucky, you can hit the limit while in a critical section, which brings
down the whole system.
This fixes bug #6629 reported by Tom Forbes. Backpatch to 9.1. The page
splitting code was rewritten in 9.1, and the old code did not have this
problem.
Previously it was thought that it's impossible as the code stands, because
insertions create buffers as tuples are cascaded downwards, and index
split also creaters buffers eagerly for all halves. But the example from
Jay Levitt demonstrates that it can happen, when the root page is split.
It's in fact OK if the buffer doesn't exist, so we just need to remove the
sanity check. In fact, we've been discussing the possibility of destroying
empty buffers to conserve memory, which would render the sanity check
completely useless anyway.
Fix by Alexander Korotkov
Use of a randomly chosen large value was never exactly graceful, and
now that there are penalty functions that are intentionally using infinity,
it doesn't seem like a good idea for null-vs-not-null to be using something
less.
The original idea of this patch was to make box picksplit run faster, by
eliminating unnecessary palloc() overhead, but that was obsoleted by the new
double-sorting split algorithm that doesn't call these functions so heavily
anymore. Nevertheless, the code looks better this way.
Original patch by me, reviewed and tidied up after the double-sorting patch
by Kevin Grittner.
pg_trgm was already doing this unofficially, but the implementation hadn't
been thought through very well and leaked memory. Restructure the core
GiST code so that it actually works, and document it. Ordinarily this
would have required an extra memory context creation/destruction for each
GiST index search, but I was able to avoid that in the normal case of a
non-rescanned search by finessing the handling of the RBTree. It used to
have its own context always, but now shares a context with the
scan-lifespan data structures, unless there is more than one rescan call.
This should make the added overhead unnoticeable in typical cases.
This oversight led to a massive memory leak --- upwards of 10KB per tuple
--- during creation-time verification of an exclusion constraint based on a
GIST index. In most other scenarios it'd just be a leak of 10KB that would
be recovered at end of query, so not too significant; though perhaps the
leak would be noticeable in a situation where a GIST index was being used
in a nestloop inner indexscan. In any case, it's a real leak of long
standing, so patch all supported branches. Per report from Harald Fuchs.
queuedForEmptying flag correctly on buffer when adding it to the queue.
Also, don't add buffer to the queue if it's there already. These were
harmless oversights; failing to set the flag just means that a buffer might
get added to the queue twice if more tuples are added to it (although that
can't actually happen at this point because all the upper buffers have
already been emptied), and having the same buffer twice in the emptying
queue is harmless. But better be tidy.
When building a GiST index that doesn't fit in cache, buffers are attached
to some internal nodes in the index. This speeds up the build by avoiding
random I/O that would otherwise be needed to traverse all the way down the
tree to the find right leaf page for tuple.
Alexander Korotkov
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
First, when following a right-link, we incorrectly marked the current page
as the parent of the right sibling. In reality, the parent of the right page
is the same as the parent of the current page (or some page to the right of
it, gistFindCorrectParent() will sort that out).
Secondly, when we follow a right-link, we must prepend, not append, the right
page to our list of pages to visit. That's because we assume that once we
hit a leaf page in the list, all the rest are leaf pages too, and give up.
To hit these bugs, you need concurrent actions and several unlucky accidents.
Another backend must split the root page, while you're in process of
splitting a lower-level page. Furthermore, while you scan the internal nodes
to re-find the parent, another backend needs to again split some more internal
pages. Even then, the bugs don't necessarily manifest as user-visible errors
or index corruption.
While we're at it, make the error reporting a bit better if gistFindPath()
fails to re-find the parent. It used to be an assertion, but an elog() seems
more appropriate.
Backpatch to all supported branches.
Apparently sane-looking penalty code might return small negative values,
for example because of roundoff error. This will confuse places like
gistchoose(). Prevent problems by clamping negative penalty values to
zero. (Just to be really sure, I also made it force NaNs to zero.)
Back-patch to all supported branches.
Alexander Korotkov
Experimentation with contrib/btree_gist shows that the majority of the GIST
support functions potentially need collation information. Safest policy
seems to be to pass it to all of them, instead of making assumptions about
which ones could possibly need it.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
This warning is new in gcc 4.6 and part of -Wall. This patch cleans
up most of the noise, but there are some still warnings that are
trickier to remove.