Make some of the variable names in _bt_search() consistent with
corresponding variables within _bt_getstackbuf(). This naming scheme is
clearer because the variable names always express a relationship between
the currently locked buffer/page and some other page.
Use separate functions to save and restore error context information as
that made code easier to understand. Also, make it clear that the index
information required for error context is sane.
Author: Andres Freund, Justin Pryzby, Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CAA4eK1LWo+v1OWu=Sky27GTGSCuOmr7iaURNbc5xz6jO+SaPeA@mail.gmail.com
Commit 0d861bbb, which added deduplication to nbtree, had
_bt_check_unique() pass a TID to table_index_fetch_tuple_check() that
isn't safe to mutate. table_index_fetch_tuple_check()'s tid argument is
modified when the TID in question is not the latest visible tuple in a
hot chain, though this wasn't documented.
To fix, go back to using a local copy of the TID in _bt_check_unique(),
and update comments above table_index_fetch_tuple_check().
Backpatch: 13-, where B-Tree deduplication was introduced.
In pg_replication_slot, change output from normal/reserved/lost to
reserved/extended/unreserved/ lost, which better expresses the possible
states particularly near the time where segments are no longer safe but
checkpoint has not run yet.
Under the new definition, reserved means the slot is consuming WAL
that's still under the normal WAL size constraints; extended means it's
consuming WAL that's being protected by wal_keep_segments or the slot
itself, whose size is below max_slot_wal_keep_size; unreserved means the
WAL is no longer safe, but checkpoint has not yet removed those files.
Such as slot is in imminent danger, but can still continue for a little
while and may catch up to the reserved WAL space.
Also, there were some bugs in the calculations used to report the
status; fixed those.
Backpatch to 13.
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
The current definition is dangerous. No bugs exist in our code at
present, but backpatch to 11 nonetheless where it was introduced.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
spg_mask() didn't take into account that pd_lower equal to SizeOfPageHeaderData
is still valid value. This commit fixes that. Backpatch to 11, where
spg_mask() pg_lower check was introduced.
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/20200615131405.GM52676%40paquier.xyz
Backpatch-through: 11
The function has been reading global variable forceSyncCommit, mirroring
the intent of the caller that passed forceSync=forceSyncCommit. The
other caller, RecordTransactionCommitPrepared(), passed false. Since
COMMIT PREPARED can't share a transaction with any command, it certainly
doesn't share a transaction with a command that sets forceSyncCommit.
Reviewed by Michael Paquier.
Discussion: https://postgr.es/m/20200617032615.GC2916904@rfd.leadboat.com
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits. This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.
To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple. This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.
Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
Commit 72d422a5227ef6f76f412486a395aba9f53bf3f0 made xlog.c call
sendTablespace() with the 'sizeonly' argument set to true, which
required basebackup.c to export sendTablespace(). However, that's
kind of ugly, so instead defer the call to sendTablespace() until
basebackup.c regains control. That way, it can still be a static
function.
Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi.
Discussion: http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
Convert buffile.c error handling to use ereport. This fixes cases where
I/O errors were indistinguishable from EOF or not reported. Also remove
"%m" from error messages where errno would be bogus. While we're
modifying those strings, add block numbers and short read byte counts
where appropriate.
Back-patch to all supported releases.
Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
Core by no means makes excessive use of these functions, but quite a large
number of those usages do require the caller to call strlen() on the
returned string. This is quite wasteful since these functions do already
have a good idea of the length of the string, so we might as well just
have them return that.
Reviewed-by: Andrew Gierth
Discussion: https://postgr.es/m/CAApHDvrm2A5x2uHYxsqriO2cUaGcFvND%2BksC9e7Tjep0t2RK_A%40mail.gmail.com
Rewrite the documentation of these functions, in light of recent bug fix
commit 5940ffb2.
Back-patch to 13 where the check-for-conflict-out code was split up into
AM-specific and generic parts, and new documentation was added that now
looked wrong.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io
SSI's HeapCheckForSerializableConflictOut() test failed to correctly
handle conditions involving a concurrently inserted tuple which is later
concurrently updated by a separate transaction . A SELECT statement
that called HeapCheckForSerializableConflictOut() could end up using the
same XID (updater's XID) for both the original tuple, and the successor
tuple, missing the XID of the xact that created the original tuple
entirely. This only happened when neither tuple from the chain was
visible to the transaction's MVCC snapshot.
The observable symptoms of this bug were subtle. A pair of transactions
could commit, with the later transaction failing to observe the effects
of the earlier transaction (because of the confusion created by the
update to the non-visible row). This bug dates all the way back to
commit dafaa3ef, which added SSI.
To fix, make sure that we check the xmin of concurrently inserted tuples
that happen to also have been updated concurrently.
Author: Peter Geoghegan
Reported-By: Kyle Kingsbury
Reviewed-By: Thomas Munro
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443@jepsen.io
Backpatch: All supported versions
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
ControlFileLock before modifying ControlFile->checkPointCopy, or the
checkpointer could write out a control file with a bad checksum.
Likewise, XLogReportParameters() must acquire ControlFileLock before
modifying ControlFile and calling UpdateControlFile().
Back-patch to all supported releases.
Author: Nathan Bossart <bossartn@amazon.com>
Author: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
Since database connections can be used with WAL senders in 9.4, it is
possible to use physical replication. This commit fixes a crash when
starting physical replication with a WAL sender using a database
connection, caused by the refactoring done in 850196b.
There have been discussions about forbidding the use of physical
replication in a database connection, but this is left for later,
taking care only of the crash new to 13.
While on it, add a test to check for a failure when attempting logical
replication if the WAL sender does not have a database connection. This
part is extracted from a larger patch by Kyotaro Horiguchi.
Reported-by: Vladimir Sitnikov
Author: Michael Paquier, Kyotaro Horiguchi
Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera
Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com
Backpatch-through: 13
Commit 624686abcf8 added an assertion that verified that _bt_search
successfully relocated the leaf page undergoing deletion. Page deletion
cannot deal with the case where the descent stack is to the right of the
page, so this seemed critical (deletion can only handle the case where
the descent stack is to the left of the leaf/target page). However, the
assertion went a bit too far.
Since only a buffer pin is held on the leaf page throughout the call to
_bt_search, nothing guarantees that it can't have split during this
small window. And if does actually split, _bt_search may end up
"relocating" a page to the right of the original target leaf page. This
scenario seems extremely unlikely, but it must still be considered.
Remove the assertion, and document how we cope in this scenario.
Synchronize the event names for parallel hash join waits with other
event names, by getting rid of the slashes and dropping "-ing"
suffixes. Rename ClogGroupUpdate to XactGroupUpdate, to match the
new SLRU name. Move the ProcSignalBarrier event to the IPC category;
it doesn't belong under IO.
Also a bit more wordsmithing in the wait event documentation tables.
Discussion: https://postgr.es/m/4505.1589640417@sss.pgh.pa.us
Choose names that fit into the conventions for wait event names
(particularly, that multi-word names are in the style MultiWordName)
and hopefully convey more information to non-hacker users than the
previous names did.
Also rename SerializablePredicateLockListLock to
SerializablePredicateListLock; the old name was long enough to cause
table formatting problems, plus the double occurrence of "Lock" seems
confusing/error-prone.
Also change a couple of particularly opaque LWLock field names.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
_bt_killitems marks btree items dead when a scan leaves the page where
they live, but it does so with only share lock (to improve concurrency).
This was historicall okay, since killing a dead item has no
consequences. However, with the advent of data checksums and
wal_log_hints, this action incurs a WAL full-page-image record of the
page. Multiple concurrent processes would write the same page several
times, leading to WAL bloat. The probability of this happening can be
reduced by only killing items if they're not already dead, so change the
code to do that.
The problem could eliminated completely by having _bt_killitems upgrade
to exclusive lock upon seeing a killable item, but that would reduce
concurrency so it's considered a cure worse than the disease.
Backpatch all the way back to 9.5, since wal_log_hints was introduced in
9.4.
Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
Originally, the names assigned to SLRUs had no purpose other than
being shmem lookup keys, so not a lot of thought went into them.
As of v13, though, we're exposing them in the pg_stat_slru view and
the pg_stat_reset_slru function, so it seems advisable to take a bit
more care. Rename them to names based on the associated on-disk
storage directories (which fortunately we *did* think about, to some
extent; since those are also visible to DBAs, consistency seems like
a good thing). Also rename the associated LWLocks, since those names
are likewise user-exposed now as wait event names.
For the most part I only touched symbols used in the respective modules'
SimpleLruInit() calls, not the names of other related objects. This
renaming could have been taken further, and maybe someday we will do so.
But for now it seems undesirable to change the names of any globally
visible functions or structs, so some inconsistency is unavoidable.
(But I *did* terminate "oldserxid" with prejudice, as I found that
name both unreadable and not descriptive of the SLRU's contents.)
Table 27.12 needs re-alphabetization now, but I'll leave that till
after the other LWLock renamings I have in mind.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
There is little point in using the LWLockRegisterTranche mechanism for
built-in tranche names. It wastes cycles, it creates opportunities for
bugs (since failing to register a tranche name is a very hard-to-detect
problem), and the lack of any centralized list of names encourages
sloppy nonconformity in name choices. Moreover, since we have a
centralized list of the tranches anyway in enum BuiltinTrancheIds, we're
certainly not buying any flexibility in return for these disadvantages.
Hence, nuke all the backend-internal LWLockRegisterTranche calls,
and instead provide a const array of the builtin tranche names.
(I have in mind to change a bunch of these names shortly, but this
patch is just about getting them into one place.)
Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
The additional pain from level 4 is excessive for the gain.
Also revert all the source annotation changes to their original
wordings, to avoid back-patching pain.
Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
* Have both physical and logical walsender share a 'xlogreader' state
struct for tracking state. This replaces the existing globals sendSeg
and sendCxt.
* Change WALRead not to receive XLogReaderState->seg and ->segcxt as
separate arguments anymore; just use the ones from 'state'. This is
made possible by the above change.
* have the XLogReader segment_open contract require the callbacks to
install the file descriptor in the state struct themselves instead of
returning it. xlogreader was already ignoring any possible failed
return from the callbacks, relying solely on them never returning.
(This point is not altogether excellent, as it means the callbacks
have to know more of XLogReaderState; but to really improve on that
we would have to pass back error info from the callbacks to
xlogreader. And the complexity would not be saved but instead just
transferred to the callers of WALRead, which would have to learn how
to throw errors from the open_segment callback in addition of, as
currently, from pg_pread.)
* segment_open no longer receives the 'segcxt' as a separate argument,
since it's part of the XLogReaderState argument.
Per comments from Kyotaro Horiguchi.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
Use it at level 4, a bit more restrictive than the default level, and
tweak our commanding comments to FALLTHROUGH.
(However, leave zic.c alone, since it's external code; to avoid the
warnings that would appear there, change CFLAGS for that file in the
Makefile.)
Author: Julien Rouhaud <rjuju123@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol
Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org
The existing callers of XLogReadDetermineTimeline() performing recovery
need to check a replay LSN position when determining on which timeline
to read a WAL page. A portion of the comment describing this function
said exactly that, while referring to a routine for fetching a write
LSN, something not available in recovery.
Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com
Restructure the function that locates the root of the to-be-deleted
subtree during nbtree page deletion. Handle the conditions that make
page deletion unsafe in a slightly more uniform way, and acknowledge the
fact that the behavior with incomplete splits on internal pages is
different (as pointed out in the nbtree README as of commit 35bc0ec7).
Also invent new terminology that avoids ambiguity around which pages are
about to be deleted. Consistently use the term "to-be-deleted subtree",
not the ambiguous term "branch".
We were calling the subtree parent page the "top parent page", but that
was quite misleading. The top parent page usually refers to a page
unlinked from its siblings and marked deleted (during the second stage
of page deletion). There was one kind of top parent page that we merely
removed a downlink from, and another kind of top parent page that we
actually marked deleted. Eliminate the ambiguity by inventing a new
term ("subtree parent page") that refers to the former kind of page
only.
Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7.
XLogReader's system for opening and closing segments had gotten too
complicated, with callbacks being passed at both the XLogReaderAllocate
level (read_page) as well as at the WALRead level (segment_open). This
was confusing and hard to follow, so restructure things so that these
callbacks are passed together at XLogReaderAllocate time, and add
another callback to the set (segment_close) to make it a coherent whole.
Also, ensure XLogReaderState is an argument to all the callbacks, so
that they can grab at the ->private data if necessary.
Document the whole arrangement more clearly.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
TimelineHistoryRead and TimelineHistoryWrite wait events are reported
during waiting for a read and write of a timeline history file, respectively.
However, previously, TimelineHistoryRead wait event was not reported
while readTimeLineHistory() was reading a timeline history file. Also
TimelineHistoryWrite was not reported while writeTimeLineHistory() was
writing one line with the details of the timeline split, at the end.
This commit fixes these issues.
Back-patch to v10 where wait events for a timeline history file was added.
Author: Masahiro Ikeda
Reviewed-by: Michael Paquier, Fujii Masao
Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com
Factor out code common to _bt_lock_branch_parent() and _bt_pagedel()
into a new utility function. This new function is used to check that
the left sibling of a deletion target page does not have the
INCOMPLETE_SPLIT page flag set. If it is set then deletion is unsafe;
there won't be a usable pivot tuple (with a downlink) in the parent page
that points to the deletion target page. The page deletion algorithm is
not prepared to deal with that. Also restructure an existing, related
utility function that checks if the right sibling of the target page has
the ISHALFDEAD page flag set.
This organization highlights the symmetry between the two cases. The
goal is to make the design of page deletion clearer. Both functions
involve a sibling page with a flag that indicates that there was an
interrupted operation (a page split or a page deletion) that resulted in
a page pointed to by sibling pages, but not pointed to in the parent.
And, both functions indicate if page deletion is unsafe due to the
absence of a particular downlink in the parent page.
In commit 33e05f89c5, we have added the option to display WAL usage
statistics in Explain and auto_explain. The display format used two spaces
between each field which is inconsistent with Buffer usage statistics which
is using one space between each field. Change the format to make WAL usage
statistics consistent with Buffer usage statistics.
This commit also changed the usage of "full page writes" to
"full page images" for WAL usage statistics to make it consistent with
other parts of code and docs.
Author: Julien Rouhaud, Amit Kapila
Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
Remove one of the arguments to btvacuumpage(), and give up on the idea
that it's a recursive function. We now use the term "backtracking" to
refer to the case where an earlier block must be visited to make sure no
tuples that need to be removed were missed.
Advertising btvacuumpage() as a recursive function was unhelpful. In
reality the function always simulates recursion with a loop (it doesn't
actually call itself). This wasn't just necessary as a precaution (per
the comments mentioning tail recursion), though. There is no reliable
natural limit on the number of times we can backtrack.
There are important behavioral difference when "recursing"/backtracking,
mostly related to page deletion. We don't perform page deletion when
backtracking due to the extra complexity. And when we recurse, we're
not performing a physical order scan anymore, so we expect fairly
different conditions to hold for the page. Structuring the code like
this makes it clearer how _bt_pagedel() cooperates with btvacuumpage()
and btvacuumscan() (as established in commit b0229f26 and commit
73a076b0).
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com
On further reflection, code comments added by commit b0229f26 slightly
misrepresented how we determine the oldest bpto.xact for the index.
btvacuumpage() does not treat the bpto.xact of a page that it put in the
FSM as a candidate to be the oldest deleted page (the delete-marked page
that has the oldest bpto.xact XID among all pages encountered).
The definition of a deleted page for the purposes of the bpto.xact
calculation is different from the definition used by the bulk delete
statistics. The bulk delete statistics don't distinguish between pages
that were deleted by the current VACUUM, pages deleted by a previous
VACUUM operation but not yet recyclable/reusable, and pages that are
reusable (though reusable pages are counted separately).
Backpatch: 11-, just like commit b0229f26.
The logic for determining how many nbtree pages in an index are deleted
pages sometimes undercounted pages. Pages that were deleted by the
current VACUUM operation (as opposed to some previous VACUUM operation
whose deleted pages have yet to be reused) were sometimes overlooked.
The final count is exposed to users through VACUUM VERBOSE's "%u index
pages have been deleted" output.
btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
than one page by assuming that only one page was deleted, and that the
additional deleted pages would get picked up during a future call to
btvacuumpage() by the same VACUUM operation. _bt_pagedel() can
legitimately delete pages that the btvacuumscan() scan will not visit
again, though, so that assumption was slightly faulty.
Fix the accounting by teaching _bt_pagedel() about its caller's
requirements. It now only reports on pages that it knows btvacuumscan()
won't visit again (including the current btvacuumpage() page), so
everything works out in the end.
This bug has been around forever. Only backpatch to v11, though, to
keep _bt_pagedel() is sync on the branches that have today's bugfix
commit b0229f26da. Note that this commit changes the signature of
_bt_pagedel(), just like commit b0229f26da.
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-
Commit 857f9c36cda (which taught nbtree VACUUM to skip a scan of the
index from btcleanup in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole. It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).
The most immediate problem was that the special area of the page was
examined without a buffer pin at one point. More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors. For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page. The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.
It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for. The issue was spotted during
unrelated code review.
To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel(). btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()). Note that
_bt_pagedel() should never encounter an existing deleted page.
This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel(). However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM). It seems highly unlikely that such an
extension actually exists, in any case.
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
Avoid accessing the leaf page's top parent tuple without a buffer lock
held during the second phase of nbtree page deletion. The old approach
was safe, though only because VACUUM never drops its buffer pin (and
because only VACUUM itself can modify a half-dead page). Even still, it
seems like a good idea to be strict here. Tighten things up by copying
the top parent page's block number to a local variable before releasing
the buffer lock on the leaf page -- not after.
This is a follow-up to commit fa7ff642, which fixed a similar issue in
the first phase of nbtree page deletion.
Update some related comments in passing.
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
Avoid accessing the deletion target page's special area during nbtree
page deletion at a point where there is no buffer lock held. This issue
was detected by a patch that extends Valgrind's memcheck tool to mark
nbtree pages that are unsafe to access (due to not having a buffer lock
or buffer pin) as NOACCESS.
We do hold a buffer pin at this point, and only access the special area,
so the old approach was safe. Even still, it seems like a good idea to
tighten up the rules in this area. There is no reason to not simply
insist on always holding a buffer lock (not just a pin) when accessing
nbtree pages.
Update some related comments in passing.
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com