Every core SLRU wraps around. With the exception of pg_notify, the wrap
point can fall in the middle of a page. Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback. Update each callback implementation to fit the new
specification. This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.) The bug fixed here has the same symptoms and user
followup steps as 592a589a04. Back-patch
to 9.5 (all supported versions).
Reviewed by Andrey Borodin and (in earlier versions) by Tom Lane.
Discussion: https://postgr.es/m/20190202083822.GC32531@gust.leadboat.com
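To illustrate the comparison style this specification requires (a minimal
standalone sketch with an illustrative page size, not the actual callback
code): in circular XID space, one page "logically precedes" another only
if every XID it can hold precedes every XID the other can hold, and
because the wrap point can fall mid-page, both ends of a page must be
checked.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define XACTS_PER_PAGE 2048     /* illustrative, not the real constant */

    /* modular comparison, in the style of TransactionIdPrecedes() */
    static bool
    xid_precedes(uint32_t a, uint32_t b)
    {
        return (int32_t) (a - b) < 0;
    }

    /* true only if all XIDs on pageA precede all XIDs on pageB */
    static bool
    page_precedes_logically(uint32_t pageA, uint32_t pageB)
    {
        uint32_t firstA = pageA * XACTS_PER_PAGE;
        uint32_t lastA = firstA + XACTS_PER_PAGE - 1;
        uint32_t firstB = pageB * XACTS_PER_PAGE;

        return xid_precedes(firstA, firstB) && xid_precedes(lastA, firstB);
    }

    int
    main(void)
    {
        printf("%d\n", page_precedes_logically(1, 2));  /* prints 1 */
        printf("%d\n", page_precedes_logically(2, 1));  /* prints 0 */
        return 0;
    }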
macOS's ranlib issued a warning about an empty pread.o file with the
previous arrangement, on systems new enough to require no replacement
functions. Let's go back to using configure's AC_REPLACE_FUNCS system
to build and include each .o in the library only if it's needed, which
requires moving the *v() functions to their own files.
Also move the _with_retry() wrapper to a more permanent home.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1283127.1610554395%40sss.pgh.pa.us
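For context, the kind of replacement function involved looks roughly like
this (a sketch assuming a POSIX platform; not the actual src/port
implementation):

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /*
     * Fallback pwritev() for systems lacking it: seek, then writev().
     * Unlike a real pwritev(), this changes the fd's file offset, so
     * callers must not depend on the offset afterwards.
     */
    static ssize_t
    fallback_pwritev(int fd, const struct iovec *iov, int iovcnt,
                     off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) < 0)
            return -1;
        return writev(fd, iov, iovcnt);
    }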
This is a follow-up of the work done in commit 0650ff2303. This commit
extends log_recovery_conflict_waits so that a log message is also produced
when a recovery conflict has already been resolved after deadlock_timeout
passes, i.e., when the startup process finishes waiting for a recovery
conflict after deadlock_timeout. This is useful in investigating how long
recovery conflicts prevented recovery from applying WAL.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Bertrand Drouvot
Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
Similar to commit d6ad34f341, this patch optimizes
DropRelFileNodesAllBuffers() by avoiding the complete buffer pool scan and
instead finding the buffers to be invalidated by doing lookups in the
BufMapping table.
This optimization helps operations where relation files need to be
removed, such as Truncate, Drop, and Abort of Create Table.
Author: Kirk Jamison
Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
The recovery path of DropRelFileNodeBuffers() is optimized so that
scanning of the whole buffer pool can be avoided when the number of
blocks to be truncated in a relation is below a certain threshold. For
such cases, we find the buffers by doing lookups in BufMapping table.
This improves the performance by more than 100 times in many cases
when several small tables (tested with 1000 relations) are truncated
and where the server is configured with a large value of shared
buffers (greater than or equal to 100GB).
This optimization helps cases (a) when vacuum or autovacuum truncated off
any of the empty pages at the end of a relation, or (b) when the relation is
truncated in the same transaction in which it was created.
This commit introduces a new API smgrnblocks_cached which returns a cached
value for the number of blocks in a relation fork. This helps us to determine
the exact size of the relation, which is required to apply this optimization.
The exact size is required to ensure that we don't leave behind any buffer
for the relation being dropped, as otherwise the background writer or
checkpointer could PANIC while flushing buffers corresponding to files that
no longer exist.
Author: Kirk Jamison based on ideas by Amit Kapila
Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila
Tested-By: Haiying Tang
Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
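In outline, the optimization looks roughly like this (a sketch with
hypothetical helper names standing in for smgrnblocks_cached() and the
BufMapping lookups; not the committed code):

    #include <stddef.h>

    #define NBLOCKS_UNCACHED ((size_t) -1)
    #define FULL_SCAN_THRESHOLD 32      /* illustrative threshold */

    /* hypothetical stand-ins for the real smgr/buffer-pool machinery */
    static size_t cached_nblocks(void) { return 8; }
    static void invalidate_buffer_if_present(size_t blk) { (void) blk; }
    static void invalidate_by_full_scan(void) { }

    static void
    drop_rel_buffers(void)
    {
        size_t nblocks = cached_nblocks();

        /*
         * The size must be known exactly: leaving even one buffer behind
         * could make the checkpointer PANIC on a vanished file.
         */
        if (nblocks != NBLOCKS_UNCACHED && nblocks < FULL_SCAN_THRESHOLD)
        {
            for (size_t blk = 0; blk < nblocks; blk++)
                invalidate_buffer_if_present(blk);
        }
        else
            invalidate_by_full_scan();
    }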
This commit adds GUC log_recovery_conflict_waits that controls whether
a log message is produced when the startup process is waiting longer than
deadlock_timeout for recovery conflicts. This is useful in determining
if recovery conflicts prevent the recovery from applying WAL.
Note that currently a log message is produced only when a recovery conflict
has not yet been resolved after deadlock_timeout passes, i.e.,
only when the startup process is still waiting for the recovery conflict
even after deadlock_timeout.
Author: Bertrand Drouvot, Masahiko Sawada
Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi, Fujii Masao
Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
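The logging behavior amounts to something like this (a simplified sketch;
the real startup process sleeps on a latch rather than spinning, and
reports the wait via ereport()):

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static void
    wait_for_conflict(bool (*resolved)(void), double deadlock_timeout_secs,
                      bool log_recovery_conflict_waits)
    {
        time_t start = time(NULL);
        bool logged = false;

        while (!resolved())
        {
            if (log_recovery_conflict_waits && !logged &&
                difftime(time(NULL), start) >= deadlock_timeout_secs)
            {
                printf("still waiting for recovery conflict after %.0f s\n",
                       difftime(time(NULL), start));
                logged = true;
            }
        }
    }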
This GUC variable works much like idle_in_transaction_session_timeout,
in that it kills sessions that have waited too long for a new client
query. But it applies when we're not in a transaction, rather than
when we are.
Li Japin, reviewed by David Johnston and Hayato Kuroda, some
fixes by me
Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
The deadlocks that the recovery conflict on lock is involved in can
happen between hot-standby backends and the startup process.
If a backend takes an access exclusive lock on a table and that lock
finally triggers the deadlock, the deadlock can be detected
as expected. On the other hand, previously, if the startup process
took an access exclusive lock and that lock finally triggered the deadlock,
the deadlock could not be detected and could remain even after
deadlock_timeout passed. This was a bug.
The cause of this bug was that the code for handling the recovery
conflict on lock didn't handle the deadlock case at all. It assumed
that deadlocks involving the startup process and backends could
be detected by the deadlock detector invoked within the backends.
But this assumption was incorrect: the startup process also needs
to invoke the deadlock detector when necessary.
To fix this bug, this commit makes the startup process invoke
the deadlock detector if deadlock_timeout is reached while handling
the recovery conflict on lock. Specifically, in that case, the startup
process requests all the backends holding the conflicting locks to
check themselves for deadlocks.
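In shape, the new behavior is roughly (a sketch with invented names; the
real code signals the backends through PostgreSQL's own IPC mechanisms):

    /* hypothetical stand-in for signaling one backend */
    static void
    send_deadlock_check_request(int pid)
    {
        (void) pid;     /* e.g. send a signal to that backend */
    }

    typedef struct { int pid; } LockHolder;

    /*
     * Once deadlock_timeout elapses, the startup process asks every
     * holder of a conflicting lock to run the deadlock detector, since
     * the startup process cannot run it on its own behalf.
     */
    static void
    on_deadlock_timeout(LockHolder *holders, int nholders)
    {
        for (int i = 0; i < nholders; i++)
            send_deadlock_check_request(holders[i].pid);
    }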
Back-patch to v9.6. v9.5 also has this bug, but per discussion we decided
not to back-patch the fix to v9.5, because v9.5 lacks some of the
infrastructure (e.g., 37c54863cf) that this bug fix depends on. We could
back-patch that infrastructure as well, but since the next minor
version release is the final one for v9.5, doing so is risky: if we
unexpectedly introduced a new bug into v9.5 by the back-patch, there would
be no chance to fix it. We determined that back-patching to v9.5 would
bring more risk than gain.
Author: Fujii Masao
Reviewed-by: Bertrand Drouvot, Masahiko Sawada, Kyotaro Horiguchi
Discussion: https://postgr.es/m/4041d6b6-cf24-a120-36fa-1294220f8243@oss.nttdata.com
The logic for determining the latest removed XID for the purposes of
generating recovery conflicts in REDO routines was subtly broken. It
failed to follow links from HOT chains, and so failed to consider all
relevant heap tuple headers in some cases.
To fix, expand the loop that deals with LP_REDIRECT line pointers to
also deal with HOT chains. The new version of the loop is loosely based
on a similar loop from heap_prune_chain().
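In heavily simplified form, the fixed loop walks the whole chain rather
than stopping at the redirect target (illustrative structures, not
PostgreSQL's actual page format):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        bool is_redirect;   /* LP_REDIRECT-style line pointer */
        int next;           /* next chain member, or -1 at chain end */
        uint32_t xmax;      /* deleting XID, 0 if none */
    } ChainItem;

    /* modular comparison, in the style of TransactionIdFollows() */
    static bool
    xid_follows(uint32_t a, uint32_t b)
    {
        return (int32_t) (a - b) > 0;
    }

    /* consider every tuple header in the chain, not just the first */
    static uint32_t
    latest_removed_xid(ChainItem *items, int root, uint32_t latest)
    {
        int i = items[root].is_redirect ? items[root].next : root;

        while (i != -1)
        {
            if (items[i].xmax != 0 && xid_follows(items[i].xmax, latest))
                latest = items[i].xmax;
            i = items[i].next;
        }
        return latest;
    }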
The impact of this bug is probably quite limited, since the horizon code
necessarily deals with heap tuples that are pointed to by LP_DEAD-set
index tuples. The process of setting LP_DEAD index tuples (e.g. within
the kill_prior_tuple mechanism) is highly correlated with opportunistic
pruning of pointed-to heap tuples. Plus the question of generating a
recovery conflict usually comes up some time after index tuple LP_DEAD
bits were initially set, unlike heap pruning, where a latestRemovedXid
is generated at the point of the pruning operation (heap pruning has no
deferred "would-be page split" style processing that produces conflicts
lazily).
Only backpatch to Postgres 12, the first version where this logic runs
during original execution (following commit 558a9165e0). The index
latestRemovedXid mechanism has had the same bug since it first appeared
over 10 years ago (in commit a760893d), but backpatching to all
supported versions now seems like a bad idea on balance. Running the
new improved code during recovery seems risky, especially given the lack
of complaints from the field.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=Eib393+HHcERK_9MtgNS7Ew1HY=RDC_g6GL46zM5C6Q@mail.gmail.com
Backpatch: 12-
The reverted patch needs test cases, reorganization, and cfbot testing.
Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive)
and 08db7c63f3..ccbe34139b.
Reported-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
This adds a key management system that stores (currently) two data
encryption keys of length 128, 192, or 256 bits. The data keys are
AES256 encrypted using a key encryption key, and validated via GCM
cipher mode. A command to obtain the key encryption key must be
specified at initdb time, and will be run at every database server
start. New parameters allow a file descriptor open to the terminal to
be passed. pg_upgrade support has also been added.
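For illustration, wrapping a data key with the key encryption key under
AES-256-GCM looks roughly like this with OpenSSL's EVP interface (a sketch
assuming OpenSSL 1.1+; the patch's actual code and parameters may differ):

    #include <openssl/evp.h>

    /*
     * Encrypt (wrap) a data key with a 256-bit KEK using AES-256-GCM.
     * The 16-byte tag is what later validates the wrapped key; iv must
     * be 12 bytes.  Returns 0 on success, -1 on failure.
     */
    static int
    wrap_data_key(const unsigned char *kek, const unsigned char *iv,
                  const unsigned char *datakey, int keylen,
                  unsigned char *wrapped, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len;
        int rc = -1;

        if (ctx == NULL)
            return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, kek, iv) == 1 &&
            EVP_EncryptUpdate(ctx, wrapped, &len, datakey, keylen) == 1 &&
            EVP_EncryptFinal_ex(ctx, wrapped + len, &len) == 1 &&
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) == 1)
            rc = 0;
        EVP_CIPHER_CTX_free(ctx);
        return rc;
    }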
Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com
Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us
Author: Masahiko Sawada, me, Stephen Frost
Up to now, if the DBA issued "pg_ctl stop -m immediate", the message
sent to clients was the same as for a crash-and-restart situation.
This is confusing, not least because the message claims that the
database will soon be up again, something we have no business
predicting.
Improve things so that we can generate distinct messages for the two
cases (and also recognize an ad-hoc SIGQUIT, should somebody try that).
To do that, add a field to pmsignal.c's shared memory data structure
that the postmaster sets just before broadcasting SIGQUIT to its
children. No interlocking seems to be necessary; the intervening
signal-sending and signal-receipt should sufficiently serialize accesses
to the field. Hence, this isn't any riskier than the existing usages
of pmsignal.c.
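The mechanism has roughly this shape (a simplified sketch, not the actual
pmsignal.c code):

    #include <signal.h>
    #include <sys/types.h>

    typedef enum { QUIT_NONE, QUIT_IMMEDIATE_SHUTDOWN, QUIT_CRASH } QuitReason;

    /*
     * In reality this lives in pmsignal.c's shared memory; the
     * signal-send/signal-receipt ordering serializes access to it.
     */
    static volatile sig_atomic_t quit_reason = QUIT_NONE;

    static void
    broadcast_quit(const pid_t *children, int nchildren, QuitReason reason)
    {
        quit_reason = reason;           /* set just before signaling */
        for (int i = 0; i < nchildren; i++)
            kill(children[i], SIGQUIT); /* children read quit_reason */
    }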
We might in future extend this idea to improve other
postmaster-to-children signal scenarios, although none of them
currently seem to be as badly overloaded as SIGQUIT.
Discussion: https://postgr.es/m/559291.1608587013@sss.pgh.pa.us
Revert ac22929a26, as well as the followup fix 113d3591b8, because they
broke the assumption that the startup process waiting for the recovery
conflict on buffer pin should be woken up only by buffer unpin or by the
timeout enabled in ResolveRecoveryConflictWithBufferPin(). They caused,
for example, the SIGHUP signal handler or the walreceiver process to wake
that startup process up unnecessarily frequently.
Additionally, add comments about why the dedicated latch that the
reverted patch tried to get rid of should not be removed.
Thanks to Kyotaro Horiguchi for the discussion.
Author: Fujii Masao
Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).
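The check amounts to requiring exactly one key-hashing flag, roughly like
this (flag bit values are illustrative, not the real ones):

    #include <assert.h>

    #define HASH_STRINGS  0x0001    /* illustrative bit values */
    #define HASH_BLOBS    0x0002
    #define HASH_FUNCTION 0x0004

    static void
    check_key_flags(int flags)
    {
        int n = !!(flags & HASH_STRINGS) +
                !!(flags & HASH_BLOBS) +
                !!(flags & HASH_FUNCTION);

        assert(n == 1);     /* exactly one way to hash the key */
    }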
Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes. This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)
Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize. Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.
Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one. This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not. We might as well save a few
cycles by standardizing on "not".
Also improve the documentation for hash_create().
In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.
Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
When committing a transaction that dropped a relation, we previously
truncated only the first segment file to free up disk space (the one
that won't be unlinked until the next checkpoint).
Truncate higher numbered segments too, even though we unlink them on
commit. This frees the disk space immediately, even if other backends
have open file descriptors and might take a long time to get around to
handling shared invalidation events and closing them. Also extend the
same behavior to the first segment, in recovery.
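In outline (a sketch with an invented path scheme; not the committed
code):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Truncate every segment of a dropped relation to zero length so
     * the disk space is reclaimed immediately, even while other
     * backends still hold open file descriptors; the files themselves
     * are unlinked later.
     */
    static void
    truncate_all_segments(const char *base_path)
    {
        char path[1024];

        if (truncate(base_path, 0) != 0)    /* segment 0 */
            return;
        for (int seg = 1;; seg++)           /* "<base>.1", "<base>.2", ... */
        {
            snprintf(path, sizeof(path), "%s.%d", base_path, seg);
            if (truncate(path, 0) != 0)
                break;                      /* no more segments */
        }
    }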
Back-patch to all supported releases.
Bug: #16663
Reported-by: Denis Patron <denis.patron@previnet.it>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Reviewed-by: David Zhang <david.zhang@highgo.ca>
Discussion: https://postgr.es/m/16663-fe97ccf9932fc800%40postgresql.org
Reverts 27838981be (some comments are kept). Per discussion, it does
not seem safe to relax the lock level used for this; in order for it to
be safe, there would have to be memory barriers between the point we set
the flag and the point we set the transaction Xid, which perhaps would
not be so bad; but there would also have to be barriers at the readers'
side, which from a performance perspective might be bad.
Now maybe this analysis is wrong and it *is* safe for some reason, but
proof of that is not trivial.
Discussion: https://postgr.es/m/20201118190928.vnztes7c2sldu43a@alap3.anarazel.de
While ereport() and elog() themselves are quite cheap when the
error message level is too low to be printed, some places need to do
substantial work before they can call those macros at all. To allow
optimizing away such setup work when nothing is to be printed, make
elog.c export a new function message_level_is_interesting(elevel)
that reports whether ereport/elog will do anything. Make use of that
in various places that had ad-hoc direct tests of log_min_messages etc.
Also teach ProcSleep to use it to avoid some work. (There may well
be other places that could usefully use this; I didn't search hard.)
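The intended usage pattern is essentially (a simplified sketch; the real
function also considers where messages could be routed):

    #include <stdbool.h>

    enum { DEBUG2 = 11, LOG = 15, WARNING = 19 };   /* illustrative values */

    static int log_min_messages = WARNING;
    static int client_min_messages = WARNING;

    static bool
    message_level_is_interesting(int elevel)
    {
        return elevel >= log_min_messages ||
               elevel >= client_min_messages;
    }

    /*
     * Callers then skip expensive setup when nothing would be printed:
     *
     *     if (message_level_is_interesting(DEBUG2))
     *         elog(DEBUG2, "state: %s", expensive_to_compute());
     */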
Within elog.c, refactor a little bit to avoid having duplicate copies
of the policy-setting logic. When that code was written, we weren't
relying on the availability of inline functions; so it had some
duplications in the name of efficiency, which I got rid of.
Alvaro Herrera and Tom Lane
Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
While cancelling an autovacuum worker, we hold ProcArrayLock while
formatting a debugging log string. We can make this shorter by saving
the data we need to produce the message and doing the formatting outside
the locked region.
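The pattern, in generic form (a sketch using a pthread mutex in place of
ProcArrayLock):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
    static int blocker_pid;         /* stand-ins for shared state */
    static unsigned blocker_xid;

    static void
    log_autovacuum_cancel(void)
    {
        int pid;
        unsigned xid;

        pthread_mutex_lock(&proc_array_lock);
        pid = blocker_pid;          /* copy only what the message needs */
        xid = blocker_xid;
        pthread_mutex_unlock(&proc_array_lock);

        /* format the string only after releasing the lock */
        printf("cancelling autovacuum task (pid %d, xid %u)\n", pid, xid);
    }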
This isn't terribly critical, as it only occurs pretty rarely: when a
backend runs deadlock detection and happens to be blocked by an
autovacuum worker running autovacuum. Still, there's no need to cause a hiccup
in ProcArrayLock processing, which can be very high-traffic in some
cases.
While at it, rework code so that we only print the string when it is
really going to be used, as suggested by Michael Paquier.
Discussion: https://postgr.es/m/20201118214127.GA3179@alvherre.pgsql
Reviewed-by: Michael Paquier <michael@paquier.xyz>
We don't actually need a lock to set PGPROC->statusFlags itself; what we
do need is a shared lock on either XidGenLock or ProcArrayLock in order to
ensure MyProc->pgxactoff keeps still while we modify the mirror array in
ProcGlobal->statusFlags. Some places were using an exclusive lock for
that, which is excessive. Relax those to use shared lock only.
procarray.c has a couple of places with somewhat brittle assumptions
about PGPROC changes: ProcArrayEndTransaction uses only shared lock, so
it's permissible to change MyProc only. On the other hand,
ProcArrayEndTransactionInternal also changes other procs, so it must
hold exclusive lock. Add asserts to ensure those assumptions continue
to hold.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20201117155501.GA13805@alvherre.pgsql
Otherwise, if FDDEBUG is enabled, the debugging output fails because
it tries to read the fileName, which isn't set up yet (and should in
fact always be NULL).
AFAICT, this has been wrong since Berkeley. Before 96bf88d52,
it would accidentally fail to crash on platforms where snprintf()
is forgiving about being passed a NULL pointer for %s; but the
file name intended to be included in the debug output wouldn't
ever have shown up.
Report and fix by Greg Nancarrow. Although this is only visibly
broken in custom-made builds, it still seems worth back-patching
to all supported branches, as the FDDEBUG code is pretty useless
as it stands.
Discussion: https://postgr.es/m/CAJcOf-cUDgm9qYtC_B6XrC6MktMPNRby2p61EtSGZKnfotMArw@mail.gmail.com
When ComputeXidHorizons() was called before MyDatabaseId is set,
e.g. because a dead row in a shared relation is encountered during
InitPostgres(), the horizon for normal tables was computed too
aggressively, ignoring all backends connected to a database.
During subsequent pruning in a data table the too aggressive horizon
could end up still being used, possibly leading to still needed tuples
being removed. Not good.
This is a bug in dc7420c2c9, which the test added in 94bc27b576 made
visible, if run with force_parallel_mode set to regress. In that case
the bug is reliably triggered, because "pruning_query" is run in a
parallel worker and the start of that parallel worker is likely to
encounter a dead row in pg_database.
The fix is trivial: Compute a more pessimistic data table horizon if
MyDatabaseId is not yet known.
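The shape of the fix (a simplified sketch with invented helper names):

    #include <stdint.h>

    #define InvalidOid 0

    /* hypothetical stand-ins for the real horizon computations */
    static uint32_t oldest_xmin_all_backends(void) { return 100; }
    static uint32_t oldest_xmin_in_database(uint32_t db) { (void) db; return 200; }

    static uint32_t
    data_table_horizon(uint32_t my_database_id)
    {
        /*
         * Before MyDatabaseId is known we cannot tell which backends
         * share our database, so use the most pessimistic horizon.
         */
        if (my_database_id == InvalidOid)
            return oldest_xmin_all_backends();
        return oldest_xmin_in_database(my_database_id);
    }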
Author: Andres Freund
Discussion: https://postgr.es/m/20201029040030.p4osrmaywhqaesd4@alap3.anarazel.de
This fixes a bug in the edge case where, for a temp table, heap_page_prune()
can end up with a different horizon than heap_vacuum_rel(), which can trigger
errors like "ERROR: cannot freeze committed xmax ...".
The bug was introduced due to interaction of a7212be8b9 "Set cutoff xmin more
aggressively when vacuuming a temporary table." with dc7420c2c9 "snapshot
scalability: Don't compute global horizons while building snapshots.".
The problem is caused by lazy_scan_heap() assuming that the only reason its
HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is
a HOT tuple, or if the tuple's inserting transaction has aborted since the
heap_page_prune() call. But after a7212be8b9 that was also possible in other
cases for temp tables, because heap_page_prune() uses a different visibility
test after dc7420c2c9.
The fix is fairly simple: Move the special case logic for temp tables from
vacuum_set_xid_limits() to the infrastructure introduced in dc7420c2c9. That
ensures that the horizon used for pruning is at least as aggressive as the one
used by lazy_scan_heap(). The concrete horizon used for temp tables is
slightly different than the logic in dc7420c2c9, but should always be as
aggressive as before (see comments).
A significant benefit of centralizing the logic in procarray.c is that
the more aggressive horizons for temp tables now apply not just to VACUUM
but also to, e.g., HOT pruning and the nbtree killtuples logic.
Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I
undid the related changes from a7212be8b9.
This commit also adds an isolation test ensuring that the more aggressive
vacuuming and pruning of temp tables keeps working.
Debugged-By: Amit Kapila <amit.kapila16@gmail.com>
Debugged-By: Tom Lane <tgl@sss.pgh.pa.us>
Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de
CheckBuffer() is designed to be a concurrent-safe function able to run
sanity checks on a relation page without loading it into the shared
buffers. The operation is done using a lock on the partition involved
in the shared buffer mapping hashtable and an I/O lock for the buffer
itself, preventing the risk of false positives due to any concurrent
activity.
The primary use of this function is the detection of on-disk corruptions
for relation pages. If a page is found in shared buffers, its on-disk
version is checked if the buffer is not dirty (a follow-up checkpoint would
flush a valid version of the page anyway if it is dirty), as it is possible
for a page to sit in shared buffers for a long time while its on-disk
version is corrupted. Such a scenario could lead to a corrupted cluster if
the host suddenly loses power, for example. If the page is not found in shared
buffers, its on-disk state is checked. PageIsVerifiedExtended() is used
to apply the same sanity checks as when a page gets loaded into shared
buffers.
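The decision flow, in sketch form (hypothetical helpers, not the actual
function):

    #include <stdbool.h>

    static bool page_in_shared_buffers(unsigned blkno) { (void) blkno; return false; }
    static bool buffer_is_dirty(unsigned blkno) { (void) blkno; return false; }
    static bool verify_on_disk_page(unsigned blkno) { (void) blkno; return true; }

    static bool
    check_buffer(unsigned blkno)
    {
        if (page_in_shared_buffers(blkno))
        {
            if (buffer_is_dirty(blkno))
                return true;    /* a checkpoint will rewrite a valid copy */
            /* a clean buffer may hide a long-corrupted on-disk version */
            return verify_on_disk_page(blkno);
        }
        return verify_on_disk_page(blkno);
    }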
This function will be used by an upcoming patch able to check the state
of on-disk relation pages using a SQL function.
Author: Julien Rouhaud, Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
This is useful for checks of relation pages without having to load the
pages into the shared buffers, and two cases can make use of that: page
verification in base backups and the online, lock-safe flavor.
Compatibility is kept with past versions using a macro that calls the
new extended routine with the set of options compatible with the
original version.
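The compatibility shim follows the usual macro-over-extended-routine
pattern, roughly (the PIV_* flag names shown are my assumption, not
necessarily the committed ones):

    #include <stdbool.h>

    typedef char *Page;
    enum { PIV_LOG_WARNING = 1, PIV_REPORT_STAT = 2 };  /* assumed names */

    /* stand-in body for the extended routine */
    static bool
    PageIsVerifiedExtended(Page page, unsigned blkno, int flags)
    {
        (void) page; (void) blkno; (void) flags;
        return true;
    }

    /* old name maps to the extended routine with equivalent options */
    #define PageIsVerified(page, blkno) \
        PageIsVerifiedExtended((page), (blkno), \
                               PIV_LOG_WARNING | PIV_REPORT_STAT)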
Extracted from a larger patch by the same author.
Author: Anastasia Lubennikova
Reviewed-by: Michael Paquier, Julien Rouhaud
Discussion: https://postgr.es/m/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru
While registering for postmaster exit events, we have to handle a couple
of edge cases where the postmaster is already gone. Commit 815c2f09
missed one: EACCES must surely imply that PostmasterPid no longer
belongs to our postmaster process (or alternatively an unexpected
permissions model has been imposed on us). Like ESRCH, this should be
treated as a WL_POSTMASTER_DEATH event, rather than being raised with
ereport().
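The analogous classic idiom, for illustration (a sketch; the actual
change is in the kqueue registration path):

    #include <errno.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <sys/types.h>

    /*
     * kill(pid, 0) sends no signal but validates the pid.  EACCES means
     * the pid now belongs to some other user's process, so the process
     * we meant is gone; treat it exactly like ESRCH.
     */
    static bool
    postmaster_is_dead(pid_t postmaster_pid)
    {
        if (kill(postmaster_pid, 0) == 0)
            return false;
        return errno == ESRCH || errno == EACCES;
    }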
No known problems reported in the wild. Per code review from Tom Lane.
Back-patch to 13.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us
If WaitEventSetWait() reports that the postmaster has gone away, later
calls to WaitEventSetWait() should continue to report that. Otherwise
further waits that occur in the proc_exit() path after we already
noticed the postmaster's demise could block forever.
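In miniature (a sketch of the sticky-flag idea, not the real latch code):

    #include <stdbool.h>

    static bool postmaster_dead;    /* sticky, per-process */

    static bool
    postmaster_death_pending(bool wait_reported_death)
    {
        if (wait_reported_death)
            postmaster_dead = true; /* remember it for all later waits */
        return postmaster_dead;
    }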
Back-patch to 13, where the kqueue support landed.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us
I (Andres) broke this in 623a9ba79b, because I didn't think sufficiently
about the way snapshots are built on standbys. Unfortunately our
existing tests did not catch this, as they are all just querying with
psql (therefore ending up with fresh snapshots).
The fix is trivial, we just need to increment the transaction
completion counter in ExpireTreeKnownAssignedTransactionIds(), which
is the equivalent of ProcArrayEndTransaction() during recovery.
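The caching rule being repaired looks like this in miniature (a sketch,
not the real snapshot code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint64_t xact_completion_count; } SharedState;
    typedef struct { uint64_t built_at_count; bool valid; } CachedSnapshot;

    /* reuse a snapshot only if no transaction completed since it was built */
    static bool
    snapshot_still_valid(const SharedState *shared, const CachedSnapshot *snap)
    {
        return snap->valid &&
               snap->built_at_count == shared->xact_completion_count;
    }

    /* the bug: recovery's transaction-end path forgot to do this */
    static void
    transaction_completed(SharedState *shared)
    {
        shared->xact_completion_count++;
    }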
This commit also adds a new test doing some basic testing of the
correctness of snapshots built on standbys. To avoid the
aforementioned issue of one-shot psql sessions not exercising the snapshot
caching, the test uses long-lived psql sessions, similar to
013_crash_restart.pl. It'd be good to extend the test further.
Reported-By: Ian Barwick <ian.barwick@2ndquadrant.com>
Author: Andres Freund <andres@anarazel.de>
Author: Ian Barwick <ian.barwick@2ndquadrant.com>
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb18f@2ndquadrant.com
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.
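The request-collapsing idea, in miniature (a sketch keyed by file
descriptor for brevity; the real mechanism keys on file identity and
lives in the infrastructure from commit 3eb77eba):

    #include <stddef.h>
    #include <unistd.h>

    #define MAX_PENDING 1024

    static int pending_fds[MAX_PENDING];
    static size_t npending;

    /* collapse repeated requests for the same file into one entry... */
    static void
    remember_sync_request(int fd)
    {
        for (size_t i = 0; i < npending; i++)
            if (pending_fds[i] == fd)
                return;             /* already queued */
        if (npending < MAX_PENDING)
            pending_fds[npending++] = fd;
    }

    /* ...and issue a single fsync per file at checkpoint time */
    static void
    process_sync_requests(void)
    {
        for (size_t i = 0; i < npending; i++)
            fsync(pending_fds[i]);
        npending = 0;
    }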
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Rearrange things so that data collected in CheckpointStats
includes SLRU activity.
Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them. (I'm not sure if they were ever needed, but
they aren't now.)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
This function could often be seen in profiles of vacuum and could be
a significant bottleneck during recovery. The problem was that a qsort
was performed in order to sort an array of item pointers in reverse offset
order so that we could use that to safely move tuples up to the end of the
page without overwriting the memory of yet-to-be-moved tuples; i.e., we
used to compact the page starting at the back of the page and moving towards
the front. The qsort that this required could be expensive for pages with
a large number of tuples.
In this commit, we take another approach to tuple compactification.
Now, instead of sorting the remaining item pointers array we first check
if the array is presorted and only memmove() the tuples that need to be
moved. This presorted check can be done very cheaply in the calling
functions when the array is being populated. This presorted case is very
fast.
When the item pointer array is not presorted we must copy tuples that need
to be moved into a temp buffer before copying them back into the page
again. This differs from what we used to do here as we're now copying the
tuples back into the page in reverse line pointer order. Previously we
left the existing order alone. Reordering the tuples results in an
increased likelihood of hitting the pre-sorted case the next time around.
Any newly added tuple which consumes a new line pointer will also maintain
the correct sort order of tuples in the page which will also result in the
presorted case being hit the next time. Only consuming an unused line
pointer can cause the order of tuples to go out again, but that will be
corrected next time the function is called for the page.
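The presorted fast path amounts to this (a simplified sketch over a flat
buffer, not the real page format):

    #include <string.h>

    #define PAGE_SIZE 8192

    typedef struct { size_t off; size_t len; } ItemRef;

    /*
     * With items ordered by decreasing offset, each tuple can be slid
     * toward the end of the page one at a time: no not-yet-moved tuple
     * is ever overwritten, so neither a qsort nor a temp buffer is
     * needed.
     */
    static void
    compact_presorted(char *page, ItemRef *items, int nitems)
    {
        size_t upper = PAGE_SIZE;

        for (int i = 0; i < nitems; i++)
        {
            upper -= items[i].len;
            memmove(page + upper, page + items[i].off, items[i].len);
            items[i].off = upper;
        }
    }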
Benchmarks have shown that the non-presorted case is at least equally as
fast as the original qsort method even when the page just has a few
tuples. As the number of tuples becomes larger the new method maintains
its performance whereas the original qsort method became much slower when
the number of tuples on the page became large.
Author: David Rowley
Reviewed-by: Thomas Munro
Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com
Although 58c6feccf fixed the case for SIGQUIT, we were still calling
proc_exit() from signal handlers for SIGTERM and timeout failures in
ProcessStartupPacket. Fortunately, at the point where that code runs,
we haven't yet connected to shared memory in any meaningful way, so
there is nothing we need to undo in shared memory. This means it
should be safe to use _exit(1) here, ie, not run any atexit handlers
but also inform the postmaster that it's not a crash exit.
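In sketch form (handler name illustrative):

    #include <signal.h>
    #include <unistd.h>

    /*
     * Before shared memory has been touched there is nothing to undo,
     * so the handler may _exit(1) directly: atexit callbacks are
     * skipped, and exit status 1 tells the postmaster this was not a
     * crash.
     */
    static void
    startup_packet_die(int signo)
    {
        (void) signo;
        _exit(1);       /* not exit(), not proc_exit() */
    }

    /* e.g.: signal(SIGTERM, startup_packet_die); */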
To make sure nobody breaks the "nothing to undo" expectation, add
a cross-check that no on-shmem-exit or before-shmem-exit handlers
have been registered yet when we finish using these signal handlers.
This change is simple enough that maybe it could be back-patched,
but I won't risk that right now.
Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us
Historically, cancel_before_shmem_exit() just silently did nothing
if the specified callback wasn't the top-of-stack. The folly of
ignoring this case was exposed by the bugs fixed in 303640199 and
bab150045, so let's make it throw elog(ERROR) instead.
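In sketch form, the new rule is (illustrative structures, not the actual
ipc.c code):

    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*cleanup_fn)(int code, void *arg);
    typedef struct { cleanup_fn fn; void *arg; } Callback;

    static Callback before_shmem_exit_list[32];
    static int before_shmem_exit_index;

    /* cancelling anything but the top-of-stack entry is now an error */
    static void
    cancel_before_shmem_exit(cleanup_fn fn, void *arg)
    {
        int top = before_shmem_exit_index - 1;

        if (top < 0 ||
            before_shmem_exit_list[top].fn != fn ||
            before_shmem_exit_list[top].arg != arg)
        {
            fprintf(stderr, "callback is not the latest entry on the stack\n");
            abort();    /* stand-in for elog(ERROR) */
        }
        before_shmem_exit_index = top;
    }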
There is a decent argument to be made that PG_ENSURE_ERROR_CLEANUP
should use some separate infrastructure, so it wouldn't break if
something inside the guarded code decides to register a new
before_shmem_exit callback. However, a survey of the surviving
uses of before_shmem_exit() and PG_ENSURE_ERROR_CLEANUP doesn't
show any plausible conflicts of that sort today, so for now we'll
forgo the extra complexity. (It will almost certainly become
necessary if anyone ever wants to wrap PG_ENSURE_ERROR_CLEANUP
around arbitrary user-defined actions, though.)
No backpatch, since this is developer support not a production issue.
Bharath Rupireddy, per advice from Andres Freund, Robert Haas, and myself
Discussion: https://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com