If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple. This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).
This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134 it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.
It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows are missing with
some referencing row remaining.
Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
Tuple freezing was broken in connection to MultiXactIds; commit
8e53ae025d tried to fix it, but didn't go far enough. As noted by
Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted
update might cause locks in the multi to go ignored by later
transactions. This is because the code depended on a multixact above
their cutoff point not having any lock-only member older than the cutoff
point for Xids, which is easily defeated in READ COMMITTED transactions.
The fix for this involves creating a new MultiXactId when necessary.
But this cannot be done during WAL replay, and moreover multixact
examination requires using CLOG access routines which are not supposed
to be used during WAL replay either; so tuple freezing cannot be done
with the old freeze WAL record. Therefore, separate the freezing
computation from its execution, and change the WAL record to carry all
necessary information. At WAL replay time, it's easy to re-execute
freezing because we don't need to re-compute the new infomask/Xmax
values but just take them from the WAL record.
While at it, restructure the coding to ensure all page changes occur in
a single critical section without much room for failures. The previous
coding wasn't using a critical section, without any explanation as to
why this was acceptable.
In replication scenarios using the 9.3 branch, standby servers must be
upgraded before their master, so that they are prepared to deal with the
new WAL record once the master is upgraded; failure to do so will cause
WAL replay to die with a PANIC message. Later upgrade of the standby
will allow the process to continue where it left off, so there's no
disruption of the data in the standby in any case. Standbys know how to
deal with the old WAL record, so it's okay to keep the master running
the old code for a while.
In master, the old freeze WAL record is gone, for cleanliness' sake;
there's no compatibility concern there.
Backpatch to 9.3, where the original bug was introduced and where the
previous fix was backpatched.
Álvaro Herrera and Andres Freund
The original performs too poorly; in some scenarios it shows way too
high while profiling. Try to make it a bit smarter to avoid excessive
cosst. In particular, make it have a maximum size, and have entries be
sorted in LRU order; once the max size is reached, evict the oldest
entry to avoid it from growing too large.
Per complaint from Andres Freund in connection with new tuple freezing
code.
WAL records of hint bit updates is useful to tools that want to examine
which pages have been modified. In particular, this is required to make
the pg_rewind tool safe (without checksums).
This can also be used to test how much extra WAL-logging would occur if
you enabled checksums, without actually enabling them (which you can't
currently do without re-initdb'ing).
Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with
further changes by me.
Set min_recovery_apply_delay to force a delay in recovery apply for commit and
restore point WAL records. Other records are replayed immediately. Delay is
measured between WAL record time and local standby time.
Robert Haas, Fabrízio de Royes Mello and Simon Riggs
Detailed review by Mitsumasa Kondo
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
When an external recovery command such as restore_command or
archive_cleanup_command fails, report the exit code properly,
distinguishing signals and normal exists, using the existing
wait_result_to_str() facility, instead of just reporting the return
value from system().
Reviewed-by: Peter Geoghegan <pg@heroku.com>
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid. This means
that a very old Xid could survive hidden within a multi, possibly
outliving its CLOG storage. In the distant future, this would cause
clog lookup failures:
ERROR: could not access status of transaction 3883960912
DETAIL: Could not open file "pg_clog/0E78": No such file or directory.
This mostly was problematic when the updating transaction aborted, since
in that case the row wouldn't get pruned away earlier in vacuum and the
multixact could possibly survive for a long time. In many cases, data
that is inaccessible for this reason way can be brought back
heuristically.
As a second bug, heap_freeze_tuple() didn't properly handle multixacts
that need to be frozen according to cutoff_multi, but whose updater xid
is still alive. Instead of preserving the update Xid, it just set Xmax
invalid, which leads to both old and new tuple versions becoming
visible. This is pretty rare in practice, but a real threat
nonetheless. Existing corrupted rows, unfortunately, cannot be repaired
in an automated fashion.
Existing physical replicas might have already incorrectly frozen tuples
because of different behavior than in master, which might only become
apparent in the future once pg_multixact/ is truncated; it is
recommended that all clones be rebuilt after upgrading.
Following code analysis caused by bug report by J Smith in message
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
and privately by F-Secure.
Backpatch to 9.3, where freezing of MultiXactIds was introduced.
Analysis and patch by Andres Freund, with some tweaks by Álvaro.
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway. Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it. This has made not truncating the content of
pg_multixact/ not defensible anymore.
To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().
At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.
Back-patch to 9.0. 8.4 also has the problem, but since there's no hot
standby there, it's much less pressing. In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required. Add appropriate checks to make sure that's not
happening.
Andres Freund
While autovacuum dutifully launched anti-multixact-wraparound vacuums
when the multixact "age" was reached, the vacuum code was not aware that
it needed to make them be full table vacuums. As the resulting
partial-table vacuums aren't capable of actually increasing relminmxid,
autovacuum continued to launch anti-wraparound vacuums that didn't have
the intended effect, until age of relfrozenxid caused the vacuum to
finally be a full table one via vacuum_freeze_table_age.
To fix, introduce logic for multixacts similar to that for plain
TransactionIds, using the same GUCs.
Backpatch to 9.3, where permanent MultiXactIds were introduced.
Andres Freund, some cleanup by Álvaro
Parts of the code used autovacuum_freeze_max_age to determine whether
anti-multixact-wraparound vacuums are necessary, while others used a
hardcoded 200000000 value. This leads to problems when
autovacuum_freeze_max_age is set to a non-default value. Use the latter
everywhere.
Backpatch to 9.3, where vacuuming of multixacts was introduced.
Andres Freund
Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions). That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.
We could still lose control if an asynchronous SIGINT arrives just as the
function is entered. This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction. We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.
Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return. When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely. Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.
These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored. So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues. (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)
Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.
Per reports from Dan Wood.
Back-patch to 9.3 where the new timeout handling infrastructure was
introduced. We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a
transaction block from error (post-9.3) to warning. (Was nothing in <=
9.3.) Also change ABORT outside of a transaction block from notice to
warning.
These bugs can cause data loss on standbys started with hot_standby=on at
the moment they start to accept read only queries, by marking committed
transactions as uncommited. The likelihood of such corruptions is small
unless the primary has a high transaction rate.
5a031a5556 fixed bugs in HS's startup logic
by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state
was reached, missing the fact that both clog and subtrans are written to
before that. This only failed to fail in common cases because the usage
of ExtendCLOG in procarray.c was superflous since clog extensions are
actually WAL logged.
f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing
extensions of pg_subtrans due to the former commit's changes - which are
not WAL logged - by performing the extensions when switching to a state
> STANDBY_INITIALIZED and not performing xid assignments before that -
again missing the fact that ExtendCLOG is unneccessary - but screwed up
twice: Once because latestObservedXid wasn't updated anymore in that
state due to the earlier commit and once by having an off-by-one error in
the loop performing extensions. This means that whenever a
CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed
between the start of the checkpoint recovery started from and the first
xl_running_xact record old transactions commit bits in pg_clog could be
overwritten if they started and committed in that window.
Fix this mess by not performing ExtendCLOG() in HS at all anymore since
it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.
Analysis and patch by Andres Freund. Reported by Christophe Pettus.
Backpatch down to 9.0, like the previous commit that caused this.
RecoveryIsInProgress() can be called very frequently. During normal
operation, it just checks a backend-local variable and returns quickly,
but during hot standby, it checks a spinlock-protected shared variable.
Those spinlock acquisitions can become a point of contention on a busy
hot standby system.
Replace the spinlock acquisition with a memory barrier.
Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.
The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with
values larger than intptr_t, because TYPEALIGN casts the argument to
intptr_t to do the arithmetic. That's not a problem when dealing with
pointers or lengths or offsets related to pointers, but the XLogInsert
scaling patch added a call to MAXALIGN with an XLogRecPtr argument.
To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64,
which are just like the existing variants but work with uint64 instead of
intptr_t.
Report and patch by David Rowley, analysis by Andres Freund.
It seems to make more sense to use "cutoff multixact" terminology
throughout the backend code; "freeze" is associated with replacing of an
Xid with FrozenTransactionId, which is not what we do for MultiXactIds.
Andres Freund
Some adjustments by Álvaro Herrera
This reverts commit 269e780822
and commit 5b571bb8c8.
Unfortunately, the initial patch had insufficient performance testing,
and resulted in a regression.
Per report by Thom Brown.
Performance testing shows that if the insertpos_lck spinlock and the fields
that it protects are on the same cache line with other variables that are
frequently accessed, the false sharing can hurt performance a lot. Keep
them apart by adding some padding.
This keeps the usual trigger file name unchanged from 9.2, avoiding nasty
issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa.
The fallback behavior of creating a full checkpoint before starting up is now
triggered by a file called "fallback_promote". That can be useful for
debugging purposes, but we don't expect any users to have to resort to that
and we might want to remove that in the future, which is why the fallback
mechanism is undocumented.
When upgrading from servers of versions 9.2 and older, and MultiXactIds
have been used in the old server beyond the first page (that is, 2048
multis or more in the default 8kB-page build), pg_upgrade would set the
next multixact offset to use beyond what has been allocated in the new
cluster. This would cause a failure the first time the new cluster
needs to use this value, because the pg_multixact/offsets/ file wouldn't
exist or wouldn't be large enough. To fix, ensure that the transient
server instances launched by pg_upgrade extend the file as necessary.
Per report from Jesse Denardo in
CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
Initialization of the first XLOG buffer at end-of-recovery was broken for
the case that the last read WAL record ended at a page boundary. Instead of
trying to copy the last full xlog page to the buffer cache in that case,
just set shared state so that the next page is initialized when the first
WAL record after startup is inserted. (that's what we did in earlier
version, too)
To make the shared state required for that case less surprising, replace the
XLogCtl->curridx variable, which was the index of the latest initialized
buffer, with an XLogRecPtr of how far the buffers have been initialized.
That also allows us to get rid of the XLogRecEndPtrToBufIdx macro.
While we're at it, make a similar change for XLogCtl->Write.curridx, getting
rid of that variable and calculating the next buffer to write from
XLogCtl->LogwrtResult instead.
Was broken by my xloginsert scaling patch. XLogCtl global variable needs
to be initialized in each process, as it's not inherited by fork() on
Windows.
This patch replaces WALInsertLock with a number of WAL insertion slots,
allowing multiple backends to insert WAL records to the WAL buffers
concurrently. This is particularly useful for parallel loading large amounts
of data on a system with many CPUs.
This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the segment with
zeros. This potentially adds some overhead, but it has been a very common
practice by DBA's to clear the "tail" of the segment with an external
pg_clearxlogtail utility anyway, to make the WAL files compress better.
With this patch, it's no longer necessary to do that.
This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
insertion slots. Performance testing suggests that the default, 8, works
pretty well for all kinds of worklods, but I left the GUC in place to allow
others with different hardware to test that easily. We might want to remove
that before release.
Reviewed by Andres Freund.
On some platforms, posix_fallocate() is available but may still return
EINVAL if the underlying filesystem does not support it. So, in case
of an error, fall through to the alternate implementation that just
writes zeros.
Per buildfarm failure and analysis by Tom Lane.
This function is more efficient than actually writing out zeroes to
the new file, per microbenchmarks by Jon Nelson. Also, it may reduce
the likelihood of WAL file fragmentation.
Jon Nelson, with review by Andres Freund, Greg Smith and me.
In 9.3, there's no particular limit on the number of bgworkers;
instead, we just count up the number that are actually registered,
and use that to set MaxBackends. However, that approach causes
problems for Hot Standby, which needs both MaxBackends and the
size of the lock table to be the same on the standby as on the
master, yet it may not be desirable to run the same bgworkers in
both places. 9.3 handles that by failing to notice the problem,
which will probably work fine in nearly all cases anyway, but is
not theoretically sound.
A further problem with simply counting the number of registered
workers is that new workers can't be registered without a
postmaster restart. This is inconvenient for administrators,
since bouncing the postmaster causes an interruption of service.
Moreover, there are a number of applications for background
processes where, by necessity, the background process must be
started on the fly (e.g. parallel query). While this patch
doesn't actually make it possible to register new background
workers after startup time, it's a necessary prerequisite.
Patch by me. Review by Michael Paquier.
We don't normally bother retrying when the number of bytes written by
write() is short of what was requested. It is generally assumed that a
write() to disk doesn't return short, unless you run out of disk space.
While writing the WAL, however, it seems prudent to try a bit harder,
because a failure leads to PANIC. The write() is also much larger than most
write()s in the backend (up to wal_buffers), so there's more room for
surprises.
Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.
Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
Most of the documentation uses "single-user mode", so use that in the
code as well. Adjust the documentation to match the new error message
wording. Also add a documentation index entry for "single-user mode".
Based-on-patch-by: Jeff Janes <jeff.janes@gmail.com>
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
elog.c has historically treated LOG messages as low-priority during
bootstrap and standalone operation. This has led to confusion and even
masked a bug, because the normal expectation of code authors is that
elog(LOG) will put something into the postmaster log, and that wasn't
happening during initdb. So get rid of the special-case rule and make
the priority order the same as it is in normal operation. To keep from
cluttering initdb's output and the behavior of a standalone backend,
tweak the severity level of three messages routinely issued by xlog.c
during startup and shutdown so that they won't appear in these cases.
Per my proposal back in December.
Since commit f21bb9cfb5, this function
ignores the caller-provided length and loops until it finds a
terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the
previous loop control logic. In passing, revert the addition of an
unused variable by the same commit, presumably a debugging relic.
Seems cleaner to get the currently-replayed TLI in the same call to
GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the
comment what the code does when recovery has already ended
(RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make
resetting ThisTimeLineID afterwards more explicit.
Make slightly better decisions about indentation than what pgindent
is capable of. Mostly breaking out long function calls into one
line per argument, with a few other minor adjustments.
No functional changes- all whitespace.
pgindent ran cleanly (didn't change anything) after.
Passes all regressions.