After commit cc2c7d65fc added this flag,
failure to reset it caused assertion failures. In non-assert builds, it
made the system fail to achieve the objectives listed in that commit;
chiefly, we might emit a spurious log message. Back-patch to v15, where
that commit first appeared.
Bharath Rupireddy and Kyotaro Horiguchi. Reviewed by Dilip Kumar,
Nathan Bossart and Michael Paquier. Reported by Dilip Kumar.
This commit has been applied as of b4f584f9d2 in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me. Tests have been
conducted by the both of us.
Discussion: https://postgr.es/m/CAFiTN-sE3ry=ycMPVtC+Djw4Fd7gbUGVv_qqw6qfzp=JLvqT3g@mail.gmail.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
The previous commit addressed the chief consequences of a race condition
between InstallXLogFileSegment() and KeepFileRestoredFromArchive(). Fix
three lesser consequences. A spurious durable_rename_excl() LOG message
remained possible. KeepFileRestoredFromArchive() wasted the proceeds of
WAL recycling and preallocation. Finally, XLogFileInitInternal() could
return a descriptor for a file that KeepFileRestoredFromArchive() had
already unlinked. That felt like a recipe for future bugs.
This commit has been applied as of cc2c7d65fc in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me, Noah Misch is the
author. Tests have been conducted by the both of us.
Note that this commit is known to have introduced a regression of its
own. This is fixed by the commit following this one, and not grouped in
a single commit to keep the commit history consistent across all
branches.
Reported-by: Arun Thirupathi
Author: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
Before a restartpoint finishes PreallocXlogFiles(), a startup process
KeepFileRestoredFromArchive() call can unlink the preallocated segment.
If a CHECKPOINT sql command had elicited the restartpoint experiencing
the race condition, that sql command failed. Moreover, the restartpoint
omitted its log_checkpoints message and some inessential resource
reclamation. Prevent the ERROR by skipping open() of the segment.
Since these consequences are so minor, no back-patch.
This commit has been applied as of 2b3e4672f7 in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me, Noah Misch is the
author. Tests have been conducted by the both of us.
Reported-by: Arun Thirupathi
Author: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
This reverts commit 8ad6c5dbbe, which was a commit specific to v14 and
older branches as the race condition between restartpoints and
KeepFileRestoredFromArchive() still existed.
1f95181b44 has worsened the situation on these two branches, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The same logic as v15 and
newer versions will be applied in some follow-up commits to close this
problem, making this HINT not necessary anymore.
Reported-by: Arun Thirupathi
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
Only initdb used it. initdb refuses to operate on a non-empty directory
and generally does not cope with pre-existing files of other kinds.
Hence, use the opportunity to simplify.
This commit has been applied as of 421484f79c in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me, Noah Misch is the
author. Tests have been conducted by the both of us.
Reported-by: Arun Thirupathi
Author: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
Infrequently, the mismatch caused log_checkpoints messages and
TRACE_POSTGRESQL_CHECKPOINT_DONE() to witness an "added" count too high
by one. Since that consequence is so minor, no back-patch.
This commit has been applied as of 85656bc305 in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me, Noah Misch is the
author. Tests have been conducted by the both of us.
Reported-by: Arun Thirupathi
Author: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
Cold paths, initdb and end-of-recovery, used it. Don't optimize them.
This commit has been applied as of c53c6b98d3 in v15 and newer
versions. This is required on stable branches of v13 and v14 to fix a
regression reported by Noah Misch, introduced by 1f95181b44, causing
spurious failures in archive recovery (neither streaming nor archive
recovery) with concurrent restartpoints. The backpatched versions of
the patches have been aligned on these branches by me, Noah Misch is the
author. Tests have been conducted by the both of us.
Reported-by: Arun Thirupathi
Author: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20210202151416.GB3304930@rfd.leadboat.com
Discussion: https://postgr.es/m/20250306193013.36.nmisch@google.com
Backpatch-through: 13
XLogPageRead() checks immediately for an invalid WAL record header on a
standby, to be able to handle the case of continuation records that need
to be read across two different sources. As written, the check was too
generic, applying to any target LSN. Based on an analysis by Kyotaro
Horiguchi, what really matters is to make sure that the page header is
checked when attempting to read a LSN at the boundary of a segment, to
handle the case of a continuation record that spawns across multiple
pages when dealing with multiple segments, as WAL receivers are spawned
they request WAL from the beginning of a segment. This fix has been
proposed by Kyotaro Horiguchi.
This could cause standbys to loop infinitely when dealing with a
continuation record during a timeline jump, in the case where the
contents of the record in the follow-up page are invalid.
Some regression tests are added to check such scenarios, able to
reproduce the original problem. In the test, the contents of a
continuation record are overwritten with junk zeros on its follow-up
page, and replayed on standbys. This is inspired by 039_end_of_wal.pl,
and is enough to show how standbys should react on promotion by not
being stuck. Without the fix, the test would fail with a timeout. The
test to reproduce the problem has been written by Alexander Kukushkin.
The original check has been introduced in 0668719801, for a similar
problem.
Author: Kyotaro Horiguchi, Alexander Kukushkin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Backpatch-through: 13
This is a cherry pick of 4c8e00ae from the 14 branch into the 13 branch.
It avoids an assertion failure when ForwardSyncRequest() tries to
allocate memory while trying to compact the queue, if run in a critical
section. RelationTruncate() gained a critical section in 38c579b0, and
could fail in that way in the 13 branch.
b1ffe3ff originally fixed the same problem with TruncateMultiXact(), but
for that case it only needed to go back as far as 14, where SLRUs
started using the sync request queue. It also fixed a related deadlock
risk that doesn't apply in this case (this case doesn't wait), but it
might exist in theory and it doesn't hurt to keep the code the same as
later branches.
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> (original commit)
Reviewed-by: Michael Paquier <michael@paquier.xyz> (in this new context)
Reported-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Discussion: https://postgr.es/m/f98aaa79-80b5-47c9-832a-31416a3a528b%40postgrespro.ru
durable_rename_excl() attempts to avoid overwriting any existing files
by using link() and unlink(), and it falls back to rename() on some
platforms (aka WIN32), which offers no such overwrite protection. Most
callers use durable_rename_excl() just in case there is an existing
file, but in practice there shouldn't be one (see below for more
details).
Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file. As per Nathan's tests, it is
possible to end up with two links to the same file in pg_wal after a
crash just before unlink() during WAL recycling. Specifically, the test
produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting,
leading to WAL corruption.
This change replaces all the calls of durable_rename_excl() to
durable_rename(). This removes the protection against accidentally
overwriting an existing file, but some platforms are already living
without it and ordinarily there shouldn't be one. The function itself
is left around in case any extensions are using it. It will be removed
on HEAD via a follow-up commit.
Here is a summary of the existing callers of durable_rename_excl() (see
second discussion link at the bottom), replaced by this commit. First,
basic_archive used it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms. Second, xlog.c uses it to recycle
past WAL segments, where an overwrite should not happen (origin of the
change at f0e37a8) because there are protections about the WAL segment
to select when recycling an entry. The third and last area is related
to the write of timeline history files. writeTimeLineHistory() will
write a new timeline history file at the end of recovery on promotion,
so there should be no such files for the same timeline.
What remains is writeTimeLineHistoryFile(), that can be used in parallel
by a WAL receiver and the startup process, and some digging of the
buildfarm shows that EEXIST from a WAL receiver can happen with an error
of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\",
which would cause an automatic restart of the WAL receiver as it is
promoted to FATAL, hence this should improve the stability of the WAL
receiver as rename() would overwrite an existing TLI history file
already fetched by the startup process at recovery.
This is the second time this change is attempted, ccfbd92 being the
first one, but this time no assertions are added for the case of a TLI
history file written concurrently by the WAL receiver or the startup
process because we can expect one to exist (some of the TAP tests are
able to trigger with a proper timing).
This commit has been originally applied on v16~ as of dac1ff3090, and
we have received more reports of this issue, where clusters can become
corrupted at replay in older stable branches with multiple links
pointing to the same physical WAL segment file. This backpatch
addresses the problem for the v13~v15 range.
Author: Nathan Bossart
Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13
Discussion: https://postgr.es/m/Ym6GZbqQdlalSKSG@paquier.xyz
Discussion: https://postgr.es/m/CAJhEC04tBkYPF4q2uS_rCytauvNEVqdBAzasBEokfceFhF=KDQ@mail.gmail.com
An inplace update's invalidation messages are part of its transaction's
commit record. However, the update survives even if its transaction
aborts or we stop recovery before replaying its transaction commit.
After recovery, a backend that started in recovery could update the row
without incorporating the inplace update. That could result in a table
with an index, yet relhasindex=f. That is a source of index corruption.
This bulk invalidation avoids the functional consequences. A future
change can fix the !RecoveryInProgress() scenario without changing the
WAL format. Back-patch to v17 - v12 (all supported versions). v18 will
instead add invalidations to WAL.
Discussion: https://postgr.es/m/20240618152349.7f.nmisch@google.com
We did not recover the subtransaction IDs of prepared transactions
when starting a hot standby from a shutdown checkpoint. As a result,
such subtransactions were considered as aborted, rather than
in-progress. That would lead to hint bits being set incorrectly, and
the subtransactions suddenly becoming visible to old snapshots when
the prepared transaction was committed.
To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.
Backport to all supported versions.
Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
Commit 0668719801 changed XLogPageRead() so that it validated the page
header, if invalid page header was found reset the error message and
retried reading the page, to fix the scenario where streaming standby
got stuck at a continuation record. This change hid the error message
about invalid page header, which would make it harder for users to
investigate what the actual issue was found in WAL.
To fix the issue, this commit makes XLogPageRead() report the error
message when invalid page header is found.
When not in standby mode, an invalid page header should cause recovery
to end, not retry reading the page, so XLogPageRead() doesn't need to
validate the page header for the retry. Instead, ReadPageInternal() should
be responsible for the validation in that case. Therefore this commit
changes XLogPageRead() so that if not in standby mode it doesn't validate
the page header for the retry.
This commit has been originally pushed as of 68601985e6 for 15 and
newer versions, but not to the older branches. A recent investigation
related to WAL replay failures has showed up that the lack of this patch
in 12~14 is an issue, as we want to be able to improve the WAL reader to
make a correct distinction between the end-of-wal and OOM cases when
validating record headers. REL_11_STABLE is left out as it will be
EOL'd soon.
Reported-by: Yugo Nagata
Author: Yugo Nagata, Kyotaro Horiguchi
Reviewed-by: Ranier Vilela, Fujii Masao
Discussion: https://postgr.es/m/20210718045505.32f463ed6c227111038d8ae4@sraoss.co.jp
Discussion: https://postgr.es/m/17928-aa92416a70ff44a2@postgresql.org
Backpatch-through: 12
This is a follow-up of f663b00, that has been committed to v13 and v14,
tweaking the TAP test for two-phase transactions so as it provides
coverage for the bug that has been fixed. This change is done in its
own commit for clarity, as v15 and HEAD did not show the problematic
behavior, still missed coverage for it.
While on it, this adds a comment about the dependency of the last
partial segment rename and RecoverPreparedTransactions() at the end of
recovery, as that can be easy to miss.
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Backpatch-through: 13
When archiving is enabled, a promotion request would fail with the
following error when some 2PC transaction needs to be recovered from
WAL, preventing the promotion to complete:
FATAL: requested WAL segment pg_wal/000000010000000000000001 has already been removed
The origin of the problem is that the last partial segment of the old
timeline is renamed before recovering the 2PC data via
RecoverPreparedTransactions() at the end of recovery, causing the FATAL
because the segment wanted is now renamed with a .partial suffix. This
commit reorders a bit the end-of-recovery actions so as the execution of
recovery_end_command, the cleanup of the old segments of the old
timeline (RemoveNonParentXlogFiles) and the last partial segment rename
are done after the 2PC transaction data is recovered with
RecoverPreparedTransactions(). This makes the order of these
end-of-recovery actions more consistent with ~15, at the exception of
the end-of-recovery checkpoint that still needs to happen before all the
actions reordered here in v13 and v14, contrary to what 15~ does.
v15 and newer versions have "fixed" this problem somewhat accidentally
with 811051c, where the end-of-recovery actions got reordered. In this
case, the recovery of 2PC transactions happens before the renaming of
the last partial segment of the old timeline.
v13 and v14 are the versions that can easily see this problem as per the
refactoring of 38a95731 where XLogReaderState is reset in
XLogBeginRead() before reading the 2PC transaction data. v11 and v12
could also see this problem, but may finish by reading the 2PC data from
some of the WAL buffers instead. Perhaps something could be done for
these two branches, but I am not really excited about doing something on
these per the lack of complaints and per the fact that v11 is soon going
to be EOL'd soon (there is always a risk of breaking something).
Note that the TAP test 009_twophase.pl is able to exhibit the issue if
it enables archiving on the primary node, which does not impact the test
coverage as restore_command would remain unused. This is something that
should be changed on v15 and HEAD as well, so this will be changed in a
separate commit for clarity.
Author: Julian Markwort
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Backpatch-through: 13
The call to XLogGetReplicationSlotMinimumLSN() might return a
greater LSN than the one given to the function. Subsequent segment
number calculations might then underflow, which could result in
unexpected behavior when removing or recyling WAL files. This was
introduced with max_slot_wal_keep_size in c655077639. To fix, skip
the block of code for replication slots if the LSN is greater.
Reported-by: Xu Xingwang
Author: Kyotaro Horiguchi
Reviewed-by: Junwang Zhao
Discussion: https://postgr.es/m/17903-4288d439dee856c6%40postgresql.org
Backpatch-through: 13
When ending recovery based on recovery_target_xid matching with
recovery_target_inclusive = off, we printed an incorrect timestamp
(always 2000-01-01) in the "recovery stopping before ... transaction"
log message. This is a consequence of sloppy refactoring in
c945af80c: the code to fetch recordXtime out of the commit/abort
record used to be executed unconditionally, but it was changed
to get called only in the RECOVERY_TARGET_TIME case. We need only
flip the order of operations to restore the intended behavior.
Per report from Torsten Förtsch. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/CAKkG4_kUevPqbmyOfLajx7opAQk6Cvwkvx0HRcFjSPfRPTXanA@mail.gmail.com
Previously, we'd compress only when the active range of array entries
reached Max(4 * PROCARRAY_MAXPROCS, 2 * pArray->numKnownAssignedXids).
If max_connections is large, the first term could result in not
compressing for a long time, resulting in much wastage of cycles in
hot-standby backends scanning the array to take snapshots. Get rid
of that term, and just bound it to 2 * pArray->numKnownAssignedXids.
That however creates the opposite risk, that we might spend too much
effort compressing. Hence, consider compressing only once every 128
commit records. (This frequency was chosen by benchmarking. While
we only tried one benchmark scenario, the results seem stable over
a fairly wide range of frequencies.)
Also, force compression when processing RecoveryInfo WAL records
(which should be infrequent); the old code could perform compression
then, but would do so only after the same array-range check as for
the transaction-commit path.
Also, opportunistically run compression if the startup process is about
to wait for WAL, though not oftener than once a second. This should
prevent cases where we waste lots of time by leaving the array
not-compressed for long intervals due to low WAL traffic.
Lastly, add a simple check to keep us from uselessly compressing
when the array storage is already compact.
Back-patch, as the performance problem is worse in pre-v14 branches
than in HEAD.
Simon Riggs and Michail Nikolaev, with help from Tom Lane and
Andres Freund.
Discussion: https://postgr.es/m/CALdSSPgahNUD_=pB_j=1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw@mail.gmail.com
clang 15+ will issue a set-but-not-used warning when the only
use of a variable is in autoincrements (e.g., "foo++;").
That's perfectly sensible, but it detects a few more cases that
we'd not noticed before. Silence the warnings with our usual
methods, such as PG_USED_FOR_ASSERTS_ONLY, or in one case by
actually removing a useless variable.
One thing that we can't nicely get rid of is that with %pure-parser,
Bison emits "yynerrs" as a local variable that falls foul of this
warning. To silence those, I inserted "(void) yynerrs;" in the
top-level productions of affected grammars.
Per recently-established project policy, this is a candidate
for back-patching into out-of-support branches: it suppresses
annoying compiler warnings but changes no behavior. Hence,
back-patch to 9.5, which is as far as these patches go without
issues. (A preliminary check shows that the prior branches
need some other set-but-not-used cleanups too, so I'll leave
them for another day.)
Discussion: https://postgr.es/m/514615.1663615243@sss.pgh.pa.us
When a PostgreSQL instance performing archive recovery but not using
standby mode is promoted, and the last WAL segment that it attempted
to read ended in a partial record, the previous code would create
invalid WAL on the new timeline. The WAL from the previously timeline
would be copied to the new timeline up until the end of the last valid
record, but instead of beginning to write WAL at immediately
afterwards, the promoted server would write an overwrite contrecord at
the beginning of the next segment. The end of the previous segment
would be left as all-zeroes, resulting in failures if anything tried
to read WAL from that file.
The root of the issue is that ReadRecord() decides whether to set
abortedRecPtr and missingContrecPtr based on the value of StandbyMode,
but ReadRecord() switches to a new timeline based on the value of
ArchiveRecoveryRequested. We shouldn't try to write an overwrite
contrecord if we're switching to a new timeline, so change the test in
ReadRecod() to check ArchiveRecoveryRequested instead.
Code fix by Dilip Kumar. Comments by me incorporating suggested
language from Álvaro Herrera. Further review from Kyotaro Horiguchi
and Sami Imseih.
Discussion: http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
Discussion: http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
Crash recovery on standby may encounter missing directories
when replaying database-creation WAL records. Prior to this
patch, the standby would fail to recover in such a case;
however, the directories could be legitimately missing.
Consider the following sequence of commands:
CREATE DATABASE
DROP DATABASE
DROP TABLESPACE
If, after replaying the last WAL record and removing the
tablespace directory, the standby crashes and has to replay the
create database record again, crash recovery must be able to continue.
A fix for this problem was already attempted in 49d9cfc68b, but it
was reverted because of design issues. This new version is based
on Robert Haas' proposal: any missing tablespaces are created
during recovery before reaching consistency. Tablespaces
are created as real directories, and should be deleted
by later replay. CheckRecoveryConsistency ensures
they have disappeared.
The problems detected by this new code are reported as PANIC,
except when allow_in_place_tablespaces is set to ON, in which
case they are WARNING. Apart from making tests possible, this
gives users an escape hatch in case things don't go as planned.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Author: Paul Guo <paulguo@gmail.com>
Reviewed-by: Anastasia Lubennikova <lubennikovaav@gmail.com> (older versions)
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> (older versions)
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Diagnosed-by: Paul Guo <paulguo@gmail.com>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
This is a backpatch to branches 10-14 of the following commits:
7170f2159f Allow "in place" tablespaces.
c6f2f01611 Fix pg_basebackup with in-place tablespaces.
f6f0db4d62 Fix pg_tablespace_location() with in-place tablespaces
7a7cd84893 doc: Remove mention to in-place tablespaces for pg_tablespace_location()
5344723755 Remove unnecessary Windows-specific basebackup code.
In-place tablespaces were introduced as a testing helper mechanism, but
they are going to be used for a bugfix in WAL replay to be backpatched
to all stable branches.
I (Álvaro) had to adjust some code to account for lack of
get_dirent_type() in branches prior to 14.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Michaël Paquier <michael@paquier.xyz>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20220722081858.omhn2in5zt3g4nek@alvherre.pgsql
When a non-exclusive backup is canceled, do_pg_abort_backup() is called
and resets some variables set by pg_backup_start (pg_start_backup in v14
or before). But previously it forgot to reset the session state indicating
whether a non-exclusive backup is in progress or not in this session.
This issue could cause an assertion failure when the session running
BASE_BACKUP is terminated after it executed pg_backup_start and
pg_backup_stop (pg_stop_backup in v14 or before). Also it could cause
a segmentation fault when pg_backup_stop is called after BASE_BACKUP
in the same session is canceled.
This commit fixes the issue by making do_pg_abort_backup reset
that session state.
Back-patch to all supported branches.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Masahiko Sawada, Michael Paquier, Robert Haas
Discussion: https://postgr.es/m/3374718f-9fbf-a950-6d66-d973e027f44c@oss.nttdata.com
If a cluster is promoted (aka the control file shows a state different
than DB_IN_ARCHIVE_RECOVERY) while CreateRestartPoint() is still
processing, this function could miss an update of the control file for
"checkPoint" and "checkPointCopy" but still do the recycling and/or
removal of the past WAL segments, assuming that the to-be-updated LSN
values should be used as reference points for the cleanup. This causes
a follow-up restart attempting crash recovery to fail with a PANIC on a
missing checkpoint record if the end-of-recovery checkpoint triggered by
the promotion did not complete while the cluster abruptly stopped or
crashed before the completion of this checkpoint. The PANIC would be
caused by the redo LSN referred in the control file as located in a
segment already gone, recycled by the previous restartpoint with
"checkPoint" out-of-sync in the control file.
This commit fixes the update of the control file during restartpoints so
as "checkPoint" and "checkPointCopy" are updated even if the cluster has
been promoted while a restartpoint is running, to be on par with the set
of WAL segments actually recycled in the end of CreateRestartPoint().
7863ee4 has fixed this problem already on master, but the release timing
of the latest point versions did not let me enough time to study and fix
that on all the stable branches.
Reported-by: Fujii Masao, Rui Zhao
Author: Kyotaro Horiguchi
Reviewed-by: Nathan Bossart, Michael Paquier
Discussion: https://postgr.es/m/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com
Backpatch-through: 10
The back-patch of commit bbace5697d had
the unfortunate effect of changing the layout of PGPROC in the
back-branches, which could break extensions. This happened because it
changed the delayChkpt from type bool to type int. So, change it back,
and add a new bool delayChkptEnd field instead. The new field should
fall within what used to be padding space within the struct, and so
hopefully won't cause any extensions to break.
Per report from Markus Wanner and discussion with Tom Lane and others.
Patch originally by me, somewhat revised by Markus Wanner per a
suggestion from Michael Paquier. A very similar patch was developed
by Kyotaro Horiguchi, but I failed to see the email in which that was
posted before writing one of my own.
Discussion: http://postgr.es/m/CA+Tgmoao-kUD9c5nG5sub3F7tbo39+cdr8jKaOVEs_1aBWcJ3Q@mail.gmail.com
Discussion: http://postgr.es/m/20220406.164521.17171257901083417.horikyota.ntt@gmail.com
Crash recovery on standby may encounter missing directories when
replaying create database WAL records. Prior to this patch, the standby
would fail to recover in such a case. However, the directories could be
legitimately missing. Consider a sequence of WAL records as follows:
CREATE DATABASE
DROP DATABASE
DROP TABLESPACE
If, after replaying the last WAL record and removing the tablespace
directory, the standby crashes and has to replay the create database
record again, the crash recovery must be able to move on.
This patch adds a mechanism similar to invalid-page tracking, to keep a
tally of missing directories during crash recovery. If all the missing
directory references are matched with corresponding drop records at the
end of crash recovery, the standby can safely continue following the
primary.
Backpatch to 13, at least for now. The bug is older, but fixing it in
older branches requires more careful study of the interactions with
commit e6d8069522, which appeared in 13.
A new TAP test file is added to verify the condition. However, because
it depends on commit d6d317dbf6, it can only be added to branch
master. I (Álvaro) manually verified that the code behaves as expected
in branch 14. It's a bit nervous-making to leave the code uncovered by
tests in older branches, but leaving the bug unfixed is even worse.
Also, the main reason this fix took so long is precisely that we
couldn't agree on a good strategy to approach testing for the bug, so
perhaps this is the best we can do.
Diagnosed-by: Paul Guo <paulguo@gmail.com>
Author: Paul Guo <paulguo@gmail.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.
Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.
Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
Invalidate abortedRecPtr and missingContrecPtr after a missing
continuation record is successfully skipped on a standby. This fixes a
PANIC caused when a recently promoted standby attempts to write an
OVERWRITE_RECORD with an LSN of the previously read aborted record.
Backpatch to 10 (all stable versions).
Author: Sami Imseih <simseih@amazon.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/44D259DE-7542-49C4-8A52-2AB01534DCA9@amazon.com
Commands like ALTER TABLE SET TABLESPACE may leave files for the next
checkpoint to clean up. If such files are not removed by the time DROP
TABLESPACE is called, we request a checkpoint so that they are deleted.
However, there is presently a window before checkpoint start where new
unlink requests won't be scheduled until the following checkpoint. This
means that the checkpoint forced by DROP TABLESPACE might not remove the
files we expect it to remove, and the following ERROR will be emitted:
ERROR: tablespace "mytblspc" is not empty
To fix, add a call to AbsorbSyncRequests() just before advancing the
unlink cycle counter. This ensures that any unlink requests forwarded
prior to checkpoint start (i.e., when ckpt_started is incremented) will
be processed by the current checkpoint. Since AbsorbSyncRequests()
performs memory allocations, it cannot be called within a critical
section, so we also need to move SyncPreCheckpoint() to before
CreateCheckPoint()'s critical section.
This is an old bug, so back-patch to all supported versions.
Author: Nathan Bossart <nathandbossart@gmail.com>
Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20220215235845.GA2665318%40nathanxps13
Some specific logic is done at the end of recovery when involving 2PC
transactions:
1) Call RecoverPreparedTransactions(), to recover the state of 2PC
transactions into memory (re-acquire locks, etc.).
2) ShutdownRecoveryTransactionEnvironment(), to move back to normal
operations, mainly cleaning up recovery locks and KnownAssignedXids
(including any 2PC transaction tracked previously).
3) Switch XLogCtl->SharedRecoveryState to RECOVERY_STATE_DONE, which is
the tipping point for any process calling RecoveryInProgress() to check
if the cluster is still in recovery or not.
Any snapshot taken between steps 2) and 3) would be empty, causing any
transaction relying on a snapshot at this point to potentially corrupt
data as there could still be some 2PC transactions to track, with
RecentXmin moving backwards on successive calls to GetSnapshotData() in
the same transaction.
As SharedRecoveryState is the point to take into account to know if it
is safe to discard KnownAssignedXids, this commit moves step 2) after
step 3), so as we can never finish with empty snapshots.
This exists since the introduction of hot standby, so backpatch all the
way down. The window with incorrect snapshots is extremely small, but I
have seen it when running 023_pitr_prepared_xact.pl, as did buildfarm
member fairywren. Thomas Munro also found it independently. Special
thanks to Andres Freund for taking the time to analyze this issue.
Reported-by: Thomas Munro, Michael Paquier
Analyzed-by: Andres Freund
Discussion: https://postgr.es/m/20210422203603.fdnh3fu2mmfp2iov@alap3.anarazel.de
Backpatch-through: 9.6
Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.
A fix for this problem was already attempted in commit 515e3d84a0, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.
This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.
A new TAP test that exercises this is added, but the portability of it
is yet to be seen.
This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files. Instead, as soon as the last WAL page of
the segment is written, the archive status file is created, and the
archiver may process it. If PostgreSQL crashes before it is able to
write and flush the rest of the record (in the next WAL segment), the
wrong version of the first segment file lingers in the archive, which
causes operations such as point-in-time restores to fail.
To fix this, keep track of records that span across segments and ensure
that segments are only marked ready-for-archival once such records have
been completely written to disk.
This has always been wrong, so backpatch all the way back.
Author: Nathan Bossart <bossartn@amazon.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
This commit ensures that the wait interval in the replay delay loop
waiting for an amount of time defined by recovery_min_apply_delay is
correctly handled on reload, recalculating the delay if this GUC value
is updated, based on the timestamp of the commit record being replayed.
The previous behavior would be problematic for example with replay
still waiting even if the delay got reduced or just cancelled. If the
apply delay was increased to a larger value, the wait would have just
respected the old value set, finishing earlier.
Author: Soumyadeep Chakraborty, Ashwin Agrawal
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/CAE-ML+93zfr-HLN8OuxF0BjpWJ17O5dv1eMvSE5jsj9jpnAXZA@mail.gmail.com
Backpatch-through: 9.6
This commit changes the startup process in the standby server so that
it handles the interrupt signals after waiting for wal_retrieve_retry_interval
on the latch and resetting it, before entering another wait on the latch.
This change causes the standby server to promptly handle interrupt signals.
Otherwise, previously, there was the case where the standby needs to
wait extra five seconds to shutdown when the shutdown request arrived
while the startup process was waiting for wal_retrieve_retry_interval
on the latch.
Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty
Reviewed-by: Soumyadeep Chakraborty
Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
Per discussion of BUG #17073, back-patch to all supported versions.
Discussion: https://postgr.es/m/17073-1a5fdaed0fa5d4d0@postgresql.org
When some slots are invalidated due to the max_slot_wal_keep_size limit,
the old segment horizon should move forward to stay within the limit.
However, in commit c655077639 we forgot to call KeepLogSeg again to
recompute the horizon after invalidating replication slots. In cases
where other slots remained, the limits would be recomputed eventually
for other reasons, but if all slots were invalidated, the limits would
not move at all afterwards. Repair.
Backpatch to 13 where the feature was introduced.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Marcin Krupowicz <mk@071.ovh>
Discussion: https://postgr.es/m/17103-004130e8f27782c9@postgresql.org
Since commit c24dcd0cfd, we have been using pg_pread() to read the WAL
file, which doesn't change the seek position (unless we fall back to
the implementation in src/port/pread.c). Update comment accordingly.
Backpatch-through: 12, where we started to use pg_pread()
This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from
the archive but the WAL files for are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming,
and (6) recovery_target_timeline='latest'.
Commit ee994272ca introduced this
logic and was an improvement over the previous code, but it mishandled
this case. If recovery_target_timeline='latest' and restore_command is
set, validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.
Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.
Discussion: http://postgr.es/m/CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com
Robert Foggia of Trustwave reported that read_tablespace_map()
fails to prevent an overrun of its on-stack input buffer.
Since the tablespace map file is presumed trustworthy, this does
not seem like an interesting security vulnerability, but still
we should fix it just in the name of robustness.
While here, document that pg_basebackup's --tablespace-mapping option
doesn't work with tar-format output, because it doesn't. To make it
work, we'd have to modify the tablespace_map file within the tarball
sent by the server, which might be possible but I'm not volunteering.
(Less-painful solutions would require changing the basebackup protocol
so that the source server could adjust the map. That's not very
appetizing either.)
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format. This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.
Two of these call sites were in fact wrong:
* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.
* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended. Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units. This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.
There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds. It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.
Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.
Alexey Kondratov and Tom Lane
Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
max_slot_wal_keep_size that was added in v13 and wal_keep_segments are
the GUC parameters to specify how much WAL files to retain for
the standby servers. While max_slot_wal_keep_size accepts the number of
bytes of WAL files, wal_keep_segments accepts the number of WAL files.
This difference of setting units between those similar parameters could
be confusing to users.
To alleviate this situation, this commit renames wal_keep_segments to
wal_keep_size, and make users specify the WAL size in it instead of
the number of WAL files.
There was also the idea to rename max_slot_wal_keep_size to
max_slot_wal_keep_segments, in the discussion. But we have been moving
away from measuring in segments, for example, checkpoint_segments was
replaced by max_wal_size. So we concluded to rename wal_keep_segments
to wal_keep_size.
Back-patch to v13 where max_slot_wal_keep_size was added.
Author: Fujii Masao
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
Remove previous hack in KeepLogSeg that added a case to deal with a
(badly represented) invalid segment number. This was added for the sake
of GetWALAvailability. But it's not needed if in that function we
initialize the segment number to be retreated to the currently being
written segment, so do that instead.
Per valgrind-running buildfarm member skink, and some sparc64 animals.
Discussion: https://postgr.es/m/1724648.1594230917@sss.pgh.pa.us
Since slot_keep_segs indicates the number of WAL segments not LSN,
its datatype should not be XLogRecPtr.
Back-patch to v13 where this issue was added.
Reported-by: Atsushi Torikoshi
Author: Atsushi Torikoshi, tweaked by Fujii Masao
Discussion: https://postgr.es/m/ebd0d674f3e050222238a960cac5251a@oss.nttdata.com