Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.
This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.
Back-patch to 9.1 where promote signal file was introduced.
Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.
Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
HotStandbyActiveInReplay, introduced in 061b079f, only allowed WAL
replay to happen in the startup process, missing the single user case.
This buglet is fairly harmless as it only causes problems when single
user mode in an assertion enabled build is used to replay a btree vacuum
record.
Backpatch to 9.2. 061b079f was backpatched further, but the assertion
was not.
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk. This could lead to data corruption.
Back-patch to all supported versions.
Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.
This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.
To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
We used time(null) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.
Backpatch to 9.3 and above, where the fast promotion was introduced.
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.
That logic had a error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.
That's possible when the a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because inbetween reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.
To fix simply also allow starting when the control file indicates
"shutdown in recovery". There's nicer fixes imaginable, but they'd be
more invasive.
Backpatch to 9.2 where support for taking basebackups from standbys
was added.
Unlogged relations are reset at the end of crash recovery as they're
only synced to disk during a proper shutdown. Unfortunately that and
later steps can fail, e.g. due to running out of space. This reset
was, up to now performed after marking the database as having finished
crash recovery successfully. As out of space errors trigger a crash
restart that could lead to the situation that not all unlogged
relations are reset.
Once that happend usage of unlogged relations could yield errors like
"could not open file "...": No such file or directory". Luckily
clusters that show the problem can be fixed by performing a immediate
shutdown, and starting the database again.
To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT)
earlier, before marking the database as having successfully recovered.
Discussion: 20140912112246.GA4984@alap3.anarazel.de
Backpatch to 9.1 where unlogged tables were introduced.
Abhijit Menon-Sen and Andres Freund
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.
To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.
In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.
Backpatch to all supported versions; this has been racy since hot standby
was introduced.
Previously the archive recovery always created .ready file for
the last WAL file of the old timeline at the end of recovery even when
it's restored from the archive and has .done file. That is, there was
the case where the WAL file had both .ready and .done files.
This caused the already-archived WAL file to be archived again.
This commit prevents the archive recovery from creating .ready file
for the last WAL file if it has .done file, in order to prevent it from
being archived again.
This bug was added when cascading replication feature was introduced,
i.e., the commit 5286105800.
So, back-patch to 9.2, where cascading replication was added.
Reviewed by Michael Paquier
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.
That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.
This was reported in bug #10675 by Maxim Boguk.
Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.
Backpatch to 9.1 where unlogged tables were introduced.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:
* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)
The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.
Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time. The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.
Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum. Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.
Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.
While at it, update some comments that hadn't been updated since
multixacts were changed.
Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
This was not changed in HEAD, but will be done later as part of a
pgindent run. Future pgindent runs will also do this.
Report by Tom Lane
Backpatch through all supported branches, but not HEAD
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is
allocated a couple of weeks ago. With the default settings, they are both
8k, but they can be changed at compile-time.
CheckRequiredParameterValues() should perform the checks if archive recovery
was requested, even if we are going to perform crash recovery first.
Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive
recovery mode.
When entering crash recovery followed by archive recovery, and the latest
checkpoint is a shutdown checkpoint, and there are no more WAL records to
replay before transitioning from crash to archive recovery, we would not
immediately allow read-only connections in hot standby mode even if we
could. That's because when starting from a shutdown checkpoint, we set
lastReplayedEndRecPtr incorrectly to the record before the checkpoint
record, instead of the checkpoint record itself. We don't run the redo
routine of the shutdown checkpoint record, but starting recovery from it
goes through the same motions, so it should be considered as replayed.
Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected,
so backpatch to 9.0.
Commit abf5c5c9a4 added a bogus while-
statement after the for(;;)-loop. It went unnoticed in testing, because
it was dead code.
Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced
this was also applied to 9.2, but not the bogus while-loop part, because
the code in 9.2 looks quite different.
Coverity identified a number of places in which it couldn't prove that a
string being copied into a fixed-size buffer would fit. We believe that
most, perhaps all of these are in fact safe, or are copying data that is
coming from a trusted source so that any overrun is not really a security
issue. Nonetheless it seems prudent to forestall any risk by using
strlcpy() and similar functions.
Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports.
In addition, fix a potential null-pointer-dereference crash in
contrib/chkpass. The crypt(3) function is defined to return NULL on
failure, but chkpass.c didn't check for that before using the result.
The main practical case in which this could be an issue is if libc is
configured to refuse to execute unapproved hashing algorithms (e.g.,
"FIPS mode"). This ideally should've been a separate commit, but
since it touches code adjacent to one of the buffer overrun changes,
I included it in this commit to avoid last-minute merge issues.
This issue was reported by Honza Horak.
Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
If there is a WAL segment with same ID but different TLI present in both
the WAL archive and pg_xlog, prefer the one with higher TLI. Before this
patch, the archive was polled first, for all expected TLIs, and only if no
file was found was pg_xlog scanned. This was a change in behavior from 9.3,
which first scanned archive and pg_xlog for the highest TLI, then archive
and pg_xlog for the next highest TLI and so forth. This patch reverts the
behavior back to what it was in 9.2.
The reason for this is that if for example you try to do archive recovery
to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is
not archived yet, we would replay past the timeline switch point on
timeline 1 using the archived files, before even looking timeline 2's files
in pg_xlog
Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior
was changed.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted. (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.) In Hot Standby,
the WAL replay process must likewise lock every leaf page. There were
several bugs in the code for that:
* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error. This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages". To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.
* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about). Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.
* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page. However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.
The first two of these bugs were diagnosed by Andres Freund, the other one
by me. Fixes based on ideas from Heikki Linnakangas and myself.
This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
If pause_at_recovery_target is set, recovery pauses *before* applying the
target record, even if recovery_target_inclusive is set. If you then
continue with pg_xlog_replay_resume(), it will apply the target record
before ending recovery. In other words, if you log in while it's paused
and verify that the database looks OK, ending recovery changes its state
again, possibly destroying data that you were tring to salvage with PITR.
Backpatch to 9.1, this has been broken since pause_at_recovery_target was
added.
When starting WAL replay from an online checkpoint, the last replayed WAL
record variable was initialized using the checkpoint record's location, even
though the records between the REDO location and the checkpoint record had
not been replayed yet. That was noted as "slightly confusing" but harmless
in the comment, but in some cases, it fooled CheckRecoveryConsistency to
incorrectly conclude that we had already reached a consistent state
immediately at the beginning of WAL replay. That caused the system to accept
read-only connections in hot standby mode too early, and also PANICs with
message "WAL contains references to invalid pages".
Fix by initializing the variables to the REDO location instead.
In 9.2 and above, change CheckRecoveryConsistency() to use
lastReplayedEndRecPtr variable when checking if backup end location has
been reached. It was inconsistently using EndRecPtr for that check, but
lastReplayedEndRecPtr when checking min recovery point. It made no
difference before this patch, because in all the places where
CheckRecoveryConsistency was called the two variables were the same, but
it was always an accident waiting to happen, and would have been wrong
after this patch anyway.
Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0,
where hot standby was introduced.
And the same for do_pg_stop_backup. The code in do_pg_* is not allowed
to access the catalogs. For manual base backups, the permissions
check can be handled in the calling function, and for streaming
base backups only users with the required permissions can get past
the authentication step in the first place.
Reported by Antonin Houska, diagnosed by Andres Freund
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway. Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it. This has made not truncating the content of
pg_multixact/ not defensible anymore.
To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().
At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.
Back-patch to 9.0. 8.4 also has the problem, but since there's no hot
standby there, it's much less pressing. In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required. Add appropriate checks to make sure that's not
happening.
Andres Freund
This keeps the usual trigger file name unchanged from 9.2, avoiding nasty
issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa.
The fallback behavior of creating a full checkpoint before starting up is now
triggered by a file called "fallback_promote". That can be useful for
debugging purposes, but we don't expect any users to have to resort to that
and we might want to remove that in the future, which is why the fallback
mechanism is undocumented.
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.
Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
elog.c has historically treated LOG messages as low-priority during
bootstrap and standalone operation. This has led to confusion and even
masked a bug, because the normal expectation of code authors is that
elog(LOG) will put something into the postmaster log, and that wasn't
happening during initdb. So get rid of the special-case rule and make
the priority order the same as it is in normal operation. To keep from
cluttering initdb's output and the behavior of a standalone backend,
tweak the severity level of three messages routinely issued by xlog.c
during startup and shutdown so that they won't appear in these cases.
Per my proposal back in December.
Since commit f21bb9cfb5, this function
ignores the caller-provided length and loops until it finds a
terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the
previous loop control logic. In passing, revert the addition of an
unused variable by the same commit, presumably a debugging relic.
Seems cleaner to get the currently-replayed TLI in the same call to
GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the
comment what the code does when recovery has already ended
(RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make
resetting ThisTimeLineID afterwards more explicit.
Make slightly better decisions about indentation than what pgindent
is capable of. Mostly breaking out long function calls into one
line per argument, with a few other minor adjustments.
No functional changes- all whitespace.
pgindent ran cleanly (didn't change anything) after.
Passes all regressions.
checkpointer needs to reset ThisTimeLineID after
a restartpoint to allow installing/recycling new
WAL files. If recovery has already ended this
would leave ThisTimeLineID set incorrectly and
so we must reset it otherwise later checkpoints
do not have the correct timeline.
Bug report by Heikki Linnakangas.
Further investigation by Heikki and myself.
This simplifies the handling of crashes after fast promotion and various
minor cases that can exist in short timing windows around that case.
Broad fix to bug reported by Michael Paquier on -hackers,
approach prompted by Heikki Linnakangas
If a standby server has a cascading standby server connected to it, it's
possible that WAL has already been sent up to the next WAL page boundary,
splitting a WAL record in the middle, when the first standby server is
promoted. Don't throw an assertion failure or error in walsender if that
happens.
Also, fix a variant of the same bug in pg_receivexlog: if it had already
received WAL on previous timeline up to a segment boundary, when the
upstream standby server is promoted so that the timeline switch record falls
on the previous segment, pg_receivexlog would miss the segment containing
the timeline switch. To fix that, have walsender send the position of the
timeline switch at end-of-streaming, in addition to the next timeline's ID.
It was previously assumed that the switch happened exactly where the
streaming stopped.
Note: this is an incompatible change in the streaming protocol. You might
get an error if you try to stream over timeline switches, if the client is
running 9.3beta1 and the server is more recent. It should be fine after a
reconnect, however.
Reported by Fujii Masao.