Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.
Fujii Masao
Fix a longstanding thinko in replay of NEXTOID and checkpoint records: we
tried to advance nextOid only if it was behind the value in the WAL record,
but the comparison would draw the wrong conclusion if OID wraparound had
occurred since the previous value. Better to just unconditionally assign
the new value, since OID assignment shouldn't be happening during replay
anyway.
The consequences of a failure to update nextOid would be pretty minimal,
since we have long had the code set up to obtain another OID and try again
if the generated value is already in use. But in the worst case there
could be significant performance glitches while such loops iterate through
many already-used OIDs before finding a free one.
The odds of a wraparound happening during WAL replay would be small in a
crash-recovery scenario, and the length of any ensuing OID-assignment stall
quite limited anyway. But neither of these statements hold true for a
replication slave that follows a WAL stream for a long period; its behavior
upon going live could be almost unboundedly bad. Hence it seems worth
back-patching this fix into all supported branches.
Already fixed in HEAD in commit c6d76d7c82.
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation. The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer. Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200. It most likely explains the
original report as well, though we don't yet have confirmation of that.
To fix, change the code so that only bytes that are supposed to change will
change, even transiently. This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.
Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.
So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.
Back-patch to 9.0 where Hot Standby was added.
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.
In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.
Bug report by Chris Redekop
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).
To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).
This is a different implementation than what I committed to 9.2 a week ago.
That implementation was not back-patchable because it required re-initdb.
Fujii Masao
Fix a whole bunch of signal handlers that had been hacked to do things that
might change errno, without adding the necessary save/restore logic for
errno. Also make some minor fixes in unix_latch.c, and clean up bizarre
and unsafe scheme for disowning the process's latch. While at it, rename
the PGPROC latch field to procLatch for consistency with 9.2.
Issues noted while reviewing a patch by Peter Geoghegan.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines. Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
ReadRecord's habit of using both direct references to tmpRecPtr and
references to *RecPtr (which is pointing at tmpRecPtr) triggers an
optimization bug in gcc 4.6.0, which apparently has forgotten about
aliasing rules. Avoid the compiler bug, and make the code more readable
to boot, by getting rid of the direct references. Improve the comments
while at it.
Back-patch to all supported versions, in case they get built with 4.6.0.
Tom Lane, with some cosmetic suggestions from Alex Hunsaker
Before commit c016ce7281, this wasn't
needed, but now that multiple resource manager IDs can percolate down
through here, we have to make sure we know which one we've got.
Otherwise, we can confuse (for example) an XLOG_XACT_COMMIT record
with an XLOG_CHECKPOINT_SHUTDOWN record.
Review by Jaime Casanova
crash recovery, and throw an error if not. hubert depesz lubaczewski pointed
out that that situation also happens in the crash recovery following a
system crash that happens during an online backup.
We might want to do something smarter in 9.1, like put the check back for
backups taken with pg_basebackup, but that's for another patch.
The previous functions of assign hooks are now split between check hooks
and assign hooks, where the former can fail but the latter shouldn't.
Aside from being conceptually clearer, this approach exposes the
"canonicalized" form of the variable value to guc.c without having to do
an actual assignment. And that lets us fix the problem recently noted by
Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus
log messages about "parameter "wal_buffers" cannot be changed without
restarting the server". There may be some speed advantage too, because
this design lets hook functions avoid re-parsing variable values when
restoring a previous state after a rollback (they can store a pre-parsed
representation of the value instead). This patch also resolves a
longstanding annoyance about custom error messages from variable assign
hooks: they should modify, not appear separately from, guc.c's own message
about "invalid parameter value".
archive recovery.
It's possible to restore an online backup without recovery.conf, by simply
copying all the necessary WAL files to pg_xlog. "pg_basebackup -x" does that
too. That's the use case where this cross-check is useful.
Backpatch to 9.0. We used to do this in earlier versins, but in 9.0 the code
was inadvertently changed so that the check is only performed after archive
recovery.
Fujii Masao.
Change location LOG message so it works each time we pause, not
just for final pause.
Ensure that we pause only if we are in Hot Standby and can connect
to allow us to run resume function. This change supercedes the
code to override parameter recoveryPauseAtTarget to false if not
attempting to enter Hot Standby, which is now removed.
Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.
ensure that they use different checkpoints as the starting point. We use
the checkpoint redo location as a unique identifier for the base backup in
the end-of-backup record, and in the backup history file name.
Bug spotted by Fujii Masao.
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.
This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
periodically rescan the archive for new timelines, while waiting for new WAL
segments to arrive. This allows you to set up a standby server that follows
the TLI change if another standby server is promoted to master. Before this,
you had to restart the standby server to make it notice the new timeline.
This patch only scans the archive for TLI changes, it won't follow a TLI
change in streaming replication. That is much needed too, but it would be a
much bigger patch than I dare to sneak in this late in the release cycle.
There was discussion on improving the sanity checking of the WAL segments so
that the system would notice more reliably if the new timeline isn't an
ancestor of the current one, but that is not included in this patch.
Reviewed by Fujii Masao.
Emit a log message when creating a named restore point, and improve
documentation for pg_create_restore_point().
Euler Taveira de Oliveira, per suggestions from Thom Brown, with some
additional wordsmithing by me.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
though must not update the last transaction timestamp.
Plus comment and message cleanup for recent named restore point.
Fujii Masao, minor changes by me
the standby has written, flushed, and applied the WAL. At the moment, this
is for informational purposes only, the values are only shown in
pg_stat_replication system view, but in the future they will also be needed
for synchronous replication.
Extracted from Simon riggs' synchronous replication patch by Robert Haas, with
some tweaking by me.
Specifying this option makes the server not wait for the
xlog to be archived, or emit a warning that it can't,
instead leaving the responsibility with the client.
This is useful when the log is being streamed using
the streaming protocol in parallel with the backup,
without having log archiving enabled.
new recovery.conf parameter recovery_target_name allows PITR to
specify named points as recovery targets.
Jaime Casanova, reviewed by Euler Taveira de Oliveira, plus minor edits
If the standby was streaming when trigger file arrives, check also in the
archive for additional WAL files. This is a corner case since it is
unlikely that we would trigger a failover while the master is still
available and sending data to standby, while at the same time running in
archive mode and also while the streaming standby has fallen behind archive.
Someone would eventually be unlucky; we must plug all gaps however small.
Fujii Masao
Prior to 9.0, restartpoints never created, deleted, or recycled WAL
files, but now they can. This code makes log_checkpoints treat
checkpoints and restartpoints symmetrically. It also adjusts up
the documentation of the parameter to mention restartpoints.
Fujii Masao. Docs by me, as suggested by Itagaki Takahiro.
With this patch, pg_basebackup doesn't write a backup_label file in the
data directory, so it doesn't interfere with a pg_start/stop_backup() based
backup anymore. backup_label is still included in the backup, but it is
injected directly into the tar stream.
Heikki Linnakangas, reviewed by Fujii Masao and Magnus Hagander.
If wal_buffers is initially set to -1 (which is now the default), it's
replaced by 1/32nd of shared_buffers, with a minimum of 8 (the old default)
and a maximum of the XLOG segment size. The allowed range for manual
settings is still from 4 up to whatever will fit in shared memory.
Greg Smith, with implementation correction by me.
Move the actual functionality into a separate function that's
easier to call internally, and change the SQL-callable function
to be a wrapper calling this.
Also create a pg_abort_backup() function, only callable internally,
that does only the most vital parts of pg_stop_backup(), making it
safe(r) to call from error handlers.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery. Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.
Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
Purely cosmetic patch to make our coding standards more consistent ---
we were doing symbolic some places and octal other places. This patch
fixes all C-coded uses of mkdir, chmod, and umask. There might be some
other calls I missed. Inconsistency noted while researching tablespace
directory permissions issue.
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.
In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.
In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
This eliminates some crufty, special-purpose code and, as a non-trivial
side benefit, allows recovery.conf parameters to be unquoted.
Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki
Takahiro, and me.