On clean shutdown, walsender waits for all WAL to be replicated to a standby,
and exits. It determined whether that replication had been completed by
checking whether its sent location had been equal to a standby's flush
location. Unfortunately this condition never becomes true when the standby
such as pg_receivexlog which always returns an invalid flush location is
connecting to walsender, and then walsender waits forever.
This commit changes walsender so that it just checks a standby's write
location if a flush location is invalid.
Back-patch to 9.1 where enough infrastructure for this exists.
WalSndKill was doing things exactly backwards: it should first clear
MyWalSnd (to stop signal handlers from touching MyWalSnd->latch),
then disown the latch, and only then mark the WalSnd struct unused by
clearing its pid field.
Also, WalRcvSigUsr1Handler and worker_spi_sighup failed to preserve
errno, which is surely a requirement for any signal handler.
Per discussion of recent buildfarm failures. Back-patch as far
as the relevant code exists.
If a tablespace was crated inside PGDATA it was backed up both as part
of the PGDATA backup and as the backup of the tablespace. Avoid this
by skipping any directory inside PGDATA that contains one of the active
tablespaces.
Dimitri Fontaine and Magnus Hagander
In replication, when we shutdown the master, walsender tries to send
all the outstanding WAL records to the standby, and then to exit. This
basically means that all the WAL records are fully synced between
two servers after the clean shutdown of the master. So, after
promoting the standby to new master, we can restart the stopped
master as new standby without the need for a fresh backup from
new master.
But there was one problem so far: though walsender tries to send all
the outstanding WAL records, it doesn't wait for them to be replicated
to the standby. Then, before receiving all the WAL records,
walreceiver can detect the closure of connection and exit. We cannot
guarantee that there is no missing WAL in the standby after clean
shutdown of the master. In this case, backup from new master is
required when restarting the stopped master as new standby.
This patch fixes this problem. It just changes walsender so that it
waits for all the outstanding WAL records to be replicated to the
standby before closing the replication connection.
Per discussion, this is a fix that needs to get backpatched rather than
new feature. So, back-patch to 9.1 where enough infrastructure for
this exists.
Patch by me, reviewed by Andres Freund.
If you have clusters of different versions pointing to the same tablespace
location, we would incorrectly include all the data belonging to the other
versions, too.
Fixes bug #7986, reported by Sergey Burladyan.
If a relation file was removed when the server-side counterpart of
pg_basebackup was just about to open it to send it to the client, you'd
get a "could not open file" error. Fix that.
Backpatch to 9.1, this goes back to when pg_basebackup was introduced.
XLogRecPtrIsInvalid() only checks the xrecoff field, which is correct when
checking if a WAL record could legally begin at the given position, but WAL
sending can legally be paused at a page boundary, in which case xrecoff is
0. Use XLByteEQ(..., InvalidXLogRecPtr) instead, which checks that both
xlogid and xrecoff are 0.
9.3 doesn't have this problem because XLogRecPtr is now a single 64-bit
integer, so XLogRecPtrIsInvalid() does the right thing. Apply to 9.2, and
9.1 where pg_stat_replication view was introduced.
Kyotaro HORIGUCHI, reviewed by Fujii Masao.
Back-patch portions of commit 05b555d12b.
There doesn't seem to be any reason not to fix pg_basebackup fully, but
we can't change pg_dump's "magic" string without breaking older versions
of pg_restore. Instead, just patch pg_restore to accept either version
of the magic string, in hopes of avoiding compatibility problems when
9.3 comes out. I also fixed pg_dump to write the correct 2-block EOF
marker, since that won't create a compatibility problem with pg_restore
and it could help with some versions of tar.
Brian Weaver and Tom Lane
Walsenders must have working SIGALRM handling during InitPostgres,
but they set the handler to SIG_IGN so that nothing would happen
if a timeout was reached. This could result in two failure modes:
* If a walsender participated in a deadlock during its authentication
transaction, and was the last to wait in the deadly embrace, the deadlock
would not get cleared automatically. This would require somebody to be
trying to take out AccessExclusiveLock on multiple system catalogs, so
it's not very probable.
* If a client failed to respond to a walsender's authentication challenge,
the intended disconnect after AuthenticationTimeout wouldn't happen, and
the walsender would wait indefinitely for the client.
For the moment, fix in back branches only, since this is fixed in a
different way in the timeout-infrastructure patch that's awaiting
application to HEAD. If we choose not to apply that, then we'll need
to do this in HEAD as well.
This ensures that a standby such as pg_receivexlog will not be selected
as sync standby - which would cause the master to block waiting for
a location that could never happen.
Fujii Masao
This prevents a pg_basebackup backup session that just does a base
backup (no xlog involved at all) from becoming the synchronous slave
and thus blocking all access while it runs.
Also fixes the problem when a higher priority slave shows up it would
become the sync standby before it has reached the STREAMING state, by
making sure we can only switch to a walsender that's actually STREAMING.
Fujii Masao
The problem was that ResetLatch was not being called in the walsender loop
if the connection was terminated, so WaitLatch never sleeps until the
terminated connection is detected. In the master-branch, this was already
fixed as a side-effect of some refactoring of the loop. This commit
backports that refactoring to 9.1. 9.0 does not have this bug, because we
didn't use latches back then.
Fujii Masao
Make sure all calls are protected by HAVE_READLINK, and get the buffer
overflow tests right. Be a bit more paranoid about string length in
_tarWriteHeader(), too.
We don't have any such platforms now, but might in the future.
Also, detect cases when a tablespace symlink points to a path that
is longer than we can handle, and give a warning.
No need to do "errcode(errcode_for_file_access())", just
"errcode_for_file_access()" is enough. The extra errcode() call is useless
but harmless, so there's no user-visible bug here. Nevertheless, backpatch
to 9.1 where this code were added.
There's no need to clamp the standby's xmin to be greater than
GetOldestXmin's result; if there were any such need this logic would be
hopelessly inadequate anyway, because it fails to account for
within-database versus cluster-wide values of GetOldestXmin. So get rid of
that, and just rely on sanity-checking that the xmin is not wrapped around
relative to the nextXid counter. Also, don't reset the walsender's xmin if
the current feedback xmin is indeed out of range; that just creates more
problems than we already had. Lastly, don't bother to take the
ProcArrayLock; there's no need to do that to set xmin.
Also improve the comments about this in GetOldestXmin itself.
In oder to exit on SIGTERM when in non-walsender code,
such as do_pg_stop_backup(), we need to set the interrupt
variables that are used there, and not just the walsender
local ones.
Fix a whole bunch of signal handlers that had been hacked to do things that
might change errno, without adding the necessary save/restore logic for
errno. Also make some minor fixes in unix_latch.c, and clean up bizarre
and unsafe scheme for disowning the process's latch. While at it, rename
the PGPROC latch field to procLatch for consistency with 9.2.
Issues noted while reviewing a patch by Peter Geoghegan.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines. Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code. Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.
This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them. In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
Somebody thought it'd be cute to invent a set of Node tag numbers that were
defined independently of, and indeed conflicting with, the main tag-number
list. While this accidentally failed to fail so far, it would certainly
lead to trouble as soon as anyone wanted to, say, apply copyObject to these
node types. Clang was already complaining about the use of makeNode on
these tags, and I think quite rightly so. Fix by pushing these node
definitions into the mainstream, including putting replnodes.h where it
belongs.
The previous functions of assign hooks are now split between check hooks
and assign hooks, where the former can fail but the latter shouldn't.
Aside from being conceptually clearer, this approach exposes the
"canonicalized" form of the variable value to guc.c without having to do
an actual assignment. And that lets us fix the problem recently noted by
Bernd Helmle that the auto-tune patch for wal_buffers resulted in bogus
log messages about "parameter "wal_buffers" cannot be changed without
restarting the server". There may be some speed advantage too, because
this design lets hook functions avoid re-parsing variable values when
restoring a previous state after a rollback (they can store a pre-parsed
representation of the value instead). This patch also resolves a
longstanding annoyance about custom error messages from variable assign
hooks: they should modify, not appear separately from, guc.c's own message
about "invalid parameter value".
This means one less thing to configure when setting up synchronous
replication, and also avoids some ambiguity around what the behavior
should be when the settings of these variables conflict.
Fujii Masao, with additional hacking by me.
If a smart shutdown occurs just as a child is starting up, and the
child subsequently becomes a walsender, there is a race condition:
the postmaster might count the exstant backends, determine that there
is one normal backend, and wait for it to die off. Had the walsender
transition already occurred before the postmaster counted, it would
have proceeded with the shutdown.
To fix this, have each child that transforms into a walsender kick
the postmaster just after doing so, so that the state machine is
certain to advance.
Fujii Masao
than replication_timeout (a new GUC) milliseconds. The TCP timeout is often
too long, you want the master to notice a dead connection much sooner.
People complained about that in 9.0 too, but with synchronous replication
it's even more important to notice dead connections promptly.
Fujii Masao and Heikki Linnakangas
It originally worked this way, but was changed by commit
a8a8a3e096, since which time it's been impossible
for walreceiver to ever send a reply with write_location and flush_location
set to different values.
This is advantageous because the BG writer is alive until much later in
the shutdown sequence than WAL writer; we want to make sure that it's
possible to shut off synchronous replication during a smart shutdown,
else it might not be possible to complete the shutdown at all.
Per very reasonable gripes from Fujii Masao and Simon Riggs.
1. Don't ignore query cancel interrupts. Instead, if the user asks to
cancel the query after we've already committed it, but before it's on
the standby, just emit a warning and let the COMMIT finish.
2. Don't ignore die interrupts (pg_terminate_backend or fast shutdown).
Instead, emit a warning message and close the connection without
acknowledging the commit. Other backends will still see the effect of
the commit, but there's no getting around that; it's too late to abort
at this point, and ignoring die interrupts altogether doesn't seem like
a good idea.
3. If synchronous_standby_names becomes empty, wake up all backends
waiting for synchronous replication to complete. Without this, someone
attempting to shut synchronous replication off could easily wedge the
entire system instead.
4. Avoid depending on the assumption that if a walsender updates
MyProc->syncRepState, we'll see the change even if we read it without
holding the lock. The window for this appears to be quite narrow (and
probably doesn't exist at all on machines with strong memory ordering)
but protecting against it is practically free, so do that.
5. Remove useless state SYNC_REP_MUST_DISCONNECT, which isn't needed and
doesn't actually do anything.
There's still some further work needed here to make the behavior of fast
shutdown plausible, but that looks complex, so I'm leaving it for a
separate commit. Review by Fujii Masao.
It's not a good idea to kill the postmaster just because someone muffs
this, and it's not consistent with what we do for other, similar GUCs.
Fujii Masao, with a bit more hacking by me
SyncRepRequested() must check not only the value of the
synchronous_replication GUC but also whether max_wal_senders > 0.
Otherwise, we might end up waiting for sync rep even when there's no
possibility of a standby ever managing to connect. There are some
existing cross-checks to prevent this, but they're not quite sufficient:
the user can start the server with max_wal_senders=0,
synchronous_standby_names='', and synchronous_replication=off and then
subsequent make synchronous_standby_names not empty using pg_ctl reload,
and then SET synchronous_standby=on, leading to an indefinite hang.
Along the way, rename the global variable for the synchronous_replication
GUC to match the name of the GUC itself, for clarity.
Report by Fujii Masao, though I didn't use his patch.