The postmaster now checks every minute or so (worst case, at most two
minutes) that postmaster.pid is still there and still contains its own PID.
If not, it performs an immediate shutdown, as though it had received
SIGQUIT.
The original goal behind this change was to ensure that failed buildfarm
runs would get fully cleaned up, even if the test scripts had left a
postmaster running, which is not an infrequent occurrence. When the
buildfarm script removes a test postmaster's $PGDATA directory, its next
check on postmaster.pid will fail and cause it to exit. Previously, manual
intervention was often needed to get rid of such orphaned postmasters,
since they'd block new test postmasters from obtaining the expected socket
address.
However, by checking postmaster.pid and not something else, we can provide
additional robustness: manual removal of postmaster.pid is a frequent DBA
mistake, and now we can at least limit the damage that will ensue if a new
postmaster is started while the old one is still alive.
Back-patch to all supported branches, since we won't get the desired
improvement in buildfarm reliability otherwise.
We should set MyProc->databaseId after acquiring the per-database lock,
not beforehand. The old way risked deadlock against processes trying to
copy or delete the target database, since they would first acquire the lock
and then wait for processes with matching databaseId to exit; that left a
window wherein an incoming process could set its databaseId and then block
on the lock, while the other process had the lock and waited in vain for
the incoming process to exit.
CountOtherDBBackends() would time out and fail after 5 seconds, so this
just resulted in an unexpected failure not a permanent lockup, but it's
still annoying when it happens. A real-world example of a use-case is that
short-duration connections to a template database should not cause CREATE
DATABASE to fail.
Doing it in the other order should be fine since the contract has always
been that processes searching the ProcArray for a database ID must hold the
relevant per-database lock while searching. Thus, this actually removes
the former race condition that required an assumption that storing to
MyProc->databaseId is atomic.
It's been like this for a long time, so back-patch to all active branches.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.
We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.
To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.
To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.
Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.
Security: CVE-2015-0244
This was not changed in HEAD, but will be done later as part of a
pgindent run. Future pgindent runs will also do this.
Report by Tom Lane
Backpatch through all supported branches, but not HEAD
An oversight in commit e710b65c1c allowed
database names beginning with "-" to be treated as though they were secure
command-line switches; and this switch processing occurs before client
authentication, so that even an unprivileged remote attacker could exploit
the bug, needing only connectivity to the postmaster's port. Assorted
exploits for this are possible, some requiring a valid database login,
some not. The worst known problem is that the "-r" switch can be invoked
to redirect the process's stderr output, so that subsequent error messages
will be appended to any file the server can write. This can for example be
used to corrupt the server's configuration files, so that it will fail when
next restarted. Complete destruction of database tables is also possible.
Fix by keeping the database name extracted from a startup packet fully
separate from command-line switches, as had already been done with the
user name field.
The Postgres project thanks Mitsumasa Kondo for discovering this bug,
Kyotaro Horiguchi for drafting the fix, and Noah Misch for recognizing
the full extent of the danger.
Security: CVE-2013-1899
The pg_start_backup() and pg_stop_backup() functions checked the privileges
of the initially-authenticated user rather than the current user, which is
wrong. For example, a user-defined index function could successfully call
these functions when executed by ANALYZE within autovacuum. This could
allow an attacker with valid but low-privilege database access to interfere
with creation of routine backups. Reported and fixed by Noah Misch.
Security: CVE-2013-1901
We failed to ever fill the sixth line (LISTEN_ADDR), which caused the
attempt to fill the seventh line (SHMEM_KEY) to fail, so that the shared
memory key never got added to the file in standalone mode. This has been
broken since we added more content to our lock files in 9.1.
To fix, tweak the logic in CreateLockFile to add an empty LISTEN_ADDR
line in standalone mode. This is a tad grotty, but since that function
already knows almost everything there is to know about the contents of
lock files, it doesn't seem that it's any better to hack it elsewhere.
It's not clear how significant this bug really is, since a standalone
backend should never have any children and thus it seems not critical
to be able to check the nattch count of the shmem segment externally.
But I'm going to back-patch the fix anyway.
This problem had escaped notice because of an ancient (and in hindsight
pretty dubious) decision to suppress LOG-level messages by default in
standalone mode; so that the elog(LOG) complaint in AddToDataDirLockFile
that should have warned of the problem didn't do anything. Fixing that
is material for a separate patch though.
The GUC check hooks for transaction_read_only and transaction_isolation
tried to check RecoveryInProgress(), so as to disallow setting read/write
mode or serializable isolation level (respectively) in hot standby
sessions. However, GUC check hooks can be called in many situations where
we're not connected to shared memory at all, resulting in a crash in
RecoveryInProgress(). Among other cases, this results in EXEC_BACKEND
builds crashing during child process start if default_transaction_isolation
is serializable, as reported by Heikki Linnakangas. Protect those calls
by silently allowing any setting when not inside a transaction; which is
okay anyway since these GUCs are always reset at start of transaction.
Also, add a check to GetSerializableTransactionSnapshot() to complain
if we are in hot standby. We need that check despite the one in
check_XactIsoLevel() because default_transaction_isolation could be
serializable. We don't want to complain any sooner than this in such
cases, since that would prevent running transactions at all in such a
state; but a transaction can be run, if SET TRANSACTION ISOLATION is done
before setting a snapshot. Per report some months ago from Robert Haas.
Back-patch to 9.1, since these problems were introduced by the SSI patch.
Kevin Grittner and Tom Lane, with ideas from Heikki Linnakangas
The accurate info about what's in a lock file has been in miscadmin.h
for some time, so let's just make this comment point there instead of
maintaining a duplicative copy.
superuser doesn't have doesn't make much sense, as a superuser can do
whatever he wants through other means, anyway. So instead of granting
replication privilege to superusers in CREATE USER time by default, allow
replication connection from superusers whether or not they have the
replication privilege.
Patch by Noah Misch, per discussion on bug report #6264
lost. The only way we detect that at the moment is when write() fails when
we try to write to the socket.
Florian Pflug with small changes by me, reviewed by Greg Jaskiewicz.
This adds some I/O stats to the logging of autovacuum (when the
operation takes long enough that log_autovacuum_min_duration causes it
to be logged), so that it is easier to tune. Notably, it adds buffer
I/O counts (hits, misses, dirtied) and read and write rate.
Authors: Greg Smith and Noah Misch
The statement start timestamp was not set before initiating the transaction
that is used to look up client authentication information in pg_authid.
In consequence, enable_sig_alarm computed a wrong value (far in the past)
for statement_fin_time. That didn't have any immediate effect, because the
timeout alarm was set without reference to statement_fin_time; but if we
subsequently blocked on a lock for a short time, CheckStatementTimeout
would consult the bogus value when we cancelled the lock timeout wait,
and then conclude we'd timed out, leading to immediate failure of the
connection attempt. Thus an innocent "vacuum full pg_authid" would cause
failures of concurrent connection attempts. Noted while testing other,
more serious consequences of vacuum full on system catalogs.
We should set the statement timestamp before StartTransactionCommand(),
so that the transaction start timestamp is also valid. I'm not sure if
there are any non-cosmetic effects of it not being valid, but the xact
timestamp is at least sent to the statistics machinery.
Back-patch to 9.0. Before that, the client authentication timeout was done
outside any transaction and did not depend on this state to be valid.
Failure to distinguish these cases is the real cause behind the recent
reports of Windows builds crashing on 'infinity'::timestamp, which was
directly due to failure to establish a value of timezone_abbreviations
in postmaster child processes. The postmaster had the desired value,
but write_one_nondefault_variable() didn't transmit it to backends.
To fix that, invent a new value PGC_S_DYNAMIC_DEFAULT, and be sure to use
that or PGC_S_ENV_VAR (as appropriate) for "default" settings that are
computed during initialization. (We need both because there's at least
one variable that could receive a value from either source.)
This commit also fixes ProcessConfigFile's failure to restore the correct
default value for certain GUC variables if they are set in postgresql.conf
and then removed/commented out of the file. We have to recompute and
reinstall the value for any GUC variable that could have received a value
from PGC_S_DYNAMIC_DEFAULT or PGC_S_ENV_VAR sources, and there were a
number of oversights. (That whole thing is a crock that needs to be
redesigned, but not today.)
However, I intentionally didn't make it work "exactly right" for the cases
of timezone and log_timezone. The exactly right behavior would involve
running select_default_timezone, which we'd have to do independently in
each postgres process, causing the whole database to become entirely
unresponsive for as much as several seconds. That didn't seem like a good
idea, especially since the variable's removal from postgresql.conf might be
just an accidental edit. Instead the behavior is to adopt the previously
active setting as if it were default.
Note that this patch creates an ABI break for extensions that use any of
the PGC_S_XXX constants; they'll need to be recompiled.
This option turns off autovacuum, prevents non-super-user connections,
and enables oid setting hooks in the backend. The code continues to use
the old autoavacuum disable settings for servers with earlier catalog
versions.
This includes a catalog version bump to identify servers that support
the -b option.
Fix broken test for pre-existing postmaster, caused by wrong code for
appending lines to the lockfile; don't write a failed listen_address
setting into the lockfile; don't arbitrarily change the location of the
data directory in the lockfile compared to previous releases; provide more
consistent and useful definitions of the socket path and listen_address
entries; avoid assuming that pg_ctl has the same DEFAULT_PGSOCKET_DIR as
the postmaster; assorted code style improvements.
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.
Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
After a SQL object is created, we provide an opportunity for security
or logging plugins to get control; for example, a security label provider
could use this to assign an initial security label to newly created
objects. The basic infrastructure is (hopefully) reusable for other types
of events that might require similar treatment.
KaiGai Kohei, with minor adjustments.
make sense for walsender, but for example application_name and client_encoding
do. We still don't apply per-role settings from pg_db_role_setting, because
that would require connecting to a database to read the table.
Fujii Masao
socket lockfile) when writing them. The lack of an fsync here may well
explain two different reports we've seen of corrupted lockfile contents,
which doesn't particularly bother the running server but can prevent a
new server from starting if the old one crashes. Per suggestion from
Alvaro.
Back-patch to all supported versions.
Normal superuser processes are allowed to connect even when the database
system is shutting down, or when fewer than superuser_reserved_connection
slots remain. This is intended to make sure an administrator can log in
and troubleshoot, so don't extend these same courtesies to users connecting
for replication.
database to connect to. This is necessary for the walsender code to work
properly (it was previously using an untenable assumption that template1 would
always be available to connect to). This also gets rid of a small security
shortcoming that was introduced in the original patch to eliminate the flat
authentication files: before, you could find out whether or not the requested
database existed even if you couldn't pass the authentication checks.
The changes needed to support this are mainly just to treat pg_authid and
pg_auth_members as nailed relations, so that we can read them without having
to be able to locate real pg_class entries for them. This mechanism was
already debugged for pg_database, but we hadn't recognized the value of
applying it to those catalogs too.
Since the current code doesn't have support for accessing toast tables before
we've brought up all of the relcache, remove pg_authid's toast table to ensure
that no one can store an out-of-line toasted value of rolpassword. The case
seems quite unlikely to occur in practice, and was effectively unsupported
anyway in the old "flatfiles" implementation.
Update genbki.pl to actually implement the same rules as bootstrap.c does for
not-nullability of catalog columns. The previous coding was a bit cheesy but
worked all right for the previous set of bootstrap catalogs. It does not work
for pg_authid, where rolvaliduntil needs to be nullable.
Initdb forced due to minor catalog changes (mainly the toast table removal).
those process types that go through InitPostgres; in particular, bootstrap
and standalone-backend cases. This ensures that we have set up a PGPROC
and done some other basic initialization steps (corresponding to the
if (IsUnderPostmaster) block in AuxiliaryProcessMain) before we attempt to
run WAL recovery in a standalone backend. As was discovered last September,
this is necessary for some corner-case code paths during WAL recovery,
particularly end-of-WAL cleanup.
Moving the bootstrap case here too is not necessary for correctness, but it
seems like a good idea since it reduces the number of distinct code paths.
was broken for a replication connection and no messages were
displayed on either standby or primary, at any debug level.
Connection messages needed to diagnose session drop/reconnect
events. Use LOG mode for now, discuss lowering in later releases.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4). This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.
Design and review by Tom Lane.
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me