The old formula didn't take into account that each WAL sender process needs
a spinlock. We had also already exceeded the fixed number of spinlocks
reserved for misc purposes (10). Bump that to 30.
Backpatch to 9.0, where WAL senders were introduced. If I counted correctly,
9.0 had exactly 10 predefined spinlocks, and 9.1 exceeded that, but bump the
limit in 9.0 too because 10 is uncomfortably close to the edge.
The point of turning off track_activities is to avoid this reporting
overhead, but a thinko in commit 4f42b546fd87a80be30c53a0f2c897acb826ad52
caused pgstat_report_activity() to perform half of its updates anyway.
Fix that, and also make sure that we clear all the now-disabled fields
when transitioning to the non-reporting state.
Notice and complain about PQcancel() failures. Also, don't dump core if
an error PGresult doesn't contain severity and message subfields, as it
might not if it was generated by libpq itself. (We have a longstanding
TODO item to improve that, but in the meantime isolationtester had better
cope.)
I tripped across the latter item while investigating a trouble report on
buildfarm member spoonbill. As for the former, there's no evidence that
PQcancel failure is actually involved in spoonbill's problem, but it still
seems like a bad idea to ignore an error return code.
An oversight in commit e710b65c1c56ca7b91f662c63d37ff2e72862a94 allowed
database names beginning with "-" to be treated as though they were secure
command-line switches; and this switch processing occurs before client
authentication, so that even an unprivileged remote attacker could exploit
the bug, needing only connectivity to the postmaster's port. Assorted
exploits for this are possible, some requiring a valid database login,
some not. The worst known problem is that the "-r" switch can be invoked
to redirect the process's stderr output, so that subsequent error messages
will be appended to any file the server can write. This can for example be
used to corrupt the server's configuration files, so that it will fail when
next restarted. Complete destruction of database tables is also possible.
Fix by keeping the database name extracted from a startup packet fully
separate from command-line switches, as had already been done with the
user name field.
The Postgres project thanks Mitsumasa Kondo for discovering this bug,
Kyotaro Horiguchi for drafting the fix, and Noah Misch for recognizing
the full extent of the danger.
Security: CVE-2013-1899
The pg_start_backup() and pg_stop_backup() functions checked the privileges
of the initially-authenticated user rather than the current user, which is
wrong. For example, a user-defined index function could successfully call
these functions when executed by ANALYZE within autovacuum. This could
allow an attacker with valid but low-privilege database access to interfere
with creation of routine backups. Reported and fixed by Noah Misch.
Security: CVE-2013-1901
In commit 0f61d4dd1b4f95832dcd81c9688dac56fd6b5687, I added code to copy up
column width estimates for each column of a subquery. That code supposed
that the subquery couldn't have any output columns that didn't correspond
to known columns of the current query level --- which is true when a query
is parsed from scratch, but the assumption fails when planning a view that
depends on another view that's been redefined (adding output columns) since
the upper view was made. This results in an assertion failure or even a
crash, as per bug #8025 from lindebg. Remove the Assert and instead skip
the column if its resno is out of the expected range.
9.2 uses a kluge representation of "indislive"; we have to account for
that when examining pg_index. Simplest solution is to check indisready
for 9.0 and 9.1 as well; that's harmless though unnecessary, so it's
not worth making a version distinction for.
Fixes oversight in commit 683abc73dff549e94555d4020dae8d02f32ed78b,
as noted by Andres Freund.
Previously, if the postmaster initialized OpenSSL's PRNG (which it will do
when ssl=on in postgresql.conf), the same pseudo-random state would be
inherited by each forked child process. The problem is masked to a
considerable extent if the incoming connection uses SSL encryption, but
when it does not, identical pseudo-random state is made available to
functions like contrib/pgcrypto. The process's PID does get mixed into any
requested random output, but on most systems that still only results in 32K
or so distinct random sequences available across all Postgres sessions.
This might allow an attacker who has database access to guess the results
of "secure" operations happening in another session.
To fix, forcibly reset the PRNG after fork(). Each child process that has
need for random numbers from OpenSSL's generator will thereby be forced to
go through OpenSSL's normal initialization sequence, which should provide
much greater variability of the sequences. There are other ways we might
do this that would be slightly cheaper, but this approach seems the most
future-proof against SSL-related code changes.
This has been assigned CVE-2013-1900, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
Back-patch to all supported branches.
Marko Kreen
In a heap update, if the old and new tuple were on different pages, and the
new page no longer existed (because it was subsequently truncated away by
vacuum), heap_xlog_update forgot to release the pin on the old buffer. This
bug was introduced by the "Fix multiple problems in WAL replay" patch,
commit 3bbf668de9f1bc172371681e80a4e769b6d014c8 (on master branch).
With full_page_writes=off, this triggered an "incorrect local pin count"
error later in replay, if the old page was vacuumed.
This fixes bug #7969, reported by Yunong Xiao. Backpatch to 9.0, like the
commit that introduced this bug.
Dumping invalid indexes can cause problems at restore time, for example
if the reason the index creation failed was because it tried to enforce
a uniqueness condition not satisfied by the table's data. Also, if the
index creation is in fact still in progress, it seems reasonable to
consider it to be an uncommitted DDL change, which pg_dump wouldn't be
expected to dump anyway.
Back-patch to all active versions, and teach them to ignore invalid
indexes in servers back to 8.2, where the concept was introduced.
Michael Paquier
If you have clusters of different versions pointing to the same tablespace
location, we would incorrectly include all the data belonging to the other
versions, too.
Fixes bug #7986, reported by Sergey Burladyan.
These programs don't work against 9.0 or earlier servers, so check that when
the connection is made. That's better than a cryptic error message you got
before.
Also, these programs won't work with a 9.3 server, because the WAL streaming
protocol was changed in a non-backwards-compatible way. As a general rule,
we don't make any guarantee that an old client will work with a new server,
so check that. However, allow a 9.1 client to connect to a 9.2 server, to
avoid breaking environments that currently work; a 9.1 client happens to
work with a 9.2 server, even though we didn't make any great effort to
ensure that.
This patch is for the 9.1 and 9.2 branches, I'll commit a similar patch to
master later. Although this isn't a critical bug fix, it seems safe enough
to back-patch. The error message you got when connecting to a 9.3devel
server without this patch was cryptic enough to warrant backpatching.
Most (all?) of Russia has moved to what's effectively year-round daylight
savings time, so that the "standard" zone names now mean an hour later
than they used to. Update that, notably changing MSK as per recent
complaint from Sergey Konoplev, but also CHOT, GET, IRKT, KGT, KRAT,
MAGT, NOVT, OMST, VLAT, YAKT, YEKT. The corresponding DST abbreviations
are presumably now obsolete, but I left them in place with their old
definitions, just to reduce any possible breakage from this change.
Also add VOLT (Europe/Volgograd), which for some reason we never had
before, as well as MIST (Antarctica/Macquarie), and fix obsolete
definitions of MAWT, TKT, and WST.
This appears to cause some intermittent file system problems
on Windows 8. Instead, set up the old data directory in its
intended final location to start with.
When RETURNING is specified, ExecDelete would return a virtual-tuple slot
that could contain pointers into an already-unpinned disk buffer. Another
process could change the buffer contents before we get around to using the
data, resulting in garbage results or even a crash. This seems of fairly
low probability, which may explain why there are no known field reports of
the problem, but it's definitely possible. Fix by forcing the result slot
to be "materialized" before we release pin on the disk buffer.
Back-patch to 9.0; in earlier branches there is no bug because
ExecProcessReturning sent the tuple to the destination immediately. Also,
this is already fixed in HEAD as part of the writable-foreign-tables patch
(where the fix is necessary for DELETE RETURNING to work at all with
postgres_fdw).
The previous coding of this function could get into situations where it
would never terminate, because successive passes would re-add EMPTY arcs
that had been removed by the previous pass. Rewrite the function
completely using a new algorithm that is guaranteed to terminate, and
also seems to be usually faster than the old one. Per Tcl bugs 3604074
and 3606683.
Tom Lane and Don Porter
If we were about to enter archive recovery after crash recovery, we scanned
the archive for the latest tli history file, and set the recovery target
timeline to that. However, when we actually tried to read the history file,
we would not fetch the file from the archive, because we were not in archive
recovery yet.
To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch
the file from archive if archive recovery is requested, even if we're not in
archive recovery yet.
Backpatch to 9.2. Mitsumasa KONDO
I missed to returns in the middle of ReadRecord function in my previous fix.
If a WAL file was not found at all during crash recovery, XLogPageRead would
return 'false', and ReadRecord would return without entering archive recovery.
9.2 only. In master, the code is structured differently and does not have this
problem.
Kyotaro HORIGUCHI, Mitsumasa KONDO and me.
formatting.c used locale-dependent case folding rules in some code paths
where the result isn't supposed to be locale-dependent, for example
to_char(timestamp, 'DAY'). Since the source data is always just ASCII
in these cases, that usually didn't matter ... but it does matter in
Turkish locales, which have unusual treatment of "i" and "I". To confuse
matters even more, the misbehavior was only visible in UTF8 encoding,
because in single-byte encodings we used pg_toupper/pg_tolower which
don't have locale-specific behavior for ASCII characters. Fix by providing
intentionally ASCII-only case-folding functions and using these where
appropriate. Per bug #7913 from Adnan Dursun. Back-patch to all active
branches, since it's been like this for a long time.
I fixed this code back in commit 841b4a2d5, but didn't think carefully
enough about the behavior near zero, which meant it improperly rejected
1999-12-31 24:00:00. Per report from Magnus Hagander.
fmgr_sql had been designed on the assumption that the FmgrInfo it's called
with has only query lifespan. This is demonstrably unsafe in connection
with range types, as shown in bug #7881 from Andrew Gierth. Fix things
so that we re-generate the function's cache data if the (sub)transaction
it was made in is no longer active.
Back-patch to 9.2. This might be needed further back, but it's not clear
whether the case can realistically arise without range types, so for now
I'll desist from back-patching further.
Careless use of TopMemoryContext for I/O function data meant that repeated
use of spi_prepare and spi_freeplan would leak memory at the session level,
as per report from Christian Schröder. In addition, spi_prepare
leaked a lot of transient data within the current plperl function's SPI
Proc context, which would be a problem for repeated use of spi_prepare
within a single plperl function call; and it wasn't terribly careful
about releasing permanent allocations in event of an error, either.
In passing, clean up some copy-and-pasteos in query-lookup error messages.
Alex Hunsaker and Tom Lane
parseqatom() failed to check for an error return (NULL result) from its
recursive call to parsebranch(), and in consequence could crash with a
null-pointer dereference after an error return. This bug has been there
since day one, but wasn't noticed before, probably because most error cases
in parsebranch() didn't actually lead to returning NULL. Add the missing
error check, and also tweak parsebranch() to exit in a less indirect
fashion after a call to parseqatom() fails.
Report by Tomasz Karlik, fix by me.
We must still initialize minRecoveryPoint if we start straight with archive
recovery, e.g when recovering from a normal base backup taken with
pg_start/stop_backup. Otherwise we never consider the system consistent.
If you create a base backup using an atomic filesystem snapshot, and try to
perform PITR starting from that base backup, or if you just kill a master
server and create recovery.conf to put it into standby mode, we don't know
how far we need to recover before reaching consistency. Normally in crash
recovery, we replay all the WAL present in pg_xlog, and assume that we're
consistent after that. And normally in archive recovery, minRecoveryPoint,
backupEndRequired, or backupEndPoint is set in the control file, indicating
how far we need to replay to reach consistency. But if the server was
previously up and running normally, and you kill -9 it or take an atomic
filesystem snapshot, none of those fields are set in the control file.
The solution is to perform crash recovery first, replaying all the WAL in
pg_xlog. After that's done, we assume that the system is consistent like in
normal crash recovery, and switch to archive recovery mode after that.
Per report from Kyotaro HORIGUCHI. In his scenario, recovery.conf was
created after "pg_ctl stop -m i". I'm not sure we need to support that exact
scenario, but we should support backing up using a filesystem snapshot,
which looks identical.
This issue goes back to at least 9.0, where hot standby was introduced and
we started to track when consistency is reached. In 9.1 and 9.2, we would
open up for hot standby too early, and queries could briefly see an
inconsistent state. But 9.2 made it more visible, as we started to PANIC if
we see a reference to a non-existing page during recovery, if we've already
reached consistency. This is a fairly big patch, so back-patch to 9.2 only,
where the issue is more visible. We can consider back-patching further after
this has received some more testing in 9.2 and master.
If a database name contained a '=' character, pg_dumpall failed. The problem
was in the way pg_dumpall passes the database name to pg_dump on the
command line. If it contained a '=' character, pg_dump would interpret it
as a libpq connection string instead of a plain database name.
To fix, pass the database name to pg_dump as a connection string,
"dbname=foo", with the database name escaped if necessary.
Back-patch to all supported branches.
Revert my earlier fix for the bug that unarchived WAL files get deleted on
crash recovery, commit c9cc7e05c6d82a9781883a016c70d95aa4923122. We create
a .done file for files streamed or restored from archive, so the WAL file
recycling logic used during normal operation works just as well during
archive recovery.
Per Fujii Masao's suggestion.
Since a backend adds itself to the global listener array during
Exec_ListenPreCommit, it's inappropriate for it to remove itself during
Exec_UnlistenCommit or Exec_UnlistenAllCommit --- that leads to failure
when committing a transaction that did UNLISTEN then LISTEN, since we end
up not registered though we should be. (This leads to missing later
notifications, or to Assert failures in assert-enabled builds.) Instead
deal with deregistering at the bottom of AtCommit_Notify, when we know the
final state of the listenChannels list.
Also, simplify the representation of registration status by replacing the
transient backendHasExecutedInitialListen flag with an amRegisteredListener
flag.
Per report from Greg Sabino Mullane. Back-patch to 9.0, where the problem
was introduced during the LISTEN/NOTIFY rewrite.
After further reflection I was unconvinced that the existing coding is
guaranteed to return valid union datums in every code path for multi-column
indexes. Fix that by forcing a gistunionsubkey() call at the end of the
recursion. Having done that, we can remove some clearly-redundant calls
elsewhere. This should be a little faster for multi-column indexes (since
the previous coding would uselessly do such a call for each column while
unwinding the recursion), as well as much harder to break.
Also, simplify the handling of cases where one side or the other of a
primary split contains only don't-care tuples. The previous coding used a
very ugly hack in removeDontCares() that essentially forced one random
tuple to be treated as non-don't-care, providing a random initial choice of
seed datum for the secondary split. It seems unlikely that that method
will give better-than-random splits. Instead, treat such a split as
degenerate and just let the next column determine the split, the same way
that we handle fully degenerate cases where the two sides produce identical
union datums.
This LOG message was put in over five years ago with the evident
expectation that we'd make all GiST opclasses support secondary split
directly. However, no such thing ever happened, and indeed the number of
opclasses supporting it decreased to zero in 9.2. The reason is that
improving on the default implementation isn't that easy --- the
opclass-specific code that did exist, before 9.2, doesn't appear to have
been any improvement over the default.
Hence, remove the message altogether. There's certainly no point in
nagging users about this in released branches, but I doubt that we'll
ever implement complete opclass-specific support anyway.
Not only is this implementation of secondary-split not better than the
default implementation in gistsplit.c, it's actually worse. The gistsplit.c
code at least looks to see if switching the left and right sides would make
a better merge with the previously-split tuples, while this doesn't.
In any case it's rather useless to support secondary split only in an edge
case. There used to be more complete support for it here (in chooseLR()),
but that was removed in commit 7f3bd86843e5aad84585a57d3f6b80db3c609916.
It appears to me though that the chooseLR() code was really isomorphic to
the default implementation, since it was still based on choosing the cheaper
way of adding two sub-split vectors that had been chosen without regard to
the primary split initially. I think an implementation of secondary split
that could beat the default implementation would have to be pretty fully
integrated into the split algorithm, not plastered on at the end.
Back-patch to 9.2, but not further; previous branches have the chooseLR()
code which I don't feel a great need to mess with. This is mainly so we
just have two behaviors and not three among the various branches (IOW, this
patch is cleanup for commit 7f3bd86843e5aad84585a57d3f6b80db3c609916's
incomplete removal of secondary-split support).
Improve comments, rename some variables and functions, slightly simplify
a couple of APIs, in an attempt to make this code readable by people other
than its original author.
Even though this is essentially just cosmetic, back-patch to all active
branches, because otherwise it's going to make back-patching future fixes
in this file very painful.
While there's considerable doubt that we want fuzzy behavior in the
geometric operators at all (let alone as currently implemented), nobody is
stepping forward to redesign that stuff. In the meantime it behooves us
to make sure that index searches agree with the behavior of the underlying
operators. This patch fixes two problems in this area.
First, gist_box_same was using fuzzy equality, but it really needs to use
exact equality to prevent not-quite-identical upper index keys from being
treated as identical, which for example would prevent an existing upper
key from being extended by an amount less than epsilon. This would result
in inconsistent indexes. (The next release notes will need to recommend
that users reindex GiST indexes on boxes, polygons, circles, and points,
since all four opclasses use gist_box_same.)
Second, gist_point_consistent used exact comparisons for upper-page
comparisons in ~= searches, when it needs to use fuzzy comparisons to
ensure it finds all matches; and it used fuzzy comparisons for point <@ box
searches, when it needs to use exact comparisons because that's what the
<@ operator (rather inconsistently) does.
The added regression test cases illustrate all three misbehaviors.
Back-patch to all active branches. (8.4 did not have GiST point_ops,
but it still seems prudent to apply the gist_box_same patch to it.)
Alexander Korotkov, reviewed by Noah Misch
Commit af7914c6627bcf0b0ca614e9ce95d3f8056602bf, which added the TIMING
option to EXPLAIN, had an oversight: if the TIMING option is disabled
then control in InstrStartNode() goes through an elog(DEBUG2) call, which
typically does nothing but takes a noticeable amount of time to do it.
Tweak the logic to avoid that.
In HEAD, also change the elog(DEBUG2)'s in instrument.c to elog(ERROR).
It's not very clear why they weren't like that to begin with, but this
episode shows that not complaining more vociferously about misuse is
likely to do little except allow bugs to remain hidden.
While at it, adjust some code that was making possibly-dangerous
assumptions about flag bits being in the rightmost byte of the
instrument_options word.
Problem reported by Pavel Stehule (via Tomas Vondra).
When considering a non-last column in a multi-column GiST index,
gistsplit.c tries to improve on the split chosen by the opclass-specific
pickSplit function by considering penalties for the next column. However,
there were two bugs in this code: it failed to recompute the union keys for
the leftmost index columns, even though these might well change after
reassigning tuples; and it included the old union keys in the recomputation
for the columns it did recompute, so that those keys couldn't get smaller
even if they should. The first problem could result in an invalid index
in which searches wouldn't find index entries that are in fact present;
the second would make the index less efficient to search.
Both of these errors were caused by misuse of gistMakeUnionItVec, whose
API was designed in a way that just begged such errors to be made. There
is no situation in which it's safe or useful to compute the union keys for
a subset of the index columns, and there is no caller that wants any
previous union keys to be included in the computation; so the undocumented
choice to treat the union keys as in/out rather than pure output parameters
is a waste of code as well as being dangerous.
Hence, rather than just making a minimal patch, I've changed the API of
gistMakeUnionItVec to remove the "startkey" parameter (it now always
processes all index columns) and treat the attr/isnull arrays as purely
output parameters.
In passing, also get rid of a couple of unnecessary and dangerous uses
of static variables in gistutil.c. It's remarkable that the one in
gistMakeUnionKey hasn't given us portability troubles before now, because
in addition to posing a re-entrancy hazard, it was unsafely assuming that
a static char[] array would have at least Datum alignment.
Per investigation of a trouble report from Tomas Vondra. (There are also
some bugs in contrib/btree_gist to be fixed, but that seems like material
for a separate patch.) Back-patch to all supported branches.
Normally, we suppress sending a tabstats message to the collector unless
there were some actual table stats to send. However, during backend exit
we should force out the message if there are any transaction commit/abort
counts to send, else the session's last few commit/abort counts will never
get reported at all. We had logic for this, but the short-circuit test
at the top of pgstat_report_stat() ignored the "force" flag, with the
consequence that session-ending transactions that touched no database-local
tables would not get counted. Seems to be an oversight in my commit
641912b4d17fd214a5e5bae4e7bb9ddbc28b144b, which added the "force" flag.
That was back in 8.3, so back-patch to all supported versions.
This function was misdeclared to take cstring when it should take internal.
This at least allows crashing the server, and in principle an attacker
might be able to use the function to examine the contents of server memory.
The correct fix is to adjust the system catalog contents (and fix the
regression tests that should have caught this but failed to). However,
asking users to correct the catalog contents in existing installations
is a pain, so as a band-aid fix for the back branches, install a check
in enum_recv() to make it throw error if called with a cstring argument.
We will later revert this in HEAD in favor of correcting the catalogs.
Our thanks to Sumit Soni (via Secunia SVCRP) for reporting this issue.
Security: CVE-2013-0255