1
0
mirror of https://github.com/postgres/postgres.git synced 2025-09-02 04:21:28 +03:00
Commit Graph

6173 Commits

Author SHA1 Message Date
Andres Freund
c7299d32f6 Backport "Expose fsync_fname as a public API".
Backport commit cc52d5b33f back to 9.1
to allow backpatching some unlogged table fixes that use fsync_fname.
2014-11-15 01:20:29 +01:00
Heikki Linnakangas
861d3aa43d Fix race condition between hot standby and restoring a full-page image.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.

To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.

In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.

Backpatch to all supported versions; this has been racy since hot standby
was introduced.
2014-11-13 20:01:55 +02:00
Tom Lane
c2b4ed19f6 Explicitly support the case that a plancache's raw_parse_tree is NULL.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.

Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL.  Also make it clear in the
relevant comments that NULL is an expected case.

This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions.  Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases.  The caller has more context and is better
able to decide what the empty-query case ought to do.

Per followup discussion of bug #11335.  Back-patch to 9.2.  The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
2014-11-12 15:58:44 -05:00
Tom Lane
07ab4ec4cf Ensure that whole-row Vars produce nonempty column names.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
However, that doesn't work terribly well: you could get the column names
of the underlying table, or column aliases from any level of the query,
depending on minor details of the plan tree.  You could even get totally
empty field names, which is disastrous for cases like row_to_json().

It seems unwise to change this behavior too much in stable branches,
however, since users might not have noticed that they weren't getting
the least-unintuitive choice of field names.  Therefore, in the back
branches, only change the results when the child plan has returned an
actually-empty field name.  (We assume that can't happen with a named
rowtype, so this also dodges the issue of possibly producing RECORD-typed
output from a Var with a named composite result type.)  As in the sister
patch for HEAD, we can get a better name to use from the Var's
corresponding RTE.  There is no need to touch the RowExpr code since it
was already using a copy of the RTE's alias list for RECORD cases.

Back-patch as far as 9.2.  Before that we did not have row_to_json()
so there were no core functions potentially affected by bogus field
names.  While 9.1 and earlier do have contrib's hstore(record) which
is also affected, those versions don't seem to produce empty field names
(at least not in the known problem cases), so we'll leave them alone.
2014-11-10 15:21:20 -05:00
Tom Lane
e65b550b37 Test IsInTransactionChain, not IsTransactionBlock, in vac_update_relstats.
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context.  We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
2014-10-30 13:03:28 -04:00
Andres Freund
d7624e5621 Flush unlogged table's buffers when copying or moving databases.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.

That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.

This was reported in bug #10675 by Maxim Boguk.

Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.

Backpatch to 9.1 where unlogged tables were introduced.
2014-10-20 23:45:31 +02:00
Tom Lane
7804f350ee Declare mkdtemp() only if we're providing it.
Follow our usual style of providing an "extern" for a standard library
function only when we're also providing the implementation.  This avoids
issues when the system headers declare the function slightly differently
than we do, as noted by Caleb Welton.

We might have to go to the extent of probing to see if the system headers
declare the function, but let's not do that until it's demonstrated to be
necessary.

Oversight in commit 9e6b1bf258.  Back-patch
to all supported branches, as that was.
2014-10-17 22:55:27 -04:00
Tom Lane
137e7c1644 Support timezone abbreviations that sometimes change.
Up to now, PG has assumed that any given timezone abbreviation (such as
"EDT") represents a constant GMT offset in the usage of any particular
region; we had a way to configure what that offset was, but not for it
to be changeable over time.  But, as with most things horological, this
view of the world is too simplistic: there are numerous regions that have
at one time or another switched to a different GMT offset but kept using
the same timezone abbreviation.  Almost the entire Russian Federation did
that a few years ago, and later this month they're going to do it again.
And there are similar examples all over the world.

To cope with this, invent the notion of a "dynamic timezone abbreviation",
which is one that is referenced to a particular underlying timezone
(as defined in the IANA timezone database) and means whatever it currently
means in that zone.  For zones that use or have used daylight-savings time,
the standard and DST abbreviations continue to have the property that you
can specify standard or DST time and get that time offset whether or not
DST was theoretically in effect at the time.  However, the abbreviations
mean what they meant at the time in question (or most recently before that
time) rather than being absolutely fixed.

The standard abbreviation-list files have been changed to use this behavior
for abbreviations that have actually varied in meaning since 1970.  The
old simple-numeric definitions are kept for abbreviations that have not
changed, since they are a bit faster to resolve.

While this is clearly a new feature, it seems necessary to back-patch it
into all active branches, because otherwise use of Russian zone
abbreviations is going to become even more problematic than it already was.
This change supersedes the changes in commit 513d06ded et al to modify the
fixed meanings of the Russian abbreviations; since we've not shipped that
yet, this will avoid an undesirably incompatible (not to mention incorrect)
change in behavior for timestamps between 2011 and 2014.

This patch makes some cosmetic changes in ecpglib to keep its usage of
datetime lookup tables as similar as possible to the backend code, but
doesn't do anything about the increasingly obsolete set of timezone
abbreviation definitions that are hard-wired into ecpglib.  Whatever we
do about that will likely not be appropriate material for back-patching.
Also, a potential free() of a garbage pointer after an out-of-memory
failure in ecpglib has been fixed.

This patch also fixes pre-existing bugs in DetermineTimeZoneOffset() that
caused it to produce unexpected results near a timezone transition, if
both the "before" and "after" states are marked as standard time.  We'd
only ever thought about or tested transitions between standard and DST
time, but that's not what's happening when a zone simply redefines their
base GMT offset.

In passing, update the SGML documentation to refer to the Olson/zoneinfo/
zic timezone database as the "IANA" database, since it's now being
maintained under the auspices of IANA.
2014-10-16 15:22:17 -04:00
Tom Lane
b2b95de61e Fix some more problems with nested append relations.
As of commit a87c72915 (which later got backpatched as far as 9.1),
we're explicitly supporting the notion that append relations can be
nested; this can occur when UNION ALL constructs are nested, or when
a UNION ALL contains a table with inheritance children.

Bug #11457 from Nelson Page, as well as an earlier report from Elvis
Pranskevichus, showed that there were still nasty bugs associated with such
cases: in particular the EquivalenceClass mechanism could try to generate
"join" clauses connecting an appendrel child to some grandparent appendrel,
which would result in assertion failures or bogus plans.

Upon investigation I concluded that all current callers of
find_childrel_appendrelinfo() need to be fixed to explicitly consider
multiple levels of parent appendrels.  The most complex fix was in
processing of "broken" EquivalenceClasses, which are ECs for which we have
been unable to generate all the derived equality clauses we would like to
because of missing cross-type equality operators in the underlying btree
operator family.  That code path is more or less entirely untested by
the regression tests to date, because no standard opfamilies have such
holes in them.  So I wrote a new regression test script to try to exercise
it a bit, which turned out to be quite a worthwhile activity as it exposed
existing bugs in all supported branches.

The present patch is essentially the same as far back as 9.2, which is
where parameterized paths were introduced.  In 9.0 and 9.1, we only need
to back-patch a small fragment of commit 5b7b5518d, which fixes failure to
propagate out the original WHERE clauses when a broken EC contains constant
members.  (The regression test case results show that these older branches
are noticeably stupider than 9.2+ in terms of the quality of the plans
generated; but we don't really care about plan quality in such cases,
only that the plan not be outright wrong.  A more invasive fix in the
older branches would not be a good idea anyway from a plan-stability
standpoint.)
2014-10-01 19:30:30 -04:00
Andres Freund
855cabb6f5 Mark x86's memory barrier inline assembly as clobbering the cpu flags.
x86's memory barrier assembly was marked as clobbering "memory" but
not "cc" even though 'addl' sets various flags. As it turns out gcc on
x86 implicitly assumes "cc" on every inline assembler statement, so
it's not a bug. But as that's poorly documented and might get copied
to architectures or compilers where that's not the case, it seems
better to be precise.

Discussion: 20140919100016.GH4277@alap3.anarazel.de

To keep the code common, backpatch to 9.2 where explicit memory
barriers were introduced.
2014-09-19 17:13:52 +02:00
Andres Freund
bca91236a7 Fix typo in solaris spinlock fix.
07968dbfaa missed part of the S_UNLOCK define when building for
sparcv8+.
2014-09-09 23:37:33 +02:00
Andres Freund
27ef6b6530 Fix spinlock implementation for some !solaris sparc platforms.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.

Discussion: 20140630222854.GW26930@awork2.anarazel.de

Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
2014-09-09 23:37:33 +02:00
Heikki Linnakangas
a2a718b223 Treat 2PC commit/abort the same as regular xacts in recovery.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:

* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)

The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.

Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
2014-07-29 11:57:52 +03:00
Tom Lane
f7ba173cb3 Stamp 9.3.5. 2014-07-21 15:10:42 -04:00
Alvaro Herrera
9a28c3752c Have multixact be truncated by checkpoint, not vacuum
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time.  The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.

Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum.  Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.

Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.

While at it, update some comments that hadn't been updated since
multixacts were changed.

Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
2014-06-27 14:43:52 -04:00
Tom Lane
c1f8fb9bfb Avoid leaking memory while evaluating arguments for a table function.
ExecMakeTableFunctionResult evaluated the arguments for a function-in-FROM
in the query-lifespan memory context.  This is insignificant in simple
cases where the function relation is scanned only once; but if the function
is in a sub-SELECT or is on the inside of a nested loop, any memory
consumed during argument evaluation can add up quickly.  (The potential for
trouble here had been foreseen long ago, per existing comments; but we'd
not previously seen a complaint from the field about it.)  To fix, create
an additional temporary context just for this purpose.

Per an example from MauMau.  Back-patch to all active branches.
2014-06-19 22:13:47 -04:00
Noah Misch
841baf28f8 Add mkdtemp() to libpgport.
This function is pervasive on free software operating systems; import
NetBSD's implementation.  Back-patch to 8.4, like the commit that will
harness it.
2014-06-14 09:41:17 -04:00
Alvaro Herrera
167a2535f8 Wrap multixact/members correctly during extension, take 2
In a50d976254 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.

As reported by Serge Negodyuck in a followup to bug #8673.
2014-06-09 15:17:23 -04:00
Tom Lane
8960b2db51 Fix unportable setvbuf() usage in initdb.
In yesterday's commit 2dc4f011fd, I tried
to force buffering of stdout/stderr in initdb to be what it is by
default when the program is run interactively on Unix (since that's how
most manual testing is done).  This tripped over the fact that Windows
doesn't support _IOLBF mode.  We dealt with that a long time ago in
syslogger.c by falling back to unbuffered mode on Windows.  Export that
solution in port.h and use it in initdb.

Back-patch to 8.4, like the previous commit.
2014-05-15 15:57:57 -04:00
Heikki Linnakangas
48fc8962f9 Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.

To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.

Backpatch to all supported versions.
2014-05-15 16:58:02 +03:00
Bruce Momjian
04e15c69d2 Remove tabs after spaces in C comments
This was not changed in HEAD, but will be done later as part of a
pgindent run.  Future pgindent runs will also do this.

Report by Tom Lane

Backpatch through all supported branches, but not HEAD
2014-05-06 11:26:28 -04:00
Tom Lane
b72e90bc31 Fix failure to detoast fields in composite elements of structured types.
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row.  The same applies for ranges over composite types, nested
composites, etc.  However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites.  For example, given a command such
as

UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...

where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column.  If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced.  A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.

Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().

I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out).  The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.

This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple.  This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first.  With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.

The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before.  On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings.  Experimentation suggests that common SQL
coding patterns are unaffected either way, though.  Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.

In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.

Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled.  The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.

This bug has existed since TOAST was invented, so back-patch to all
supported branches.
2014-05-01 15:19:10 -04:00
Tom Lane
36825f38dd Can't completely get rid of #ifndef FRONTEND in palloc.h :-(
pg_controldata includes postgres.h not postgres_fe.h, so utils/palloc.h
must be able to compile in a "#define FRONTEND" context.  It appears that
Solaris Studio is smart enough to persuade us to define PG_USE_INLINE,
but not smart enough to not make a copy of unreferenced static functions;
which leads to an unsatisfied reference to CurrentMemoryContext.  So we
need an #ifndef FRONTEND around that declaration.  Per buildfarm.
2014-04-27 21:24:30 -04:00
Tom Lane
e4c1a496f2 Don't #include utils/palloc.h in common/fe_memutils.h.
This breaks the principle that common/ ought not depend on anything in the
server, not only code-wise but in the headers.  The only arguable advantage
is avoidance of duplication of half a dozen extern declarations, and even
that is rather dubious, considering that the previous coding was wrong
about which declarations to duplicate: it exposed pnstrdup() to frontend
code even though no such function is provided in fe_memutils.c.

On the same principle, don't #include utils/memutils.h in the frontend
build of psprintf.c.  This requires duplicating the definition of
MaxAllocSize, but that seems fine to me: there's no a-priori reason why
frontend code should use the same size limit as the backend anyway.

In passing, clean up some rather odd layout and ordering choices that
were imposed on palloc.h to reduce the number of #ifdefs required by
the previous approach.

Per gripe from Christoph Berg.  There's still more work to do to make
include/common/ clean, but this part seems reasonably noncontroversial.
2014-04-26 14:14:30 -04:00
Alvaro Herrera
c0bd128c81 Fix race when updating a tuple concurrently locked by another process
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.

The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple.  But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same.  This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.

This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.

To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.

Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates.  This should protect
against other bugs that might cause similar bogus situations.

Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced.  (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything.  Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)

Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com

Bug analysis by Andres Freund.
2014-04-24 15:41:55 -03:00
Tom Lane
cb651b624d Fix incorrect pg_proc.proallargtypes entries for two built-in functions.
pg_sequence_parameters() and pg_identify_object() have had incorrect
proallargtypes entries since 9.1 and 9.3 respectively.  This was mostly
masked by the correct information in proargtypes, but a few operations
such as pg_get_function_arguments() (and thus psql's \df display) would
show the wrong data types for these functions' input parameters.

In HEAD, fix the wrong info, bump catversion, and add an opr_sanity
regression test to catch future mistakes of this sort.

In the back branches, just fix the wrong info so that installations
initdb'd with future minor releases will have the right data.  We
can't force an initdb, and it doesn't seem like a good idea to add
a regression test that will fail on existing installations.

Andres Freund
2014-04-23 21:21:08 -04:00
Tom Lane
d359f71ac0 Fix non-equivalence of VARIADIC and non-VARIADIC function call formats.
For variadic functions (other than VARIADIC ANY), the syntaxes foo(x,y,...)
and foo(VARIADIC ARRAY[x,y,...]) should be considered equivalent, since the
former is converted to the latter at parse time.  They have indeed been
equivalent, in all releases before 9.3.  However, commit 75b39e790 made an
ill-considered decision to record which syntax had been used in FuncExpr
nodes, and then to make equal() test that in checking node equality ---
which caused the syntaxes to not be seen as equivalent by the planner.
This is the underlying cause of bug #9817 from Dmitry Ryabov.

It might seem that a quick fix would be to make equal() disregard
FuncExpr.funcvariadic, but the same commit made that untenable, because
the field actually *is* semantically significant for some VARIADIC ANY
functions.  This patch instead adopts the approach of redefining
funcvariadic (and aggvariadic, in HEAD) as meaning that the last argument
is a variadic array, whether it got that way by parser intervention or was
supplied explicitly by the user.  Therefore the value will always be true
for non-ANY variadic functions, restoring the principle of equivalence.
(However, the planner will continue to consider use of VARIADIC as a
meaningful difference for VARIADIC ANY functions, even though some such
functions might disregard it.)

In HEAD, this change lets us simplify the decompilation logic in
ruleutils.c, since the funcvariadic/aggvariadic flag tells directly whether
to print VARIADIC.  However, in 9.3 we have to continue to cope with
existing stored rules/views that might contain the previous definition.
Fortunately, this just means no change in ruleutils.c, since its existing
behavior effectively ignores funcvariadic for all cases other than VARIADIC
ANY functions.

In HEAD, bump catversion to reflect the fact that FuncExpr.funcvariadic
changed meanings; this is sort of pro forma, since I don't believe any
built-in views are affected.

Unfortunately, this patch doesn't magically fix everything for affected
9.3 users.  After installing 9.3.5, they might need to recreate their
rules/views/indexes containing variadic function calls in order to get
everything consistent with the new definition.  As in the cited bug,
the symptom of a problem would be failure to use a nominally matching
index that has a variadic function call in its definition.  We'll need
to mention this in the 9.3.5 release notes.
2014-04-03 22:02:27 -04:00
Heikki Linnakangas
767fc1c520 Avoid palloc in critical section in GiST WAL-logging.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.

There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.

The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
2014-04-03 15:44:42 +03:00
Tom Lane
65183fb78f Fix assorted issues in client host name lookup.
The code for matching clients to pg_hba.conf lines that specify host names
(instead of IP address ranges) failed to complain if reverse DNS lookup
failed; instead it silently didn't match, so that you might end up getting
a surprising "no pg_hba.conf entry for ..." error, as seen in bug #9518
from Mike Blackwell.  Since we don't want to make this a fatal error in
situations where pg_hba.conf contains a mixture of host names and IP
addresses (clients matching one of the numeric entries should not have to
have rDNS data), remember the lookup failure and mention it as DETAIL if
we get to "no pg_hba.conf entry".  Apply the same approach to forward-DNS
lookup failures, too, rather than treating them as immediate hard errors.

Along the way, fix a couple of bugs that prevented us from detecting an
rDNS lookup error reliably, and make sure that we make only one rDNS lookup
attempt; formerly, if the lookup attempt failed, the code would try again
for each host name entry in pg_hba.conf.  Since more or less the whole
point of this design is to ensure there's only one lookup attempt not one
per entry, the latter point represents a performance bug that seems
sufficient justification for back-patching.

Also, adjust src/port/getaddrinfo.c so that it plays as well as it can
with this code.  Which is not all that well, since it does not have actual
support for rDNS lookup, but at least it should return the expected (and
required by spec) error codes so that the main code correctly perceives the
lack of functionality as a lookup failure.  It's unlikely that PG is still
being used in production on any machines that require our getaddrinfo.c,
so I'm not excited about working harder than this.

To keep the code in the various branches similar, this includes
back-patching commits c424d0d105 and
1997f34db4 into 9.2 and earlier.

Back-patch to 9.1 where the facility for hostnames in pg_hba.conf was
introduced.
2014-04-02 17:11:27 -04:00
Tom Lane
d4f8dde3c1 Stamp 9.3.4. 2014-03-17 15:35:47 -04:00
Heikki Linnakangas
4f91af8ca2 Fix dangling smgr_owner pointer when a fake relcache entry is freed.
A fake relcache entry can "own" a SmgrRelation object, like a regular
relcache entry. But when it was free'd, the owner field in SmgrRelation
was not cleared, so it was left pointing to free'd memory.

Amazingly this apparently hasn't caused crashes in practice, or we would've
heard about it earlier. Andres found this with Valgrind.

Report and fix by Andres Freund, with minor modifications by me. Backpatch
to all supported versions.
2014-03-07 13:29:24 +02:00
Tom Lane
f557826f8d Avoid getting more than AccessShareLock when deparsing a query.
In make_ruledef and get_query_def, we have long used AcquireRewriteLocks
to ensure that the querytree we are about to deparse is up-to-date and
the schemas of the underlying relations aren't changing.  Howwever, that
function thinks the query is about to be executed, so it acquires locks
that are stronger than necessary for the purpose of deparsing.  Thus for
example, if pg_dump asks to deparse a rule that includes "INSERT INTO t",
we'd acquire RowExclusiveLock on t.  That results in interference with
concurrent transactions that might for example ask for ShareLock on t.
Since pg_dump is documented as being purely read-only, this is unexpected.
(Worse, it used to actually be read-only; this behavior dates back only
to 8.1, cf commit ba4200246.)

Fix this by adding a parameter to AcquireRewriteLocks to tell it whether
we want the "real" execution locks or only AccessShareLock.

Report, diagnosis, and patch by Dean Rasheed.  Back-patch to all supported
branches.
2014-03-06 19:31:09 -05:00
Tom Lane
f5f21315d2 Allow regex operations to be terminated early by query cancel requests.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time.  Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.

If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library.  But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.

A bigger issue is that there may be more places than these two where
we need to check for cancels.  We can always add more checks later,
now that the infrastructure is in place.

Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
2014-03-01 15:21:00 -05:00
Tom Lane
0691fe5047 Stamp 9.3.3. 2014-02-17 14:29:55 -05:00
Tom Lane
6fc2868e4f PGDLLIMPORT-ify MyBgworkerEntry.
This was done in HEAD in commit 7d7eee8bb7,
but 9.3 needs it too for contrib/worker_spi.  Per buildfarm member narwhal.
2014-02-17 11:29:34 -05:00
Noah Misch
7a362a176a Predict integer overflow to avoid buffer overruns.
Several functions, mostly type input functions, calculated an allocation
size such that the calculation wrapped to a small positive value when
arguments implied a sufficiently-large requirement.  Writes past the end
of the inadvertent small allocation followed shortly thereafter.
Coverity identified the path_in() vulnerability; code inspection led to
the rest.  In passing, add check_stack_depth() to prevent stack overflow
in related functions.

Back-patch to 8.4 (all supported versions).  The non-comment hstore
changes touch code that did not exist in 8.4, so that part stops at 9.0.

Noah Misch and Heikki Linnakangas, reviewed by Tom Lane.

Security: CVE-2014-0064
2014-02-17 09:33:32 -05:00
Noah Misch
e4a4fa2235 Fix handling of wide datetime input/output.
Many server functions use the MAXDATELEN constant to size a buffer for
parsing or displaying a datetime value.  It was much too small for the
longest possible interval output and slightly too small for certain
valid timestamp input, particularly input with a long timezone name.
The long input was rejected needlessly; the long output caused
interval_out() to overrun its buffer.  ECPG's pgtypes library has a copy
of the vulnerable functions, which bore the same vulnerabilities along
with some of its own.  In contrast to the server, certain long inputs
caused stack overflow rather than failing cleanly.  Back-patch to 8.4
(all supported versions).

Reported by Daniel Schüssler, reviewed by Tom Lane.

Security: CVE-2014-0063
2014-02-17 09:33:32 -05:00
Robert Haas
e1e0a4d791 Avoid repeated name lookups during table and index DDL.
If the name lookups come to different conclusions due to concurrent
activity, we might perform some parts of the DDL on a different table
than other parts.  At least in the case of CREATE INDEX, this can be
used to cause the permissions checks to be performed against a
different table than the index creation, allowing for a privilege
escalation attack.

This changes the calling convention for DefineIndex, CreateTrigger,
transformIndexStmt, transformAlterTableStmt, CheckIndexCompatible
(in 9.2 and newer), and AlterTable (in 9.1 and older).  In addition,
CheckRelationOwnership is removed in 9.2 and newer and the calling
convention is changed in older branches.  A field has also been added
to the Constraint node (FkConstraint in 8.4).  Third-party code calling
these functions or using the Constraint node will require updating.

Report by Andres Freund.  Patch by Robert Haas and Andres Freund,
reviewed by Tom Lane.

Security: CVE-2014-0062
2014-02-17 09:33:32 -05:00
Noah Misch
fc4a04a3c4 Prevent privilege escalation in explicit calls to PL validators.
The primary role of PL validators is to be called implicitly during
CREATE FUNCTION, but they are also normal functions that a user can call
explicitly.  Add a permissions check to each validator to ensure that a
user cannot use explicit validator calls to achieve things he could not
otherwise achieve.  Back-patch to 8.4 (all supported versions).
Non-core procedural language extensions ought to make the same two-line
change to their own validators.

Andres Freund, reviewed by Tom Lane and Noah Misch.

Security: CVE-2014-0061
2014-02-17 09:33:32 -05:00
Tom Lane
d5a43a238e PGDLLIMPORT'ify DateStyle and IntervalStyle.
This is needed on Windows to support contrib/postgres_fdw.  Although it's
been broken since last March, we didn't notice until recently because there
were no active buildfarm members that complained about missing PGDLLIMPORT
marking.  Efforts are underway to improve that situation, in support of
which we're delaying fixing some other cases of global variables that
should be marked PGDLLIMPORT.  However, this case affects 9.3, so we
can't wait any longer to fix it.

I chose to mark DateOrder as well, though it's not strictly necessary
for postgres_fdw.
2014-02-16 12:37:10 -05:00
Alvaro Herrera
fb47de2be6 Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.

Therefore, we now have multixact-specific freezing parameters:

vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)

vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids).  Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.

autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids).  This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.

Backpatch to 9.3 where multixacts were made to persist enough to require
freezing.  To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.

Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 19:30:30 -03:00
Andrew Dunstan
56c08df55b Enable building with Visual Studion 2013.
Backpatch to 9.3.

Brar Piening.
2014-01-26 09:45:43 -05:00
Tom Lane
ebde6c4014 Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted.  (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.)  In Hot Standby,
the WAL replay process must likewise lock every leaf page.  There were
several bugs in the code for that:

* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error.  This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages".  To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.

* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about).  Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.

* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page.  However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.

The first two of these bugs were diagnosed by Andres Freund, the other one
by me.  Fixes based on ideas from Heikki Linnakangas and myself.

This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-14 17:34:51 -05:00
Tom Lane
27ff4cfe76 Disallow LATERAL references to the target table of an UPDATE/DELETE.
On second thought, commit 0c051c9008 was
over-hasty: rather than allowing this case, we ought to reject it for now.
That leaves the field clear for a future feature that allows the target
table to be re-specified in the FROM (or USING) clause, which will enable
left-joining the target table to something else.  We can then also allow
LATERAL references to such an explicitly re-specified target table.
But allowing them right now will create ambiguities or worse for such a
feature, and it isn't something we documented 9.3 as supporting.

While at it, add a convenience subroutine to avoid having several copies
of the ereport for disalllowed-LATERAL-reference cases.
2014-01-11 19:03:15 -05:00
Alvaro Herrera
f40e79173a Handle wraparound during truncation in multixact/members
In pg_multixact/members, relying on modulo-2^32 arithmetic for
wraparound handling doesn't work all that well.  Because we don't
explicitely track wraparound of the allocation counter for members, it
is possible that the "live" area exceeds 2^31 entries; trying to remove
SLRU segments that are "old" according to the original logic might lead
to removal of segments still in use.  To fix, have the truncation
routine use a tailored SlruScanDirectory callback that keeps track of
the live area in actual use; that way, when the live range exceeds 2^31
entries, the oldest segments still live will not get removed untimely.

This new SlruScanDir callback needs to take care not to remove segments
that are "in the future": if new SLRU segments appear while the
truncation is ongoing, make sure we don't remove them.  This requires
examination of shared memory state to recheck for false positives, but
testing suggests that this doesn't cause a problem.  The original coding
didn't suffer from this pitfall because segments created when truncation
is running are never considered to be removable.

Per Andres Freund's investigation of bug #8673 reported by Serge
Negodyuck.
2014-01-02 18:16:54 -03:00
Alvaro Herrera
db1014bc46 Don't ignore tuple locks propagated by our updates
If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple.  This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).

This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134 it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.

It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows are missing with
some referencing row remaining.

Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
2013-12-18 13:31:27 -03:00
Alvaro Herrera
8e9a16ab8f Rework tuple freezing protocol
Tuple freezing was broken in connection to MultiXactIds; commit
8e53ae025d tried to fix it, but didn't go far enough.  As noted by
Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted
update might cause locks in the multi to go ignored by later
transactions.  This is because the code depended on a multixact above
their cutoff point not having any lock-only member older than the cutoff
point for Xids, which is easily defeated in READ COMMITTED transactions.

The fix for this involves creating a new MultiXactId when necessary.
But this cannot be done during WAL replay, and moreover multixact
examination requires using CLOG access routines which are not supposed
to be used during WAL replay either; so tuple freezing cannot be done
with the old freeze WAL record.  Therefore, separate the freezing
computation from its execution, and change the WAL record to carry all
necessary information.  At WAL replay time, it's easy to re-execute
freezing because we don't need to re-compute the new infomask/Xmax
values but just take them from the WAL record.

While at it, restructure the coding to ensure all page changes occur in
a single critical section without much room for failures.  The previous
coding wasn't using a critical section, without any explanation as to
why this was acceptable.

In replication scenarios using the 9.3 branch, standby servers must be
upgraded before their master, so that they are prepared to deal with the
new WAL record once the master is upgraded; failure to do so will cause
WAL replay to die with a PANIC message.  Later upgrade of the standby
will allow the process to continue where it left off, so there's no
disruption of the data in the standby in any case.  Standbys know how to
deal with the old WAL record, so it's okay to keep the master running
the old code for a while.

In master, the old freeze WAL record is gone, for cleanliness' sake;
there's no compatibility concern there.

Backpatch to 9.3, where the original bug was introduced and where the
previous fix was backpatched.

Álvaro Herrera and Andres Freund
2013-12-16 11:29:51 -03:00
Tom Lane
05ec931add Stamp 9.3.2. 2013-12-02 15:57:48 -05:00
Alvaro Herrera
215ac4ad65 Truncate pg_multixact/'s contents during crash recovery
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway.  Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it.  This has made not truncating the content of
pg_multixact/ not defensible anymore.

To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().

At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.

Back-patch to 9.0.  8.4 also has the problem, but since there's no hot
standby there, it's much less pressing.  In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required.  Add appropriate checks to make sure that's not
happening.

Andres Freund
2013-11-29 21:48:15 -03:00
Alvaro Herrera
f5f92bdc44 Fix full-table-vacuum request mechanism for MultiXactIds
While autovacuum dutifully launched anti-multixact-wraparound vacuums
when the multixact "age" was reached, the vacuum code was not aware that
it needed to make them be full table vacuums.  As the resulting
partial-table vacuums aren't capable of actually increasing relminmxid,
autovacuum continued to launch anti-wraparound vacuums that didn't have
the intended effect, until age of relfrozenxid caused the vacuum to
finally be a full table one via vacuum_freeze_table_age.

To fix, introduce logic for multixacts similar to that for plain
TransactionIds, using the same GUCs.

Backpatch to 9.3, where permanent MultiXactIds were introduced.

Andres Freund, some cleanup by Álvaro
2013-11-29 21:48:11 -03:00