This reverts commit c11130690d6dca64267201a169cfb38c1adec5ef in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.
This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)
Diagnosed by me and independently by Andres Freund. Back-patch to all
branches supporting hot standby.
If wal_level = hot_standby we update the checkpoint nextxid,
though in the case where a wraparound occurred half-way through
a checkpoint we would neglect updating the epoch also. Updating
the nextxid is arguably the wrong thing to do, but changing that
may introduce subtle bugs into hot standby startup, while updating
the value doesn't cause any known bugs yet. Minimal fix now to
HEAD and backbranches, wider fix later in HEAD.
Bug reported in #6291 by Daniel Farina and slightly differently in
Cause analysis and recommended fixes from Tom Lane and Andres Freund.
Applied patch is minimal version of Andres Freund's work.
client encoding and the client encoding is not *safe* one. Such an
example is, file encoding is UTF-8 and client encoding SJIS. Patch
contributed by Jiang Guiqing.
When we do "make install" to create a temp installation, we don't want
that instance of make to try to communicate with any instance of make
that might be calling us. This is known to cause problems if the
upper make has a -jN flag, and in principle could cause problems even
without that. Unset the relevant environment variables to prevent such
issues.
Andres Freund
The previous coding worked as long as MODULEDIR wasn't set explicitly,
because we create sharedir/$(datamoduledir) and the default value of
that is "extension". But if some other value is specified for MODULEDIR
then the installation directory needed for the control file wasn't made.
Cédric Villemain
If we had not been holding buffer pin continuously since the tuple was
initially fetched by the UPDATE or DELETE query, it would be possible for
VACUUM or a page-prune operation to move the tuple while we're trying to
copy it. This would result in a garbage "old" tuple value being passed to
an AFTER ROW UPDATE or AFTER ROW DELETE trigger. The preconditions for
this are somewhat improbable, and the timing constraints are very tight;
so it's not so surprising that this hasn't been reported from the field,
even though the bug has been there a long time.
Problem found by Andres Freund. Back-patch to all active branches.
The length of a socket path name is constrained by the size of struct
sockaddr_un, and there's not a lot we can do about it since that is a
kernel API. However, it would be a good thing if we produced an
intelligible error message when the user specifies a socket path that's too
long --- and getaddrinfo's standard API is too impoverished to do this in
the natural way. So insert explicit tests at the places where we construct
a socket path name. Now you'll get an error that makes sense and even
tells you what the limit is, rather than something generic like
"Non-recoverable failure in name resolution".
Per trouble report from Jeremy Drake and a fix idea from Andrew Dunstan.
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, make the final state be indisvalid = true and indisready = false,
which is otherwise nonsensical. This is pretty ugly but we can't add
another column without forcing initdb, and it's too late for that in 9.2.
(There's a cleaner fix in HEAD.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
Portions of this need to be back-patched even further, but I'll work
on that separately.
Problem reported by Amit Kapila, diagnosis by Pavan Deolasee,
fix by Tom Lane and Andres Freund.
This was apparently a typo, which caused recovery to think that it
immediately reached the end of backup, and allowed the database to start
up too early.
Reported by Jeff Janes. Backpatch to 9.2, where this code was introduced.
This reverts commit d573e239f03506920938bf0be56c868d9c3416da, "Take fewer
snapshots". While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query. This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra. While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.
In a query such as "SELECT DISTINCT min(x) FROM tab", the DISTINCT is
pretty useless (there being only one output row), but nonetheless it
shouldn't fail. But it could fail if "tab" is an inheritance parent,
because planagg.c's code for fixing up equivalence classes after making the
index-optimized MIN/MAX transformation wasn't prepared to find child-table
versions of the aggregate expression. The least ugly fix seems to be
to add an option to mutate_eclass_expressions() to skip child-table
equivalence class members, which aren't used anymore at this stage of
planning so it's not really necessary to fix them. Since child members
are ignored in many cases already, it seems plausible for
mutate_eclass_expressions() to have an option to ignore them too.
Per bug #7703 from Maxim Boguk.
Back-patch to 9.1. Although the same code exists before that, it cannot
encounter child-table aggregates AFAICS, because the index optimization
transformation cannot succeed on inheritance trees before 9.1 (for lack
of MergeAppend).
XLogRecPtrIsInvalid() only checks the xrecoff field, which is correct when
checking if a WAL record could legally begin at the given position, but WAL
sending can legally be paused at a page boundary, in which case xrecoff is
0. Use XLByteEQ(..., InvalidXLogRecPtr) instead, which checks that both
xlogid and xrecoff are 0.
9.3 doesn't have this problem because XLogRecPtr is now a single 64-bit
integer, so XLogRecPtrIsInvalid() does the right thing. Apply to 9.2, and
9.1 where pg_stat_replication view was introduced.
Kyotaro HORIGUCHI, reviewed by Fujii Masao.
Since we've already chdir'd into the data directory, the file should
be referenced as just "postmaster.pid", without prefixing the directory
path. This is harmless in the normal case where an absolute PGDATA path
is used, but quite dangerous if a relative path is specified, since the
program might then fail to notice an active postmaster.
Reported by Hari Babu. This got broken in my commit
eb5949d190e80360386113fde0f05854f0c9824d, so patch all active versions.
When startup process opens a WAL segment after replaying part of it, it
validates the first page on the WAL segment, even though the page it's
really interested in later in the file. As part of the validation, it checks
that the TLI on the page header is >= the TLI it saw on the last page it
read. If the segment contains a timeline switch, and we have already
replayed it, and then re-open the WAL segment (because of streaming
replication got disconnected and reconnected, for example), the TLI check
will fail when the first page is validated. Fix that by relaxing the TLI
check when re-opening a WAL segment.
Backpatch to 9.0. Earlier versions had the same code, but before standby
mode was introduced in 9.0, recovery never tried to re-read a segment after
partially replaying it.
Reported by Amit Kapila, while testing a new feature.
Once we've received a shutdown signal (SIGINT or SIGTERM), we should not
launch any more child processes, even if we get signals requesting such.
The normal code path for spawning backends has always understood that,
but the postmaster's infrastructure for hot standby and autovacuum didn't
get the memo. As reported by Hari Babu in bug #7643, this could lead to
failure to shut down at all in some cases, such as when SIGINT is received
just before the startup process sends PMSIGNAL_RECOVERY_STARTED: we'd
launch a bgwriter and checkpointer, and then those processes would have no
idea that they ought to quit. Similarly, launching a new autovacuum worker
would result in waiting till it finished before shutting down.
Also, switch the order of the code blocks in reaper() that detect startup
process crash versus shutdown termination. Once we've sent it a signal,
we should not consider that exit(1) is surprising. This is just a cosmetic
fix since shutdown occurs correctly anyway, but better not to log a phony
complaint about startup process crash.
Back-patch to 9.0. Some parts of this might be applicable before that,
but given the lack of prior complaints I'm not going to worry too much
about older branches.
Some platforms throw an exception for this division, rather than returning
a necessarily-overflowed result. Since we were testing for overflow after
the fact, an exception isn't nice. We can avoid the problem by treating
division by -1 as negation.
Add some regression tests so that we'll find out if any compilers try to
optimize away the overflow check conditions.
Back-patch of commit 1f7cb5c30983752ff8de833de30afcaee63536d0.
Per discussion with Xi Wang, though this is different from the patch he
submitted.
The previous definitions of these GUC variables allowed them to range
up to INT_MAX, but in point of fact the underlying code would suffer
overflows or other errors with large values. Reduce the maximum values
to something that won't misbehave. There's no apparent value in working
harder than this, since very large delays aren't sensible for any of
these. (Note: the risk with archive_timeout is that if we're late
checking the state, the timestamp difference it's being compared to
might overflow. So we need some amount of slop; the choice of INT_MAX/2
is arbitrary.)
Per followup investigation of bug #7670. Although this isn't a very
significant fix, might as well back-patch.
We need to avoid calling WaitLatch with timeouts exceeding INT_MAX.
Fortunately a simple clamp will do the trick, since no harm is done if
the wait times out before it's really time to rotate the log file.
Per bug #7670 (probably bug #7545 is the same thing, too).
In passing, fix bogus definition of log_rotation_age's maximum value in
guc.c --- it was numerically right, but only because MINS_PER_HOUR and
SECS_PER_MINUTE have the same value.
Back-patch to 9.2. Before that, syslogger wasn't using WaitLatch.
Traditionally check_partial_indexes() has only looked at restriction
clauses while trying to prove partial indexes usable in queries. However,
join clauses can also be used in some cases; mainly, that a strict operator
on "x" proves an "x IS NOT NULL" index predicate, even if the operator is
in a join clause rather than a restriction clause. Adding this code fixes
a regression in 9.2, because previously we would take join clauses into
account when considering whether a partial index could be used in a
nestloop inner indexscan path. 9.2 doesn't handle nestloop inner
indexscans in the same way, and this consideration was overlooked in the
rewrite. Moving the work to check_partial_indexes() is a better solution
anyway, since the proof applies whether or not we actually use the index
in that particular way, and we don't have to do it over again for each
possible outer relation. Per report from Dave Cramer.
The correct answer for this (or any other case with arg2 = -1) is zero,
but some machines throw a floating-point exception instead of behaving
sanely. Commit f9ac414c35ea084ff70c564ab2c32adb06d5296f dealt with this
in int4mod, but overlooked the fact that it also happens in int8mod
(at least on my Linux x86_64 machine). Protect int2mod as well; it's
not clear whether any machines fail there (mine does not) but since the
test is so cheap it seems better safe than sorry. While at it, simplify
the original guard in int4mod: we need only check for arg2 == -1, we
don't need to check arg1 explicitly.
Xi Wang, with some editing by me.
record_out() leaks memory: it fails to free the strings returned by the
per-column output functions, and also is careless about detoasted values.
This results in a query-lifespan memory leakage when returning composite
values to the client, because printtup() runs the output functions in the
query-lifespan memory context. Fix it to handle these issues the same way
printtup() does. Also fix a similar leakage in record_send().
(At some point we might want to try to run output functions in
shorter-lived memory contexts, so that we don't need a zero-leakage policy
for them. But that would be a significantly more invasive patch, which
doesn't seem like material for back-patching.)
In passing, use appendStringInfoCharMacro instead of appendStringInfoChar
in the innermost data-copying loop of record_out, to try to shave a few
cycles from this function's runtime.
Per trouble report from Carlos Henrique Reimer. Back-patch to all
supported versions.
At commit all standby locks are released
for the top-level transaction, so searching
for locks for each subtransaction is both
pointless and costly (N^2) in the presence
of many AccessExclusiveLocks.
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
Since transformSetOperationTree() recurses, it can be driven to stack
overflow with enough UNION/INTERSECT/EXCEPT clauses in a query. Add a
check to ensure it fails cleanly instead of crashing. Per report from
Matthew Gerber (though it's not clear whether this is the only thing
going wrong for him).
Historical note: I think the reasoning behind not putting a check here in
the beginning was that the check in transformExpr() ought to be sufficient
to guard the whole parser. However, because transformSetOperationTree()
recurses all the way to the bottom of the set-operation tree before doing
any analysis of the statement's expressions, that check doesn't save it.
Some versions of the XSLT stylesheets don't handle the missing slash
correctly (they concatenate directory and file name without the slash).
This might never have worked correctly.
If the sleep is interrupted by a signal, we must recompute the remaining
time to wait; otherwise, a steady stream of non-wait-terminating interrupts
could delay return from WaitLatch indefinitely. This has been shown to be
a problem for the autovacuum launcher, and there may well be other places
now or in the future with similar issues. So we'd better make the function
robust, even though this'll add at least one gettimeofday call per wait.
Back-patch to 9.2. We might eventually need to fix 9.1 as well, but the
code is quite different there, and the usage of WaitLatch in 9.1 is so
limited that it's not clearly important to do so.
Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
heavily.
The trigger and rule cases need to split up the input name list, but
they mustn't corrupt the passed-in data structure, since it could be part
of a cached utility-statement parsetree. Per bug #7641.
Commit 4c9d0901 mistakenly introduced a call to
TransferPredicateLocksToHeapRelation() on an index relation that had
been closed a few lines above. Moving up an index_open() call that's
below is enough to fix the problem.
Discovered by me while testing an unrelated patch.
This case got broken in 8.4 by the addition of an error check that
complains if ALTER TABLE ONLY is used on a table that has children.
We do use ONLY for this situation, but it's okay because the necessary
recursion occurs at a higher level. So we need to have a separate
flag to suppress recursion without making the error check.
Reported and patched by Pavan Deolasee, with some editorial adjustments by
me. Back-patch to 8.4, since this is a regression of functionality that
worked in earlier branches.
I'm not sure why commit 1eb1dde049ccfffc42c80c2bcec14155c58bcc1f seems
to have made this start to fail on Cygwin when it never did before ---
but nonetheless, the coding was pretty bogus, and unlike the way we
handle $(X) anywhere else. Per buildfarm.
In bug #7626, Brian Dunavant exposes a performance problem created by
commit 3b8968f25232ad09001bf35ab4cc59f5a501193e: that commit attempted to
consider *all* possible combinations of indexable join clauses, but if said
clauses join to enough different relations, there's an exponential increase
in the number of outer-relation sets considered.
In Brian's example, all the clauses come from the same equivalence class,
which means it's redundant to use more than one of them in an indexscan
anyway. So we can prevent the problem in this class of cases (which is
probably the majority of real examples) by rejecting combinations that
would only serve to add a known-redundant clause.
But that still leaves us exposed to exponential growth of planning time
when the query has a lot of non-equivalence join clauses that are usable
with the same index. I chose to prevent such cases by setting an upper
limit on the number of relation sets considered, equal to ten times the
number of index clauses considered so far. (This sliding limit still
allows new relsets to be added on as we move to additional index columns,
which is probably more important than considering even more combinations of
clauses for the previous column.) This should keep the amount of work done
roughly linear rather than exponential in the apparent query complexity.
This part of the fix is pretty ad-hoc; but without a clearer idea of
real-world cases for which this would result in markedly inferior plans,
it's hard to see how to do better.
In its original conception, it was leaving some objects into the old
schema, but without their proper pg_depend entries; this meant that the
old schema could be dropped, causing future pg_dump calls to fail on the
affected database. This was originally reported by Jeff Frost as #6704;
there have been other complaints elsewhere that can probably be traced
to this bug.
To fix, be more consistent about altering a table's subsidiary objects
along the table itself; this requires some restructuring in how tables
are relocated when altering an extension -- hence the new
AlterTableNamespaceInternal routine which encapsulates it for both the
ALTER TABLE and the ALTER EXTENSION cases.
There was another bug lurking here, which was unmasked after fixing the
previous one: certain objects would be reached twice via the dependency
graph, and the second attempt to move them would cause the entire
operation to fail. Per discussion, it seems the best fix for this is to
do more careful tracking of objects already moved: we now maintain a
list of moved objects, to avoid attempting to do it twice for the same
object.
Authors: Alvaro Herrera, Dimitri Fontaine
Reviewed by Tom Lane
generate_base_implied_equalities_const() should prefer plain Consts over
other em_is_const eclass members when choosing the "pivot" value that
all the other members will be equated to. This makes it more likely that
the generated equalities will be useful in constraint-exclusion proofs.
Per report from Rushabh Lathia.
Represent a sequence's current value as a separate TableDataInfo dumpable
object, so that it can be dumped within the data section of the archive
rather than in pre-data. This fixes an undesirable inconsistency between
the meanings of "--data-only" and "--section=data", and also fixes dumping
of sequences that are marked as extension configuration tables, as per a
report from Marko Kreen back in July. The main cost is that we do one more
SQL query per sequence, but that's probably not very meaningful in most
databases.
Back-patch to 9.1, since it has the extension configuration issue even
though not the --section switch.
Views should not have any pg_attribute entries for system columns.
However, we forgot to remove such entries when converting a table to a
view. This could lead to crashes later on, if someone attempted to
reference such a column, as reported by Kohei KaiGai.
This problem is corrected properly in HEAD (by removing the pg_attribute
entries during conversion), but in the back branches we need to defend
against existing mis-converted views. This fix costs us an extra syscache
lookup per system column reference, which is annoying but probably not
really measurable in the big scheme of things.
For the non-concurrent case there is an AccessExclusiveLock lock
on both the index and the heap at a time during which no other
process is using either, before which the index is maintained and
used for scans, and after which the index is no longer used or
maintained. Predicate locks can safely be moved from the index to
the related heap relation under the protection of these locks.
This was done prior to the introductin of DROP INDEX CONCURRENTLY
and continues to be done for non-concurrent index drops.
For concurrent index drops, the predicate locks must be moved when
there are no index scans in progress on that index and no more can
subsequently start, and before heap inserts stop maintaining the
index. As long as these conditions are guaranteed when the
TransferPredicateLocksToHeapRelation() function is called,
stronger locks are not needed for correctness.
Kevin Grittner based on questions by Tom Lane in reviewing the
DROP INDEX CONCURRENTLY patch and in cooperation with Andres
Freund and Simon Riggs.
Back-patch of commit 4c9d0901f135d724a9f3cfa4140a5afd44b10f08
In commit 4317e0246c645f60c39e6572644cff1cb03b4c65, I accidentally broke
this behavior while rearranging code to ensure that --create wouldn't
affect whether a DATABASE entry gets put into archive-format output.
Thus, 9.2 would issue a DROP DATABASE command in --clean mode, which is
either useless or dangerous depending on the usage scenario.
It should not do that, and no longer does.
A bright spot is that this refactoring makes it easy to allow the
combination of --clean and --create to work sensibly, ie, emit DROP
DATABASE then CREATE DATABASE before reconnecting. Ordinarily we'd
consider that a feature addition and not back-patch it, but it seems
silly to not include the extra couple of lines required in the 9.2
version of the code.
Per report from Guillaume Lelarge, though this is slightly more extensive
than his proposed patch.
The code seems to have been written to handle the pre-parse-analysis
representation, where an ExecuteStmt would appear directly under
CreateTableAsStmt. But in reality the function is only run on
already-parse-analyzed statements, so there will be a Query node in
between. We'd not noticed the bug because the function is generally
not used at all except in extended query protocol.
Per report from Robert Haas and Rushabh Lathia.
An out-of-memory error during expand_table() on a palloc-based hash table
would leave a partially-initialized entry in the table. This would not be
harmful for transient hash tables, since they'd get thrown away anyway at
transaction abort. But for long-lived hash tables, such as the relcache
hash, this would effectively corrupt the table, leading to crash or other
misbehavior later.
To fix, rearrange the order of operations so that table enlargement is
attempted before we insert a new entry, rather than after adding it
to the hash table.
Problem discovered by Hitoshi Harada, though this is a bit different
from his proposed patch.