Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions). That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.
We could still lose control if an asynchronous SIGINT arrives just as the
function is entered. This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction. We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.
Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return. When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely. Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.
These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored. So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues. (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)
Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.
Per reports from Dan Wood.
Back-patch to 9.3 where the new timeout handling infrastructure was
introduced. We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
When acquiring a lock in fast-path mode, we must reset the locallock
object's lock and proclock fields to NULL. They are not necessarily that
way to start with, because the locallock could be left over from a failed
lock acquisition attempt earlier in the transaction. Failure to do this
led to all sorts of interesting misbehaviors when LockRelease tried to
clean up no-longer-related lock and proclock objects in shared memory.
Per report from Dan Wood.
In passing, modify LockRelease to elog not just Assert if it doesn't find
lock and proclock objects for a formerly fast-path lock, matching the code
in FastPathGetRelationLockEntry and LockRefindAndRelease. This isn't a
bug but it will help in diagnosing any future bugs in this area.
Also, modify FastPathTransferRelationLocks and FastPathGetRelationLockEntry
to break out of their loops over the fastpath array once they've found the
sole matching entry. This was inconsistently done in some search loops
and not others.
Improve assorted related comments, too.
Back-patch to 9.2 where the fast-path mechanism was introduced.
The previous coding labeled expressions such as pg_index.indkey[1:3] as
being of int2vector type; which is not right because the subscript bounds
of such a result don't, in general, satisfy the restrictions of int2vector.
To fix, implicitly promote the result of slicing int2vector to int2[],
or oidvector to oid[]. This is similar to what we've done with domains
over arrays, which is a good analogy because these types are very much
like restricted domains of the corresponding regular-array types.
A side-effect is that we now also forbid array-element updates on such
columns, eg while "update pg_index set indkey[4] = 42" would have worked
before if you were superuser (and corrupted your catalogs irretrievably,
no doubt) it's now disallowed. This seems like a good thing since, again,
some choices of subscripting would've led to results not satisfying the
restrictions of int2vector. The case of an array-slice update was
rejected before, though with a different error message than you get now.
We could make these cases work in future if we added a cast from int2[]
to int2vector (with a cast function checking the subscript restrictions)
but it seems unlikely that there's any value in that.
Per report from Ronan Dunklau. Back-patch to all supported branches
because of the crash risks involved.
Bug #8591 from Claudio Freire demonstrates that get_eclass_for_sort_expr
must be able to compute valid em_nullable_relids for any new equivalence
class members it creates. I'd worried about this in the commit message
for db9f0e1d9a, but claimed that it wasn't a
problem because multi-member ECs should already exist when it runs. That
is transparently wrong, though, because this function is also called by
initialize_mergeclause_eclasses, which runs during deconstruct_jointree.
The example given in the bug report (which the new regression test item
is based upon) fails because the COALESCE() expression is first seen by
initialize_mergeclause_eclasses rather than process_equivalence.
Fixing this requires passing the appropriate nullable_relids set to
get_eclass_for_sort_expr, and it requires new code to compute that set
for top-level expressions such as ORDER BY, GROUP BY, etc. We store
the top-level nullable_relids in a new field in PlannerInfo to avoid
computing it many times. In the back branches, I've added the new
field at the end of the struct to minimize ABI breakage for planner
plugins. There doesn't seem to be a good alternative to changing
get_eclass_for_sort_expr's API signature, though. There probably aren't
any third-party extensions calling that function directly; moreover,
if there are, they probably need to think about what to pass for
nullable_relids anyway.
Back-patch to 9.2, like the previous patch in this area.
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.
To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.
Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.
Backpatch to all supported versions.
Formerly, when using a SQL-spec timezone setting with a fixed GMT offset
(called a "brute force" timezone in the code), the session_timezone
variable was not updated to match the nominal timezone; rather, all code
was expected to ignore session_timezone if HasCTZSet was true. This is
of course obviously fragile, though a search of the code finds only
timeofday() failing to honor the rule. A bigger problem was that
DetermineTimeZoneOffset() supposed that if its pg_tz parameter was
pointer-equal to session_timezone, then HasCTZSet should override the
parameter. This would cause datetime input containing an explicit zone
name to be treated as referencing the brute-force zone instead, if the
zone name happened to match the session timezone that had prevailed
before installing the brute-force zone setting (as reported in bug #8572).
The same malady could affect AT TIME ZONE operators.
To fix, set up session_timezone so that it matches the brute-force zone
specification, which we can do using the POSIX timezone definition syntax
"<abbrev>offset", and get rid of the bogus lookaside check in
DetermineTimeZoneOffset(). Aside from fixing the erroneous behavior in
datetime parsing and AT TIME ZONE, this will cause the timeofday() function
to print its result in the user-requested time zone rather than some
previously-set zone. It might also affect results in third-party
extensions, if there are any that make use of session_timezone without
considering HasCTZSet, but in all cases the new behavior should be saner
than before.
Back-patch to all supported branches.
If a tuple was frozen while its predicate locks mattered,
read-write dependencies could be missed, resulting in failure to
detect conflicts which could lead to anomalies in committed
serializable transactions.
This field was added to the tag when we still thought that it was
necessary to carry locks forward to a new version of an updated
row. That was later proven to be unnecessary, which allowed
simplification of the code, but elimination of xmin from the tag
was missed at the time.
Per report and analysis by Heikki Linnakangas.
Backpatch to 9.1.
It seems to make more sense to use "cutoff multixact" terminology
throughout the backend code; "freeze" is associated with replacing of an
Xid with FrozenTransactionId, which is not what we do for MultiXactIds.
Andres Freund
Some adjustments by Álvaro Herrera
Failing to do so can cause queries to return wrong data, error out or crash.
This requires adding a new binaryheap_reset() method to binaryheap.c,
but that probably should have been there anyway.
Per bug #8410 from Terje Elde. Diagnosis and patch by Andres Freund.
This change will only apply to mingw compilers, and has been found
necessary by late versions of the mingw-w64 compiler. It's the same as
what is done elsewhere for the Microsoft compilers.
Backpatch of commit 73838b5251.
Problem reported by Michael Cronenworth, although not his patch.
It's possible that inlining of SQL functions (or perhaps other changes?)
has exposed typmod information not known at parse time. In such cases,
Vars generated by query_planner might have valid typmod values while the
original grouping columns only have typmod -1. This isn't a semantic
problem since the behavior of grouping only depends on type not typmod,
but it breaks locate_grouping_columns' use of tlist_member to locate the
matching entry in query_planner's result tlist.
We can fix this without an excessive amount of new code or complexity by
relying on the fact that locate_grouping_columns only gets called when
make_subplanTargetList has set need_tlist_eval == false, and that can only
happen if all the grouping columns are simple Vars. Therefore we only need
to search the sub_tlist for a matching Var, and we can reasonably define a
"match" as being a match of the Var identity fields
varno/varattno/varlevelsup. The code still Asserts that vartype matches,
but ignores vartypmod.
Per bug #8393 from Evan Martin. The added regression test case is
basically the same as his example. This has been broken for a very long
time, so back-patch to all supported branches.
When upgrading from servers of versions 9.2 and older, and MultiXactIds
have been used in the old server beyond the first page (that is, 2048
multis or more in the default 8kB-page build), pg_upgrade would set the
next multixact offset to use beyond what has been allocated in the new
cluster. This would cause a failure the first time the new cluster
needs to use this value, because the pg_multixact/offsets/ file wouldn't
exist or wouldn't be large enough. To fix, ensure that the transient
server instances launched by pg_upgrade extend the file as necessary.
Per report from Jesse Denardo in
CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
The planner largely failed to consider the possibility that a
PlaceHolderVar's expression might contain a lateral reference to a Var
coming from somewhere outside the PHV's syntactic scope. We had a previous
report of a problem in this area, which I tried to fix in a quick-hack way
in commit 4da6439bd8, but Antonin Houska
pointed out that there were still some problems, and investigation turned
up other issues. This patch largely reverts that commit in favor of a more
thoroughly thought-through solution. The new theory is that a PHV's
ph_eval_at level cannot be higher than its original syntactic level. If it
contains lateral references, those don't change the ph_eval_at level, but
rather they create a lateral-reference requirement for the ph_eval_at join
relation. The code in joinpath.c needs to handle that.
Another issue is that createplan.c wasn't handling nested PlaceHolderVars
properly.
In passing, push knowledge of lateral-reference checks for join clauses
into join_clause_is_movable_to. This is mainly so that FDWs don't need
to deal with it.
This patch doesn't fix the original join-qual-placement problem reported by
Jeremy Evans (and indeed, one of the new regression test cases shows the
wrong answer because of that). But the PlaceHolderVar problems need to be
fixed before that issue can be addressed, so committing this separately
seems reasonable.
The planner logic that attempted to make a preliminary estimate of the
ph_needed levels for PlaceHolderVars seems to be completely broken by
lateral references. Fortunately, the potential join order optimization
that this code supported seems to be of relatively little value in
practice; so let's just get rid of it rather than trying to fix it.
Getting rid of this allows fairly substantial simplifications in
placeholder.c, too, so planning in such cases should be a bit faster.
Issue noted while pursuing bugs reported by Jeremy Evans and Antonin
Houska, though this doesn't in itself fix either of their reported cases.
What this does do is prevent an Assert crash in the kind of query
illustrated by the added regression test. (I'm not sure that the plan for
that query is stable enough across platforms to be usable as a regression
test output ... but we'll soon find out from the buildfarm.)
Back-patch to 9.3. The problem case can't arise without LATERAL, so
no need to touch older branches.
We've seen multiple cases of people looking at the postmaster's original
stderr output to try to diagnose problems, not realizing/remembering that
their logging configuration is set up to send log messages somewhere else.
This seems particularly likely to happen in prepackaged distributions,
since many packagers patch the code to change the factory-standard logging
configuration to something more in line with their platform conventions.
In hopes of reducing confusion, emit a LOG message about this at the point
in startup where we are about to switch log output away from the original
stderr, providing a pointer to where to look instead. This message will
appear as the last thing in the original stderr output. (We might later
also try to emit such link messages when logging parameters are changed
on-the-fly; but that case seems to be both noticeably harder to do nicely,
and much less frequently a problem in practice.)
Per discussion, back-patch to 9.3 but not further.
My tweak of these error messages in commit c359a1b082 contained the
thinko that a query would always have rowMarks set for a query
containing a locking clause. Not so: when declaring a cursor, for
instance, rowMarks isn't set at the point we're checking, so we'd be
dereferencing a NULL pointer.
The fix is to pass the lock strength to the function raising the error,
instead of trying to reverse-engineer it. The result not only is more
robust, but it also seems cleaner overall.
Per report from Robert Haas.
plpgsql often just remembers SPI-result tuple tables in local variables,
and has no mechanism for freeing them if an ereport(ERROR) causes an escape
out of the execution function whose local variable it is. In the original
coding, that wasn't a problem because the tuple table would be cleaned up
when the function's SPI context went away during transaction abort.
However, once plpgsql grew the ability to trap exceptions, repeated
trapping of errors within a function could result in significant
intra-function-call memory leakage, as illustrated in bug #8279 from
Chad Wagner.
We could fix this locally in plpgsql with a bunch of PG_TRY/PG_CATCH
coding, but that would be tedious, probably slow, and prone to bugs of
omission; moreover it would do nothing for similar risks elsewhere.
What seems like a better plan is to make SPI itself responsible for
freeing tuple tables at subtransaction abort. This patch attacks the
problem that way, keeping a list of live tuple tables within each SPI
function context. Currently, such freeing is automatic for tuple tables
made within the failed subtransaction. We might later add a SPI call to
mark a tuple table as not to be freed this way, allowing callers to opt
out; but until someone exhibits a clear use-case for such behavior, it
doesn't seem worth bothering.
A very useful side-effect of this change is that SPI_freetuptable() can
now defend itself against bad calls, such as duplicate free requests;
this should make things more robust in many places. (In particular,
this reduces the risks involved if a third-party extension contains
now-redundant SPI_freetuptable() calls in error cleanup code.)
Even though the leakage problem is of long standing, it seems imprudent
to back-patch this into stable branches, since it does represent an API
semantics change for SPI users. We'll patch this in 9.3, but live with
the leakage in older branches.
Previously one had to use slist_delete(), implying an additional scan of
the list, making this infrastructure considerably less efficient than
traditional Lists when deletion of element(s) in a long list is needed.
Modify the slist_foreach_modify() macro to support deleting the current
element in O(1) time, by keeping a "prev" pointer in addition to "cur"
and "next". Although this makes iteration with this macro a bit slower,
no real harm is done, since in any scenario where you're not going to
delete the current list element you might as well just use slist_foreach
instead. Improve the comments about when to use each macro.
Back-patch to 9.3 so that we'll have consistent semantics in all branches
that provide ilist.h. Note this is an ABI break for callers of
slist_foreach_modify().
Andres Freund and Tom Lane
It's possible to drop a column from an input table of a JOIN clause in a
view, if that column is nowhere actually referenced in the view. But it
will still be there in the JOIN clause's joinaliasvars list. We used to
replace such entries with NULL Const nodes, which is handy for generation
of RowExpr expansion of a whole-row reference to the view. The trouble
with that is that it can't be distinguished from the situation after
subquery pull-up of a constant subquery output expression below the JOIN.
Instead, replace such joinaliasvars with null pointers (empty expression
trees), which can't be confused with pulled-up expressions. expandRTE()
still emits the old convention, though, for convenience of RowExpr
generation and to reduce the risk of breaking extension code.
In HEAD and 9.3, this patch also fixes a problem with some new code in
ruleutils.c that was failing to cope with implicitly-casted joinaliasvars
entries, as per recent report from Feike Steenbergen. That oversight was
because of an inadequate description of the data structure in parsenodes.h,
which I've now corrected. There were some pre-existing oversights of the
same ilk elsewhere, which I believe are now all fixed.
In commit 0ac5ad5134 I changed some error messages from "FOR
UPDATE/SHARE" to a rather long gobbledygook which nobody liked. Then,
in commit cb9b66d31 I changed them again, but the alternative chosen
there was deemed suboptimal by Peter Eisentraut, who in message
1373937980.20441.8.camel@vanquo.pezone.net proposed an alternative
involving a dynamically-constructed string based on the actual locking
strength specified in the SQL command. This patch implements that
suggestion.
Commit 7f7485a0cd made these changes
in master; per discussion, backport the API changes (but not the
functional changes), so that people don't get used to the 9.3 API
only to see it get broken in the next release. There are already
some people coding to the original 9.3 API, and this will cause
minor breakage, but there will be even more if we wait until next
year to roll out these changes.
Per discussion on pgsql-hackers, these aren't really needed. Interim
versions of the background worker patch had the worker starting with
signals already unblocked, which would have made this necessary.
But the final version does not, so we don't really need it; and it
doesn't work well with the new facility for starting dynamic background
workers, so just rip it out.
Also per discussion on pgsql-hackers, back-patch this change to 9.3.
It's best to get the API break out of the way before we do an
official release of this facility, to avoid more pain for extension
authors later.
The new JSON API uses a bit of an unusual typedef scheme, where for
example OkeysState is a pointer to okeysState. And that's not applied
consistently either. Change that to the more usual PostgreSQL style
where struct typedefs are upper case, and use pointers explicitly.
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.
Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.
In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
Back-patch to 9.3.
SP-GiST's original scheme for avoiding deadlocks during concurrent index
insertions doesn't work, as per report from Hailong Li, and there isn't any
evident way to make it work completely. We could possibly lock individual
inner tuples instead of their whole pages, but preliminary experimentation
suggests that the performance penalty would be huge. Instead, if we fail
to get a buffer lock while descending the tree, just restart the tree
descent altogether. We keep the old tuple positioning rules, though, in
hopes of reducing the number of cases where this can happen.
Teodor Sigaev, somewhat edited by Tom Lane
pg_filedump and other external utility programs are likely to want to be
able to check Postgres page checksums. To avoid messy duplication of code,
move the checksumming functionality into an exported header file, much as
we did awhile back for the CRC code.
In passing, get rid of an unportable assumption that a static char[] array
will be word-aligned, and do some other minor code beautification.
Extend the FDW API (which we already changed for 9.3) so that an FDW can
report whether specific foreign tables are insertable/updatable/deletable.
The default assumption continues to be that they're updatable if the
relevant executor callback function is supplied by the FDW, but finer
granularity is now possible. As a test case, add an "updatable" option to
contrib/postgres_fdw.
This patch also fixes the information_schema views, which previously did
not think that foreign tables were ever updatable, and fixes
view_is_auto_updatable() so that a view on a foreign table can be
auto-updatable.
initdb forced due to changes in information_schema views and the functions
they rely on. This is a bit unfortunate to do post-beta1, but if we don't
change this now then we'll have another API break for FDWs when we do
change it.
Dean Rasheed, somewhat editorialized on by Tom Lane
Use the same gcc atomic functions as we do on newer ARM chips.
(Basically this is a copy and paste of the __arm__ code block,
but omitting the SWPB option since that definitely won't work.)
Back-patch to 9.2. The patch would work further back, but we'd also
need to update config.guess/config.sub in older branches to make them
build out-of-the-box, and there hasn't been demand for it.
Mark Salter
This reverts commit a475c60367.
Erik Rijkers reported back in January 2013 that after the patch, if you do
"pg_dump -t myschema.mytable" to dump a single table, and restore that in
a database where myschema does not exist, the table is silently created in
pg_catalog instead. That is because pg_dump uses
"SET search_path=myschema, pg_catalog" to set schema the table is created
in. While allow_system_table_mods is not a very elegant solution to this,
we can't leave it as it is, so for now, revert it back to the way it was
previously.
What we have implemented is a radix tree (or a radix trie or a patricia
trie), but the docs and code comments incorrectly called it a "suffix tree".
Alexander Korotkov
Previously this state was represented by whether the view's disk file had
zero or nonzero size, which is problematic for numerous reasons, since it's
breaking a fundamental assumption about heap storage. This was done to
allow unlogged matviews to revert to unpopulated status after a crash
despite our lack of any ability to update catalog entries post-crash.
However, this poses enough risk of future problems that it seems better to
not support unlogged matviews until we can find another way. Accordingly,
revert that choice as well as a number of existing kluges forced by it
in favor of creating a pg_class.relispopulated flag column.
This patch gets rid of the concept of, and infrastructure for,
non-canonical PathKeys; we now only ever create canonical pathkey lists.
The need for non-canonical pathkeys came from the desire to have
grouping_planner initialize query_pathkeys and related pathkey lists before
calling query_planner. However, since query_planner didn't actually *do*
anything with those lists before they'd been made canonical, we can get rid
of the whole mess by just not creating the lists at all until the point
where we formerly canonicalized them.
There are several ways in which we could implement that without making
query_planner itself deal with grouping/sorting features (which are
supposed to be the province of grouping_planner). I chose to add a
callback function to query_planner's API; other alternatives would have
required adding more fields to PlannerInfo, which while not bad in itself
would create an ABI break for planner-related plugins in the 9.2 release
series. This still breaks ABI for anything that calls query_planner
directly, but it seems somewhat unlikely that there are any such plugins.
I had originally conceived of this change as merely a step on the way to
fixing bug #8049 from Teun Hoogendoorn; but it turns out that this fixes
that bug all by itself, as per the added regression test. The reason is
that now get_eclass_for_sort_expr is adding the ORDER BY expression at the
end of EquivalenceClass creation not the start, and so anything that is in
a multi-member EquivalenceClass has already been created with correct
em_nullable_relids. I am suspicious that there are related scenarios in
which we still need to teach get_eclass_for_sort_expr to compute correct
nullable_relids, but am not eager to risk destabilizing either 9.2 or 9.3
to fix bugs that are only hypothetical. So for the moment, do this and
stop here.
Back-patch to 9.2 but not to earlier branches, since they don't exhibit
this bug for lack of join-clause-movement logic that depends on
em_nullable_relids being correct. (We might have to revisit that choice
if any related bugs turn up.) In 9.2, don't change the signature of
make_pathkeys_for_sortclauses nor remove canonicalize_pathkeys, so as
not to risk more plugin breakage than we have to.
Isolate checksum calculation to its own module, so that bufpage
knows little if anything about the details of the calculation.
This implementation is a modified FNV-1a hash checksum, details
of which are given in the new checksum.c header comments.
Basic implementation only, so we fix the output value.
Later related commits will add version numbers to pg_control,
compiler optimization flags and memory barriers.
Ants Aasma, reviewed by Jeff Davis and Simon Riggs
Choose a saner ordering of parameters (adding a new input param after
the output params seemed a bit random), update the function's header
comment to match reality (cmon folks, is this really that hard?),
get rid of useless and sloppily-defined distinction between
PROCESS_UTILITY_SUBCOMMAND and PROCESS_UTILITY_GENERATED.
Move checking for unscannable matviews into ExecOpenScanRelation, which is
a better place for it first because the open relation is already available
(saving a relcache lookup cycle), and second because this eliminates the
problem of telling the difference between rangetable entries that will or
will not be scanned by the query. In particular we can get rid of the
not-terribly-well-thought-out-or-implemented isResultRel field that the
initial matviews patch added to RangeTblEntry.
Also get rid of entirely unnecessary scannability check in the rewriter,
and a bogus decision about whether RefreshMatViewStmt requires a parse-time
snapshot.
catversion bump due to removal of a RangeTblEntry field, which changes
stored rules.
In most cases, these were just references to the SQL standard in
general. In a few cases, a contrast was made between SQL92 and later
standards -- those have been kept unchanged.