Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
This warning is new in gcc 4.6 and part of -Wall. This patch cleans
up most of the noise, but there are some still warnings that are
trickier to remove.
In nearly all cases, the caller already knows the correct collation, and
in a number of places, the value the caller has handy is more correct than
the default for the type would be. (In particular, this patch makes it
significantly less likely that eval_const_expressions will result in
changing the exposed collation of an expression.) So an internal lookup
is both expensive and wrong.
Ensure that parameter symbols receive collation from the function's
resolved input collation, and fix inlining to behave properly.
BTW, this commit lays about 90% of the infrastructure needed to support
use of argument names in SQL functions. Parsing of parameters is now
done via the parser-hook infrastructure ... we'd just need to supply
a column-ref hook ...
All expression nodes now have an explicit output-collation field, unless
they are known to only return a noncollatable data type (such as boolean
or record). Also, nodes that can invoke collation-aware functions store
a separate field that is the collation value to pass to the function.
This avoids confusion that arises when a function has collatable inputs
and noncollatable output type, or vice versa.
Also, replace the parser's on-the-fly collation assignment method with
a post-pass over the completed expression tree. This allows us to use
a more complex (and hopefully more nearly spec-compliant) assignment
rule without paying for it in extra storage in every expression node.
Fix assorted bugs in the planner's handling of collations by making
collation one of the defining properties of an EquivalenceClass and
by converting CollateExprs into discardable RelabelType nodes during
expression preprocessing.
CollateClause is now used only in raw grammar output, and CollateExpr after
parse analysis. This is for clarity and to avoid carrying collation names
in post-analysis parse trees: that's both wasteful and possibly misleading,
since the collation's name could be changed while the parsetree still
exists.
Also, clean up assorted infelicities and omissions in processing of the
node type.
With this patch, portals, SQL functions, and SPI all agree that there
should be only a CommandCounterIncrement between the queries that are
generated from a single SQL command by rule expansion. Fetching a whole
new snapshot now happens only between original queries. This is equivalent
to the existing behavior of EXPLAIN ANALYZE, and it was judged to be the
best choice since it eliminates one source of concurrency hazards for
rules. The patch should also make things marginally faster by reducing the
number of snapshot push/pop operations.
The patch removes pg_parse_and_rewrite(), which is no longer used anywhere.
There was considerable discussion about more aggressive refactoring of the
query-processing functions exported by postgres.c, but for the moment
nothing more has been done there.
I also took the opportunity to refactor snapmgr.c's API slightly: the
former PushUpdatedSnapshot() has been split into two functions.
Marko Tiikkaja, reviewed by Steve Singer and Tom Lane
The originally committed patch for modifying CTEs didn't interact well
with EXPLAIN, as noted by myself, and also had corner-case problems with
triggers, as noted by Dean Rasheed. Those problems show it is really not
practical for ExecutorEnd to call any user-defined code; so split the
cleanup duties out into a new function ExecutorFinish, which must be called
between the last ExecutorRun call and ExecutorEnd. Some Asserts have been
added to these functions to help verify correct usage.
It is no longer necessary for callers of the executor to call
AfterTriggerBeginQuery/AfterTriggerEndQuery for themselves, as this is now
done by ExecutorStart/ExecutorFinish respectively. If you really need to
suppress that and do it for yourself, pass EXEC_FLAG_SKIP_TRIGGERS to
ExecutorStart.
Also, refactor portal commit processing to allow for the possibility that
PortalDrop will invoke user-defined code. I think this is not actually
necessary just yet, since the portal-execution-strategy logic forces any
non-pure-SELECT query to be run to completion before we will consider
committing. But it seems like good future-proofing.
We need ExecutorEnd to run the ModifyTable nodes to completion in
reverse order of initialization, not forward order. Easily done
by constructing the list back-to-front.
This patch implements data-modifying WITH queries according to the
semantics that the updates all happen with the same command counter value,
and in an unspecified order. Therefore one WITH clause can't see the
effects of another, nor can the outer query see the effects other than
through the RETURNING values. And attempts to do conflicting updates will
have unpredictable results. We'll need to document all that.
This commit just fixes the code; documentation updates are waiting on
author.
Marko Tiikkaja and Hitoshi Harada
This was a leftover from the pre-8.1 design of junkfilters. It doesn't
seem to have any reason to live, since it's merely a combination of two
easy function calls, and not a well-designed combination at that (it
encourages callers to leak the result tuple).
ExecUpdate checked for whether ExecBRUpdateTriggers had returned a new
tuple value by seeing if the returned tuple was pointer-equal to the old
one. But the "old one" was in estate->es_junkFilter's result slot, which
would be scribbled on if we had done an EvalPlanQual update in response to
a concurrent update of the target tuple; therefore we were comparing a
dangling pointer to a live one. Given the right set of circumstances we
could get a false match, resulting in not forcing the tuple to be stored in
the slot we thought it was stored in. In the case reported by Maxim Boguk
in bug #5798, this led to "cannot extract system attribute from virtual
tuple" failures when trying to do "RETURNING ctid". I believe there is a
very-low-probability chance of more serious errors, such as generating
incorrect index entries based on the original rather than the
trigger-modified version of the row.
In HEAD, change all of ExecBRInsertTriggers, ExecIRInsertTriggers,
ExecBRUpdateTriggers, and ExecIRUpdateTriggers so that they continue to
have similar APIs. In the back branches I just changed
ExecBRUpdateTriggers, since there is no bug in the ExecBRInsertTriggers
case.
This commit provides the core code and documentation needed. A contrib
module test case will follow shortly.
Shigeru Hanada, Jan Urbanski, Heikki Linnakangas
Flattening of subquery range tables during setrefs.c could lead to the
rangetable indexes in PlanRowMark nodes not matching up with the column
names previously assigned to the corresponding resjunk ctid (resp. tableoid
or wholerow) columns. Typical symptom would be either a "cannot extract
system attribute from virtual tuple" error or an Assert failure. This
wasn't a problem before 9.0 because we didn't support FOR UPDATE below the
top query level, and so the final flattening could never renumber an RTE
that was relevant to FOR UPDATE. Fix by using a plan-tree-wide unique
number for each PlanRowMark to label the associated resjunk columns, so
that the number need not change during flattening.
Per report from David Johnston (though I'm darned if I can see how this got
past initial testing of the relevant code). Back-patch to 9.0.
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.
Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
In an inherited UPDATE/DELETE, each target table has its own subplan,
because it might have a column set different from other targets. This
means that the resjunk columns we add to support EvalPlanQual might be
at different physical column numbers in each subplan. The EvalPlanQual
rewrite I did for 9.0 failed to account for this, resulting in possible
misbehavior or even crashes during concurrent updates to the same row,
as seen in a recent report from Gordon Shannon. Revise the data structure
so that we track resjunk column numbers separately for each subplan.
I also chose to move responsibility for identifying the physical column
numbers back to executor startup, instead of assuming that numbers derived
during preprocess_targetlist would stay valid throughout subsequent
massaging of the plan. That's a bit slower, so we might want to consider
undoing it someday; but it would complicate the patch considerably and
didn't seem justifiable in a bug fix that has to be back-patched to 9.0.
Foreign tables are a core component of SQL/MED. This commit does
not provide a working SQL/MED infrastructure, because foreign tables
cannot yet be queried. Support for foreign table scans will need to
be added in a future patch. However, this patch creates the necessary
system catalog structure, syntax support, and support for ancillary
operations such as COMMENT and SECURITY LABEL.
Shigeru Hanada, heavily revised by Robert Haas
This is advantageous first because it allows us to hash the smaller table
regardless of the outer-join type, and second because hash join can be more
flexible than merge join in dealing with arbitrary join quals in a FULL
join. For merge join all the join quals have to be mergejoinable, but hash
join will work so long as there's at least one hashjoinable qual --- the
others can be any condition. (This is true essentially because we don't
keep per-inner-tuple match flags in merge join, while hash join can do so.)
To do this, we need a has-it-been-matched flag for each tuple in the
hashtable, not just one for the current outer tuple. The key idea that
makes this practical is that we can store the match flag in the tuple's
infomask, since there are lots of bits there that are of no interest for a
MinimalTuple. So we aren't increasing the size of the hashtable at all for
the feature.
To write this without turning the hash code into even more of a pile of
spaghetti than it already was, I rewrote ExecHashJoin in a state-machine
style, similar to ExecMergeJoin. Other than that decision, it was pretty
straightforward.
This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp. It also removes removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.
For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previous depended on
rd_istemp.
This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.
This is a heavily revised version of builtin_knngist_core-0.9. The
ordering operators are no longer mixed in with actual quals, which would
have confused not only humans but significant parts of the planner.
Instead, ordering operators are carried separately throughout planning and
execution.
Since the API for ambeginscan and amrescan functions had to be changed
anyway, this commit takes the opportunity to rationalize that a bit.
RelationGetIndexScan no longer forces a premature index_rescan call;
instead, callers of index_beginscan must call index_rescan too. Aside from
making the AM-side initialization logic a bit less peculiar, this has the
advantage that we do not make a useless extra am_rescan call when there are
runtime key values. AMs formerly could not assume that the key values
passed to amrescan were actually valid; now they can.
Teodor Sigaev and Tom Lane
There were corner cases in which the planner would attempt to inline such
a function, which would result in a failure at runtime due to loss of
information about exactly what the result record type is. Fix by disabling
inlining when the function's recorded result type is RECORD. There might
be some sub-cases where inlining could still be allowed, but this is a
simple and backpatchable fix, so leave refinements for another day.
Per bug #5777 from Nate Carson.
Back-patch to all supported branches. 8.1 happens to avoid a core-dump
here, but it still does the wrong thing.
If we have Limit->Result->Sort, the Result might be projecting a tlist
that contains a set-returning function. If so, it's possible for the
SRF to sometimes return zero rows, which means we could need to fetch
more than N rows from the Sort in order to satisfy LIMIT N.
So top-N sorting cannot be used in this scenario.
Fix things so that top-N sorting can be used in child Sort nodes of a
MergeAppend node, when there is a LIMIT and no intervening joins or
grouping. Actually doing this on the executor side isn't too bad,
but it's a bit messier to get the planner to cost it properly.
Per gripe from Robert Haas.
In passing, fix an oversight in the original top-N-sorting patch:
query_planner should not assume that a LIMIT can be used to make an
explicit sort cheaper when there will be grouping or aggregation in
between. Possibly this should be back-patched, but I'm not sure the
mistake is serious enough to be a real problem in practice.
and related routines.
We already had a redundant FunctionCallInfoData struct in FuncExprState,
but were using that copy only in set-returning-function cases, to avoid
keeping function evaluation state in the expression tree for the benefit
of plpgsql's "simple expression" logic. But of course that didn't work
anyway. Given the recent fixes in plpgsql there is no need to have two
separate behaviors here. Getting rid of the local FunctionCallInfoData
structs should make things a little faster (because we don't need to do
InitFunctionCallInfoData each time), and it also makes for a noticeable
reduction in stack space consumption during recursive calls.
This patch eliminates the former need to sort the output of an Append scan
when an ordered scan of an inheritance tree is wanted. This should be
particularly useful for fast-start cases such as queries with LIMIT.
Original patch by Greg Stark, with further hacking by Hans-Jurgen Schonig,
Robert Haas, and Tom Lane.
This patch adds the SQL-standard concept of an INSTEAD OF trigger, which
is fired instead of performing a physical insert/update/delete. The
trigger function is passed the entire old and/or new rows of the view,
and must figure out what to do to the underlying tables to implement
the update. So this feature can be used to implement updatable views
using trigger programming style rather than rule hacking.
In passing, this patch corrects the names of some columns in the
information_schema.triggers view. It seems the SQL committee renamed
them somewhere between SQL:99 and SQL:2003.
Dean Rasheed, reviewed by Bernd Helmle; some additional hacking by me.
transaction snapshots, i.e. a snapshot registered at the beginning of
a transaction. Change variable naming and comments to reflect this reality
in preparation for a future, truly serializable mode, e.g.
Serializable Snapshot Isolation (SSI).
For the moment transaction snapshots are still used to implement
SERIALIZABLE, but hopefully not for too much longer. Patch by Kevin
Grittner and Dan Ports with review and some minor wording changes by me.
returning "record" actually do have the same rowtype. This is needed because
the parser can't realistically enforce that they will all have the same typmod,
as seen in a recent example from David Wheeler.
Back-patch to 8.0, which is as far back as we have the notion of RECORD
subtypes being distinguished by typmod. Wheeler's example depends on
8.4-and-up features, but I suspect there may be ways to provoke similar
failures before 8.4.
ExecModifyTable(). This avoids memory leakage when trigger functions leave
junk behind in that context (as they more or less must). Problem and solution
identified by Dean Rasheed.
I'm a bit concerned about the longevity of this solution --- once a plan can
have multiple ModifyTable nodes, we are very possibly going to have to do
something different. But it should hold up for 9.0.
unqualified names.
- Add a missing_ok parameter to get_tablespace_oid.
- Avoid duplicating get_tablespace_od guts in objectNamesToOids.
- Add a missing_ok parameter to get_database_oid.
- Replace get_roleid and get_role_checked with get_role_oid.
- Add get_namespace_oid, get_language_oid, get_am_oid.
- Refactor existing code to use new interfaces.
Thanks to KaiGai Kohei for the review.
list in ExecLockRows() forgot to allow for the possibility that some of the
rowmarks are for child tables that aren't relevant to the current row.
Per report from Kenichiro Tanaka.
a pass-by-reference datatype with a nontrivial projection step.
We were using the same memory context for the projection operation as for
the temporary context used by the hashtable routines in execGrouping.c.
However, the hashtable routines feel free to reset their temp context at
any time, which'd lead to destroying input data that was still needed.
Report and diagnosis by Tao Ma.
Back-patch to 8.1, where the problem was introduced by the changes that
allowed us to work with "virtual" tuples instead of materializing intermediate
tuple values everywhere. The earlier code looks quite similar, but it doesn't
suffer the problem because the data gets copied into another context as a
result of having to materialize ExecProject's output tuple.
Remove bespoke code in DoCopy and RI_Initial_Check, which now instead
fabricate call ExecCheckRTPerms with a manufactured RangeTblEntry.
This is intended to make it feasible for an enhanced security provider
to actually make use of ExecutorCheckPerms_hook, but also has the
advantage that RI_Initial_Check can allow use of the fast-path when
column-level but not table-level permissions are present.
KaiGai Kohei. Reviewed (in an earlier version) by Stephen Frost, and by me.
Some further changes to the comments by me.
from defining non-self-conflicting constraints.
Jeff Davis
Note: I (tgl) objected to removing this check in 9.0 on the grounds that it
was an important sanity check in new, poorly tested code. However, it should
be all right to remove it for 9.1, since we'll get field testing from the
9.0 branch.
relation using the general PARAM_EXEC executor parameter mechanism, rather
than the ad-hoc kluge of passing the outer tuple down through ExecReScan.
The previous method was hard to understand and could never be extended to
handle parameters coming from multiple join levels. This patch doesn't
change the set of possible plans nor have any significant performance effect,
but it's necessary infrastructure for future generalization of the concept
of an inner indexscan plan.
ExecReScan's second parameter is now unused, so it's removed.
This hook allows a loadable module to gain control when table permissions
are checked. It is expected to be used by an eventual SE-PostgreSQL
implementation, but there are other possible applications as well. A
sample contrib module can be found in the archives at:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01095.php
Robert Haas and Stephen Frost
is treated like end-of-input, if nulls sort last in that column and we are not
doing outer-join filling for that input. In such a case, the tuple cannot
join to anything from the other input (because we assume mergejoinable
operators are strict), and neither can any tuple following it in the sort
order. If we're not interested in doing outer-join filling we can just
pretend the tuple and its successors aren't there at all. This can save a
great deal of time in situations where there are many nulls in the join
column, as in a recent example from Scott Marlowe. Also, since the planner
tends to not count nulls in its mergejoin scan selectivity estimates, this
is an important fix to make the runtime behavior more like the estimate.
I regard this as an omission in the patch I wrote years ago to teach mergejoin
that tuples containing nulls aren't joinable, so I'm back-patching it. But
only to 8.3 --- in older versions, we didn't have a solid notion of whether
nulls sort high or low, so attempting to apply this optimization could break
things.
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.