Commit e7cb7ee14555cc9c5773e2c102efd6371f6f2005 included some design
decisions that seem pretty questionable to me, and there was quite a lot
of stuff not to like about the documentation and comments. Clean up
as follows:
* Consider foreign joins only between foreign tables on the same server,
rather than between any two foreign tables with the same underlying FDW
handler function. In most if not all cases, the FDW would simply have had
to apply the same-server restriction itself (far more expensively, both for
lack of caching and because it would be repeated for each combination of
input sub-joins), or else risk nasty bugs. Anyone who's really intent on
doing something outside this restriction can always use the
set_join_pathlist_hook.
* Rename fdw_ps_tlist/custom_ps_tlist to fdw_scan_tlist/custom_scan_tlist
to better reflect what they're for, and allow these custom scan tlists
to be used even for base relations.
* Change make_foreignscan() API to include passing the fdw_scan_tlist
value, since the FDW is required to set that. Backwards compatibility
doesn't seem like an adequate reason to expect FDWs to set it in some
ad-hoc extra step, and anyway existing FDWs can just pass NIL.
* Change the API of path-generating subroutines of add_paths_to_joinrel,
and in particular that of GetForeignJoinPaths and set_join_pathlist_hook,
so that various less-used parameters are passed in a struct rather than
as separate parameter-list entries. The objective here is to reduce the
probability that future additions to those parameter lists will result in
source-level API breaks for users of these hooks. It's possible that this
is even a small win for the core code, since most CPU architectures can't
pass more than half a dozen parameters efficiently anyway. I kept root,
joinrel, outerrel, innerrel, and jointype as separate parameters to reduce
code churn in joinpath.c --- in particular, putting jointype into the
struct would have been problematic because of the subroutines' habit of
changing their local copies of that variable.
* Avoid ad-hocery in ExecAssignScanProjectionInfo. It was probably all
right for it to know about IndexOnlyScan, but if the list is to grow
we should refactor the knowledge out to the callers.
* Restore nodeForeignscan.c's previous use of the relcache to avoid
extra GetFdwRoutine lookups for base-relation scans.
* Lots of cleanup of documentation and missed comments. Re-order some
code additions into more logical places.
Foreign data wrappers can use this capability for so-called "join
pushdown"; that is, instead of executing two separate foreign scans
and then joining the results locally, they can generate a path which
performs the join on the remote server and then is scanned locally.
This commit does not extend postgres_fdw to take advantage of this
capability; it just provides the infrastructure.
Custom scan providers can use this in a similar way. Previously,
it was only possible for a custom scan provider to scan a single
relation. Now, it can scan an entire join tree, provided of course
that it knows how to produce the same results that the join would
have produced if executed normally.
KaiGai Kohei, reviewed by Shigeru Hanada, Ashutosh Bapat, and me.
This commit changes index-only scans so that data is read directly from the
index tuple without first generating a faux heap tuple. The only immediate
benefit is that indexes on system columns (such as OID) can be used in
index-only scans, but this is necessary infrastructure if we are ever to
support index-only scans on expression indexes. The executor is now ready
for that, though the planner still needs substantial work to recognize
the possibility.
To do this, Vars in index-only plan nodes have to refer to index columns
not heap columns. I introduced a new special varno, INDEX_VAR, to mark
such Vars to avoid confusion. (In passing, this commit renames the two
existing special varnos to OUTER_VAR and INNER_VAR.) This allows
ruleutils.c to handle them with logic similar to what we use for subplan
reference Vars.
Since index-only scans are now fundamentally different from regular
indexscans so far as their expression subtrees are concerned, I also chose
to change them to have their own plan node type (and hence, their own
executor source file).
This provides information about the numbers of tuples that were visited
but not returned by table scans, as well as the numbers of join tuples
that were considered and discarded within a join plan node.
There is still some discussion going on about the best way to report counts
for outer-join situations, but I think most of what's in the patch would
not change if we revise that, so I'm going to go ahead and commit it as-is.
Documentation changes to follow (they weren't in the submitted patch
either).
Marko Tiikkaja, reviewed by Marc Cousin, somewhat revised by Tom
there's no quals or projections. Currently this only matters for foreign
scans, as none of the other scan nodes litter the per-tuple memory context
when there's no quals or projections.
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
made query plan. Use of ALTER COLUMN TYPE creates a hazard for cached
query plans: they could contain Vars that claim a column has a different
type than it now has. Fix this by checking during plan startup that Vars
at relation scan level match the current relation tuple descriptor. Since
at that point we already have at least AccessShareLock, we can be sure the
column type will not change underneath us later in the query. However,
since a backend's locks do not conflict against itself, there is still a
hole for an attacker to exploit: he could try to execute ALTER COLUMN TYPE
while a query is in progress in the current backend. Seal that hole by
rejecting ALTER TABLE whenever the target relation is already open in
the current backend.
This is a significant security hole: not only can one trivially crash the
backend, but with appropriate misuse of pass-by-reference datatypes it is
possible to read out arbitrary locations in the server process's memory,
which could allow retrieving database content the user should not be able
to see. Our thanks to Jeff Trout for the initial report.
Security: CVE-2007-0556
involving unions of types having typmods. Variants of the failure are known
to occur in 8.1 and up; not sure if it's possible in 8.0 and 7.4, but since
the code exists that far back, I'll just patch 'em all. Per report from
Brian Hurt.
aren't doing anything useful (ie, neither selection nor projection).
Also, extend to SubqueryScan the hacks already in place to avoid
unnecessary ExecProject calls when the result would just be the same
tuple the subquery already delivered. This saves some overhead in
UNION and other set operations, as well as avoiding overhead for
unflatten-able subqueries. Per example from Sokolov Yura.
of tuples when passing data up through multiple plan nodes. A slot can now
hold either a normal "physical" HeapTuple, or a "virtual" tuple consisting
of Datum/isnull arrays. Upper plan levels can usually just copy the Datum
arrays, avoiding heap_formtuple() and possible subsequent nocachegetattr()
calls to extract the data again. This work extends Atsushi Ogawa's earlier
patch, which provided the key idea of adding Datum arrays to TupleTableSlots.
(I believe however that something like this was foreseen way back in Berkeley
days --- see the old comment on ExecProject.) A test case involving many
levels of join of fairly wide tables (about 80 columns altogether) showed
about 3x overall speedup, though simple queries will probably not be
helped very much.
I have also duplicated some code in heaptuple.c in order to provide versions
of heap_formtuple and friends that use "bool" arrays to indicate null
attributes, instead of the old convention of "char" arrays containing either
'n' or ' '. This provides a better match to the convention used by
ExecEvalExpr. While I have not made a concerted effort to get rid of uses
of the old routines, I think they should be deprecated and eventually removed.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
In the past, we used a 'Lispy' linked list implementation: a "list" was
merely a pointer to the head node of the list. The problem with that
design is that it makes lappend() and length() linear time. This patch
fixes that problem (and others) by maintaining a count of the list
length and a pointer to the tail node along with each head node pointer.
A "list" is now a pointer to a structure containing some meta-data
about the list; the head and tail pointers in that structure refer
to ListCell structures that maintain the actual linked list of nodes.
The function names of the list API have also been changed to, I hope,
be more logically consistent. By default, the old function names are
still available; they will be disabled-by-default once the rest of
the tree has been updated to use the new API names.
when scanning a table that we need all the columns from. In case of
SELECT INTO, we have to check that the hasoids flag matches the desired
output type, too. Per report from Mike Mascari.
nodes where it's not really necessary. In many cases where the scan node
is not the topmost plan node (eg, joins, aggregation), it's possible to
just return the table tuple directly instead of generating an intermediate
projection tuple. In preliminary testing, this reduced the CPU time
needed for 'SELECT COUNT(*) FROM foo' by about 10%.
to plan nodes, not vice-versa. All executor state nodes now inherit from
struct PlanState. Copying of plan trees has been simplified by not
storing a list of SubPlans in Plan nodes (eliminating duplicate links).
The executor still needs such a list, but it can build it during
ExecutorStart since it has to scan the plan tree anyway.
No initdb forced since no stored-on-disk structures changed, but you
will need a full recompile because of node-numbering changes.
for example, an SQL function can be used in a functional index. (I make
no promises about speed, but it'll work ;-).) Clean up and simplify
handling of functions returning sets.
There's now only one transition value and transition function.
NULL handling in aggregates is a lot cleaner. Also, use Numeric
accumulators instead of integer accumulators for sum/avg on integer
datatypes --- this avoids overflow at the cost of being a little slower.
Implement VARIANCE() and STDDEV() aggregates in the standard backend.
Also, enable new LIKE selectivity estimators by default. Unrelated
change, but as long as I had to force initdb anyway...