relation using the general PARAM_EXEC executor parameter mechanism, rather
than the ad-hoc kluge of passing the outer tuple down through ExecReScan.
The previous method was hard to understand and could never be extended to
handle parameters coming from multiple join levels. This patch doesn't
change the set of possible plans nor have any significant performance effect,
but it's necessary infrastructure for future generalization of the concept
of an inner indexscan plan.
ExecReScan's second parameter is now unused, so it's removed.
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
a user-supplied TID is out of range for the relation. This is needed to
preserve compatibility with our pre-8.3 behavior, and it is sensible anyway
since if the query were implemented by brute force rather than optimized
into a TidScan, the behavior for a non-existent TID would be zero rows out,
never an error. Per gripe from Gurjeet Singh.
then-delete on the current cursor row. The basic fix is that nodeTidscan.c
has to apply heap_get_latest_tid() to the current-scan-TID obtained from the
cursor query; this ensures we get the latest row version to work with.
However, since that only works if the query plan is a TID scan, we also have
to hack the planner to make sure only that type of plan will be selected.
(Formerly, the planner might decide to apply a seqscan if the table is very
small. This change is probably a Good Thing anyway, since it's hard to see
how a seqscan could really win.) That means the execQual.c code to support
CurrentOfExpr as a regular expression type is dead code, so replace it with
just an elog(). Also, add regression tests covering these cases. Note
that the added tests expose the fact that re-fetching an updated row
misbehaves if the cursor used FOR UPDATE. That's an independent bug that
should be fixed later. Per report from Dharmendra Goyal.
with a plpgsql-defined cursor. The underlying mechanism for this is that the
main SQL engine will now take "WHERE CURRENT OF $n" where $n is a refcursor
parameter. Not sure if we should document that fact or consider it an
implementation detail. Per discussion with Pavel Stehule.
Along the way, allow FOR UPDATE in non-WITH-HOLD cursors; there may once
have been a reason to disallow that, but it seems to work now, and it's
really rather necessary if you want to select a row via a cursor and then
update it in a concurrent-safe fashion.
Original patch by Arul Shaji, rather heavily editorialized by Tom Lane.
ps_TupFromTlist in plan nodes that make use of it. This was being done
correctly in join nodes and Result nodes but not in any relation-scan nodes.
Bug would lead to bogus results if a set-returning function appeared in the
targetlist of a subquery that could be rescanned after partial execution,
for example a subquery within EXISTS(). Bug has been around forever :-(
... surprising it wasn't reported before.
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries. The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient. Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
bits indicating which optional capabilities can actually be exercised
at runtime. This will allow Sort and Material nodes, and perhaps later
other nodes, to avoid unnecessary overhead in common cases.
This commit just adds the infrastructure and arranges to pass the correct
flag values down to plan nodes; none of the actual optimizations are here
yet. I'm committing this separately in case anyone wants to measure the
added overhead. (It should be negligible.)
Simon Riggs and Tom Lane
relation if it's already been locked by execMain.c as either a result
relation or a FOR UPDATE/SHARE relation. This avoids an extra trip to
the shared lock manager state. Per my suggestion yesterday.
"ctid IN (list)" will still work after we convert IN to ScalarArrayOpExpr.
Make some minor efficiency improvements while at it, such as ensuring that
multiple TIDs are fetched in physical heap order. And fix EXPLAIN so that
it shows what's really going on for a TID scan.
a TupleTableSlot: instead of calling ExecClearTuple, inline the needed
operations, so that we can avoid redundant steps. In particular, when
the old and new tuples are both on the same disk page, avoid releasing
and re-acquiring the buffer pin --- this saves work in both the bufmgr
and ResourceOwner modules. To make this improvement actually useful,
partially revert a change I made on 2004-04-21 that caused SeqNext
et al to call ExecClearTuple before ExecStoreTuple. The motivation
for that, to avoid grabbing the BufMgrLock separately for releasing
the old buffer and grabbing the new one, no longer applies. My
profiling says that this saves about 5% of the CPU time for an
all-in-memory seqscan.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
In the past, we used a 'Lispy' linked list implementation: a "list" was
merely a pointer to the head node of the list. The problem with that
design is that it makes lappend() and length() linear time. This patch
fixes that problem (and others) by maintaining a count of the list
length and a pointer to the tail node along with each head node pointer.
A "list" is now a pointer to a structure containing some meta-data
about the list; the head and tail pointers in that structure refer
to ListCell structures that maintain the actual linked list of nodes.
The function names of the list API have also been changed to, I hope,
be more logically consistent. By default, the old function names are
still available; they will be disabled-by-default once the rest of
the tree has been updated to use the new API names.
the next are handled by ReleaseAndReadBuffer rather than separate
ReleaseBuffer and ReadBuffer calls. This cuts the number of acquisitions
of the BufMgrLock by a factor of 2 (possibly more, if an indexscan happens
to pull successive rows from the same heap page). Unfortunately this
doesn't seem enough to get us out of the recently discussed context-switch
storm problem, but it's surely worth doing anyway.
locParam lists can be converted to bitmapsets to speed updating. Also,
replace 'locParam' with 'allParam', which contains all the paramIDs
relevant to the node (i.e., the union of extParam and locParam); this
saves a step during SetChangedParamList() without costing anything
elsewhere.
nodes where it's not really necessary. In many cases where the scan node
is not the topmost plan node (eg, joins, aggregation), it's possible to
just return the table tuple directly instead of generating an intermediate
projection tuple. In preliminary testing, this reduced the CPU time
needed for 'SELECT COUNT(*) FROM foo' by about 10%.
a per-query memory context created by CreateExecutorState --- and destroyed
by FreeExecutorState. This provides a final solution to the longstanding
problem of memory leaked by various ExecEndNode calls.
execution state trees, and ExecEvalExpr takes an expression state tree
not an expression plan tree. The plan tree is now read-only as far as
the executor is concerned. Next step is to begin actually exploiting
this property.
to plan nodes, not vice-versa. All executor state nodes now inherit from
struct PlanState. Copying of plan trees has been simplified by not
storing a list of SubPlans in Plan nodes (eliminating duplicate links).
The executor still needs such a list, but it can build it during
ExecutorStart since it has to scan the plan tree anyway.
No initdb forced since no stored-on-disk structures changed, but you
will need a full recompile because of node-numbering changes.
transaction, so as to avoid returning them out of the index AM. Saves
repeated heap_fetch operations on frequently-updated rows. Also detect
queries on unique keys (equality to all columns of a unique index), and
don't bother continuing scan once we have found first match.
Killing is implemented in the btree and hash AMs, but not yet in rtree
or gist, because there isn't an equally convenient place to do it in
those AMs (the outer amgetnext routine can't do it without re-pinning
the index page).
Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
index_insert to make this a little easier.
Improve 'pg_internal.init' relcache entry preload mechanism so that it is
safe to use for all system catalogs, and arrange to preload a realistic
set of system-catalog entries instead of only the three nailed-in-cache
indexes that were formerly loaded this way. Fix mechanism for deleting
out-of-date pg_internal.init files: this must be synchronized with transaction
commit, not just done at random times within transactions. Drive it off
relcache invalidation mechanism so that no special-case tests are needed.
Cache additional information in relcache entries for indexes (their pg_index
tuples and index-operator OIDs) to eliminate repeated lookups. Also cache
index opclass info at the per-opclass level to avoid repeated lookups during
relcache load.
Generalize 'systable scan' utilities originally developed by Hiroshi,
move them into genam.c, use in a number of places where there was formerly
ugly code for choosing either heap or index scan. In particular this allows
simplification of the logic that prevents infinite recursion between syscache
and relcache during startup: we can easily switch to heapscans in relcache.c
when and where needed to avoid recursion, so IndexScanOK becomes simpler and
does not need any expensive initialization.
Eliminate useless opening of a heapscan data structure while doing an indexscan
(this saves an mdnblocks call and thus at least one kernel call).
inner indexscan (ie, one with runtime keys). ExecIndexReScan must
compute or recompute runtime keys even if we are rescanning in the
EPQ case. TidScan seems to have comparable problems. Per bug
noted by Barry Lind 11-Feb-02.