Move checking for unscannable matviews into ExecOpenScanRelation, which is
a better place for it first because the open relation is already available
(saving a relcache lookup cycle), and second because this eliminates the
problem of telling the difference between rangetable entries that will or
will not be scanned by the query. In particular we can get rid of the
not-terribly-well-thought-out-or-implemented isResultRel field that the
initial matviews patch added to RangeTblEntry.
Also get rid of entirely unnecessary scannability check in the rewriter,
and a bogus decision about whether RefreshMatViewStmt requires a parse-time
snapshot.
catversion bump due to removal of a RangeTblEntry field, which changes
stored rules.
This provides information about the numbers of tuples that were visited
but not returned by table scans, as well as the numbers of join tuples
that were considered and discarded within a join plan node.
There is still some discussion going on about the best way to report counts
for outer-join situations, but I think most of what's in the patch would
not change if we revise that, so I'm going to go ahead and commit it as-is.
Documentation changes to follow (they weren't in the submitted patch
either).
Marko Tiikkaja, reviewed by Marc Cousin, somewhat revised by Tom
It's been like this since HOT was originally introduced, but the logic
is complex enough that this is a recipe for bugs, as we've already
found out with SSI. So refactor heap_hot_search_buffer() so that it
can satisfy the needs of index_getnext(), and make index_getnext() use
that rather than duplicating the logic.
This change was originally proposed by Heikki Linnakangas as part of a
larger refactoring oriented towards allowing index-only scans. I
extracted and adjusted this part, since it seems to have independent
merit. Review by Jeff Davis.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
relation using the general PARAM_EXEC executor parameter mechanism, rather
than the ad-hoc kluge of passing the outer tuple down through ExecReScan.
The previous method was hard to understand and could never be extended to
handle parameters coming from multiple join levels. This patch doesn't
change the set of possible plans nor have any significant performance effect,
but it's necessary infrastructure for future generalization of the concept
of an inner indexscan plan.
ExecReScan's second parameter is now unused, so it's removed.
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
the same page we are nanoseconds away from reading for real. There should be
something left to do on the current page before we consider issuing a prefetch.
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.
(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)
Greg Stark
bitmap. This is extracted from Greg Stark's posix_fadvise patch; it seems
worth committing separately, since it's potentially useful independently of
posix_fadvise.
GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not
depend on the latter being correctly set elsewhere, and while it is more
expensive, this code path is not performance-critical. This is a real
risk for autovacuum, because it can execute whole cycles without doing
a single vacuum, which would mean that RecentGlobalXmin would stay at its
initialization value, FirstNormalTransactionId, causing a bogus value to be
inserted in pg_database. This bug could explain some recent reports of
failure to truncate pg_clog.
At the same time, change the initialization of RecentGlobalXmin to
InvalidTransactionId, and ensure that it's set to something else whenever
it's going to be used. Using it as FirstNormalTransactionId in HOT page
pruning could incur in data loss. InitPostgres takes care of setting it
to a valid value, but the extra checks are there to prevent "special"
backends from behaving in unusual ways.
Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs. This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.
In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited. This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries. There might be other uses in future, but that one
alone is sufficient reason.
Heikki Linnakangas, with some adjustments by me.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time). This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len. The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
pseudo HeapScanDesc created for a bitmap heap scan. This avoids some useless
overhead during a bitmap scan startup, in particular invoking the syncscan
code. (We might someday want to do that, but right now it's merely useless
contention for shared memory, to say nothing of possibly pushing useful
entries out of syncscan's small LRU list.) This also allows elimination of
ugly pgstat_discount_heap_scan() kluge.
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.
Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures. And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
ps_TupFromTlist in plan nodes that make use of it. This was being done
correctly in join nodes and Result nodes but not in any relation-scan nodes.
Bug would lead to bogus results if a set-returning function appeared in the
targetlist of a subquery that could be rescanned after partial execution,
for example a subquery within EXISTS(). Bug has been around forever :-(
... surprising it wasn't reported before.
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries. The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient. Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
any use in the past many years, we'd have made some effort to include
them in all executor node types; but in fact they were only in
nodeAppend.c and nodeIndexscan.c, up until I copied nodeIndexscan.c's
occurrence into the new bitmap node types. Remove some other unused
macros in execdebug.h, too. Some day the whole header probably ought to
go away in favor of better-designed facilities.
bits indicating which optional capabilities can actually be exercised
at runtime. This will allow Sort and Material nodes, and perhaps later
other nodes, to avoid unnecessary overhead in common cases.
This commit just adds the infrastructure and arranges to pass the correct
flag values down to plan nodes; none of the actual optimizations are here
yet. I'm committing this separately in case anyone wants to measure the
added overhead. (It should be negligible.)
Simon Riggs and Tom Lane
relation if it's already been locked by execMain.c as either a result
relation or a FOR UPDATE/SHARE relation. This avoids an extra trip to
the shared lock manager state. Per my suggestion yesterday.
child plan nodes until we have acquired lock on the relation to scan.
The relative order of initialization of plan nodes isn't real important in
other cases, but it's critical here because one is supposed to lock a
relation before its indexes, not vice versa. The original coding was at
least vulnerable to deadlock against DROP INDEX, and perhaps worse things.
when we first read the page, rather than checking them one at a time.
This allows us to take and release the buffer content lock just once
per page, instead of once per tuple. Since it's a shared lock the
contention penalty for holding the lock longer shouldn't be too bad.
We can safely do this only when using an MVCC snapshot; else the
assumption that visibility won't change over time is uncool. Therefore
there are now two code paths depending on the snapshot type. I also
made the same change in nodeBitmapHeapscan.c, where it can be done always
because we only support MVCC snapshots for bitmap scans anyway.
Also make some incidental cleanups in the APIs of these functions.
Per a suggestion from Qingqing Zhou.
a TupleTableSlot: instead of calling ExecClearTuple, inline the needed
operations, so that we can avoid redundant steps. In particular, when
the old and new tuples are both on the same disk page, avoid releasing
and re-acquiring the buffer pin --- this saves work in both the bufmgr
and ResourceOwner modules. To make this improvement actually useful,
partially revert a change I made on 2004-04-21 that caused SeqNext
et al to call ExecClearTuple before ExecStoreTuple. The motivation
for that, to avoid grabbing the BufMgrLock separately for releasing
the old buffer and grabbing the new one, no longer applies. My
profiling says that this saves about 5% of the CPU time for an
all-in-memory seqscan.