is contention for a tuple-level lock. This solves the problem of a
would-be exclusive locker being starved out by an indefinite succession
of share-lockers. Per recent discussion with Alvaro.
Essentially, we shoehorn in a lockable-object-type field by taking
a byte away from the lockmethodid, which can surely fit in one byte
instead of two. This allows less artificial definitions of all the
other fields of LOCKTAG; we can get rid of the special pg_xactlock
pseudo-relation, and also support locks on individual tuples and
general database objects (including shared objects). None of those
possibilities are actually exploited just yet, however.
I removed pg_xactlock from pg_class, but did not force initdb for
that change. At this point, relkind 's' (SPECIAL) is unused and
could be removed entirely.
to eliminate unnecessary deadlocks. This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE. The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets. When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX. This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared. Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.
Alvaro Herrera and Tom Lane.
indexes. Replace all heap_openr and index_openr calls by heap_open
and index_open. Remove runtime lookups of catalog OID numbers in
various places. Remove relcache's support for looking up system
catalogs by name. Bulky but mostly very boring patch ...
avoid encroaching on the 'user' range of OIDs by allowing automatic
OID assignment to use values below 16k until we reach normal operation.
initdb not forced since this doesn't make any incompatible change;
however a lot of stuff will have different OIDs after your next initdb.
PageIndexTupleDelete() with a single pass of compactification ---
logic mostly lifted from PageRepairFragmentation. I noticed while
profiling that a VACUUM that's cleaning up a whole lot of deleted
tuples would spend as much as a third of its CPU time in
PageIndexTupleDelete; not too surprising considering the loop method
was roughly O(N^2) in the number of tuples involved.
to write out data that we are about to tell the filesystem to drop.
smgr_internal_unlink already had a DropRelFileNodeBuffers call to
get rid of dead buffers without a write after it's no longer possible
to roll back the deleting transaction. Adding a similar call in
smgrtruncate simplifies callers and makes the overall division of
labor clearer. This patch removes the former behavior that VACUUM
would write all dirty buffers of a relation unconditionally.
the freelist, plus per-buffer spinlocks that protect access to individual
shared buffer headers. This requires abandoning a global freelist (since
the freelist is a global contention point), which shoots down ARC and 2Q
as well as plain LRU management. Adopt a clock sweep algorithm instead.
Preliminary results show substantial improvement in multi-backend situations.
no held locks. This maintains the invariant that proclocks are present
only for procs that are holding or awaiting a lock; when this is not
true, LockRelease will fail. Per report from Stephen Clouse.
is the minimum required fix. I want to look next at taking advantage of
it by simplifying the message semantics in the shared inval message queue,
but that part can be held over for 8.1 if it turns out too ugly.
share lock on a buffer being written out before releasing BufMgrLock in
the BufferAlloc code path; if we do it later we might block on someone
who's re-pinned the buffer. I believe this is only an issue for BufferAlloc
and not the other places that call FlushBuffer. BufferSync must continue
to do it the old way since it may well be trying to write buffers that
other backends have pinned; but it should not be holding any conflicting
locks. FlushRelationBuffers is okay since it's got exclusive lock at the
relation level.
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
to shared memory as soon as possible, ie, right after read_backend_variables.
The effective difference from the original code is that this happens
before instead of after read_nondefault_variables(), which loads GUC
information and is apparently capable of expanding the backend's memory
allocation more than you'd think it should. This should fix the
failure-to-attach-to-shared-memory reports we've been seeing on Windows.
Also clean up a few bits of unnecessarily grotty EXEC_BACKEND code.
no need for it to be nearly as big as the global hash table, and since
it's not in shared memory it can grow if it does need to be bigger.
By reducing the size, we speed up hash_seq_search(), which saves a
significant fraction of subtransaction entry/exit overhead.
returning a NULL pointer (some callers remembered to check the return
value, but some did not -- it is safer to just bail out).
Also, cleanup pgstat.c to use elog(ERROR) rather than elog(LOG) followed
by exit().
This does not disable the bgwriter process: it still has to wake up often
enough to collect fsync requests from backends in a timely fashion. But
it responds to the recent gripe about not being able to prevent the disk
from being spun up constantly.
pins at end of transaction, and reduce AtEOXact_Buffers to an Assert
cross-check that this was done correctly. When not USE_ASSERT_CHECKING,
AtEOXact_Buffers is a complete no-op. This gets rid of an O(NBuffers)
bottleneck during transaction commit/abort, which recent testing has shown
becomes significant above a few tens of thousands of shared buffers.
(if any) currently waited for by LockBufferForCleanup(), which is all
that we were using it for anymore. Saves some space and eliminates
proportional-to-NBuffers slowdown in UnlockBuffers().
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php.
This fix is intended to be permanent: it moves the responsibility for
calling SetBufferCommitInfoNeedsSave() into the tqual.c routines,
eliminating the requirement for callers to test whether t_infomask changed.
Also, tighten validity checking on buffer IDs in bufmgr.c --- several
routines were paranoid about out-of-range shared buffer numbers but not
about out-of-range local ones, which seems a tad pointless.
pg_subtrans --- what we need is the oldest xmin of any snapshot in use
in the current top transaction. Introduce a new variable TransactionXmin
to play this role. Fixes intermittent regression failure reported by
Neil Conway.
as per recent discussions. Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it. This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that. This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions). Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases. This avoids holding many
unique locks after a long series of subtransactions. The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
Asserts would lead to a server core dump if an error occurred while
trying to abort a failed subtransaction (thereby leading to re-execution
of whatever parts of AbortSubTransaction had already run). This of course
does not prevent such an error from creating an infinite loop, but at
least we don't make the situation worse. Responds to an open item on
the subtransactions to-do list.