the marked-for-death flag. It was only set for a fleeting moment while a
transaction was being cleaned up at rollback. All the places that checked
for the rolled-back flag should also check the marked-for-death flag, as
both flags mean that the transaction will roll back. I also renamed the
marked-for-death into "doomed", which is a lot shorter name.
snapshots, like in REINDEX, are basically non-transactional operations. The
DDL operation itself might participate in SSI, but there's separate
functions for that.
Kevin Grittner and Dan Ports, with some changes by me.
The old code creates three separate arrays when only one is needed,
using two different shmem allocation functions for no obvious reason.
It also strangely splits up the initialization of AuxilaryProcs
between the top and bottom of the function to no evident purpose.
Review by Tom Lane.
Even if a flag is modified only by the backend owning the transaction, it's
not safe to modify it without a lock. Another backend might be setting or
clearing a different flag in the flags field concurrently, and that
operation might be lost because setting or clearing a bit in a word is not
atomic.
Make did-write flag a simple backend-private boolean variable, because it
was only set or tested in the owning backend (except when committing a
prepared transaction, but it's not worthwhile to optimize for the case of a
read-only prepared transaction). This also eliminates the need to add
locking where that flag is set.
Also, set the did-write flag when doing DDL operations like DROP TABLE or
TRUNCATE -- that was missed earlier.
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away. We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.
To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends. We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)
This commit fixes a bug in the previous one, which neglected to cleanly
handle the LRU ring that fd.c uses to manage open files, and caused an
unacceptable failure just before beta2 and was thus reverted.
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away. We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.
To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends. We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)
SimpleLruTruncate() a page number that's "in the future", because it will
issue a warning and refuse to truncate anything. Instead, we leave behind
the latest segment. If the slru is not needed before XID wrap-around, the
segment will appear as new again, and not be cleaned up until it gets old
enough again. That's a bit unpleasant, but better than not cleaning up
anything.
Also, fix broken calculation to check and warn if the span of the OldSerXid
SLRU is getting too large to fit in the 64k SLRU pages that we have
available. It was not XID wraparound aware.
Kevin Grittner and me.
Truncating or dropping a table is treated like deletion of all tuples, and
check for conflicts accordingly. If a table is clustered or rewritten by
ALTER TABLE, all predicate locks on the heap are promoted to relation-level
locks, because the tuple or page ids of any existing tuples will change and
won't be valid after rewriting the table. Arguably ALTER TABLE should be
treated like a mass-UPDATE of every row, but if you e.g change the datatype
of a column, you could also argue that it's just a change to the physical
layout, not a logical change. Reindexing promotes all locks on the index to
relation-level lock on the heap.
Kevin Grittner, with a lot of cosmetic changes by me.
For consistency, have all non-ASCII characters from contributors'
names in the source be in UTF-8. But remove some other more
gratuitous uses of non-ASCII characters.
On further analysis, it turns out that it is not needed to duplicate predicate
locks to the new row version at update, the lock on the version that the
transaction saw as visible is enough. However, there was a different bug in
the code that checks for dangerous structures when a new rw-conflict happens.
Fix that bug, and remove all the row-version chaining related code.
Kevin Grittner & Dan Ports, with some comment editorialization by me.
The hash table is seq scanned at transaction end, to release all locks,
and making the hash table larger than necessary makes that slower. With
very simple queries, that overhead can amount to a few percent of the total
CPU time used.
At the moment, backend startup needs 6 locks, and a simple query with one
table and index needs 3 locks. 16 is enough for even quite complicated
transactions, and it will grow automatically if it fills up.
This warning is new in gcc 4.6 and part of -Wall. This patch cleans
up most of the noise, but there are some still warnings that are
trickier to remove.
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.
Also, CreatePredicateLock should initialize commitSeqNo for to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)
I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.
Dan Ports, to fix leak of predicate locks reported by YAMAMOTO Takashi.
Otherwise, the SLRU machinery can get confused and think that the SLRU
has wrapped around. Along the way, regardless of whether we're
truncating all of the SLRU or just some of it, flush pages after
truncating, rather than before.
Kevin Grittner
When a regular lock is held, SSI can use that in lieu of a predicate lock
to detect rw conflicts; but if the regular lock is being taken by a
subtransaction, we can't assume that it'll commit, so releasing the
parent transaction's lock in that case is a no-no.
Kevin Grittner
If we call hash_search() with HASH_ENTER, it will bail out rather than
return NULL, so it's redundant to check for NULL again in the caller.
Thus, in cases where we believe it's impossible for the hash table to run
out of slots anyway, we can simplify the code slightly.
On the flip side, in cases where it's theoretically possible to run out of
space, we don't want to rely on dynahash.c to throw an error; instead,
we pass HASH_ENTER_NULL and throw the error ourselves if a NULL comes
back, so that we can provide a more descriptive error message.
Kevin Grittner
When we release and reacquire SerializableXactHashLock, we must recheck
whether an R/W conflict still needs to be flagged, because it could have
changed under us in the meantime. And when we release the partition
lock, we must re-walk the list of predicate locks from the beginning,
because our pointer could get invalidated under us.
Bug report #5952 by Yamamoto Takashi. Patch by Kevin Grittner.
examining the head of predicate locks list. Also, fix the comment of
RemoveTargetIfNoLongerUsed, it was neglected when we changed the way update
chains are handled.
Kevin Grittner
Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.
race condition where SummarizeOldestCommittedSxact() is called even though
another backend already cleared out all finished sxact entries. That's OK,
RegisterSerializableTransactionInt() can just retry getting a news xact
slot from the available-list when that happens.
Reported by YAMAMOTO Takashi, bug #5918.
contains newly-inserted tuples that according to our OldestXmin are not
yet visible to everyone. The value returned by GetOldestXmin() is conservative,
and it can move backwards on repeated calls, so if we see that contradiction
between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to
assume it's because an earlier vacuum calculated a higher OldestXmin value,
and all the tuples really are visible to everyone.
We have received several reports of this bug, with the "PD_ALL_VISIBLE flag
was incorrectly set in relation ..." warning appearing in logs. We were
finally able to hunt it down with David Gould's help to run extra diagnostics
in an environment where this happened frequently.
Also reword the warning, per Robert Haas' suggestion, to not imply that the
PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom
of corruption on a tuple header.
Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.
This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.
This version has very strict behaviour; more relaxed options
may be added at a later date.
Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
Change the way UPDATEs are handled. Instead of maintaining a chain of
tuple-level locks in shared memory, copy any existing locks on the old
tuple to the new tuple at UPDATE. Any existing page-level lock needs to
be duplicated too, as a lock on the new tuple. That was neglected
previously.
Store xmin on tuple-level predicate locks, to distinguish a lock on an old
already-recycled tuple from a new tuple at the same physical location.
Failure to distinguish them caused loops in the tuple-lock chains, as
reported by YAMAMOTO Takashi. Although we don't use the chain representation
of UPDATEs anymore, it seems like a good idea to store the xmin to avoid
some false positives if no other reason.
CheckSingleTargetForConflictsIn now correctly handles the case where a lock
that's being held is not reflected in the local lock table. That happens
if another backend acquires a lock on our behalf due to an UPDATE or a page
split.
PredicateLockPageCombine now retains locks for the page that is being
removed, rather than removing them. This prevents a potentially dangerous
false-positive inconsistency where the local lock table believes that a lock
is held, but it is actually not.
Dan Ports and Kevin Grittner
They share the same locking namespace with the existing session-level
advisory locks, but they are automatically released at the end of the
current transaction and cannot be released explicitly via unlock
functions.
Marko Tiikkaja, reviewed by me.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.
Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
All retryable conflict errors now have an error code that indicates that
a retry is possible, correcting my incomplete fix of 2010/05/12
Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug
When we need to insert a new entry and the queue is full, compact the
entire queue in the hopes of making room for the new entry. Doing this
on every insertion might worsen contention on BgWriterCommLock, but
when the queue it's full, it's far better than allowing the backend to
perform its own fsync, per testing by Greg Smith as reported in
http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php
Original idea from Greg Smith. Patch by me. Review by Chris Browne
and Greg Smith
In the case where the initial call of systable_getnext_ordered() returned
NULL, this function would nonetheless call it again. That's undefined
behavior that only by chance failed to not give visibly incorrect results.
Put an if-test around the final loop to prevent that, and in passing
improve some comments. No back-patch since there's no actual failure.
Per report from YAMAMOTO Takashi.