Locking tuples
--------------

Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the lock objects in shared memory.
To work around this limitation, we use a two-level mechanism.  The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple.  When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
arbitrarily large numbers of tuples being locked simultaneously.
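
As a rough illustration of the first level, here is a minimal C sketch of
recording a single locker in the tuple header.  The HeapTupleHeader fields
and HEAP_* macros are the real ones from access/htup_details.h, but the
function itself is hypothetical and omits everything the real code must do
(buffer locking, WAL logging, visibility checks, multixact handling):

    #include "postgres.h"
    #include "access/htup_details.h"

    /* Hypothetical: mark a tuple as exclusively locked by xid. */
    static void
    record_single_locker(HeapTupleHeader tup, TransactionId xid)
    {
        /* clear any previous xmax-related infomask bits */
        tup->t_infomask &= ~HEAP_XMAX_BITS;
        /* a locker, not a deleter; strength here is "exclusive" */
        tup->t_infomask |= HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_EXCL_LOCK;
        /* the locker's XID goes in XMAX */
        HeapTupleHeaderSetXmax(tup, xid);
    }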

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX.  However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters.  The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever.  To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above.  The
protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().  However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table.  Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.  However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.
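
In C, the same protocol looks roughly like the sketch below.  LockTuple(),
UnlockTuple() and XactLockTableWait() are the real lock-manager routines
(storage/lmgr.h); the surrounding function is hypothetical, and it glosses
over the multixact case and the retry loop the real heap code needs:

    #include "postgres.h"
    #include "storage/lmgr.h"

    /* Hypothetical: wait for the current holder of a tuple lock. */
    static void
    wait_for_tuple_lock(Relation rel, ItemPointer tid, TransactionId xwait)
    {
        /* second level: queue behind other waiters in the lock manager */
        LockTuple(rel, tid, ExclusiveLock);
        /* first level: sleep until the XMAX transaction ends */
        XactLockTableWait(xwait, rel, tid, XLTW_Lock);
        /* ... now mark the tuple as locked by me ... */
        UnlockTuple(rel, tid, ExclusiveLock);
    }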

We make an exception to the above rule for those lockers that already hold
some lock on a tuple and attempt to acquire a stronger one on it.  In that
case, we skip the LockTuple() call even when there are conflicts, provided
that the target tuple is being locked, updated or deleted by multiple sessions
concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
session that was first to record its weaker lock in the tuple header and would
be waiting on the LockTuple() call to upgrade to the stronger lock level, and
another session that has already done LockTuple() and is waiting for the first
session transaction to release its tuple header-level lock.

We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
exclusive lock which prevents any kind of modification of the tuple.  This is
the lock level that is implicitly taken by DELETE operations, and also by
UPDATE operations if they modify any of the tuple's key fields.  SELECT FOR NO
KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
and modifications which might alter the tuple's key.  This is the lock that is
implicitly taken by UPDATE operations which leave all key fields unchanged.
SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
modification.  Finally, SELECT FOR KEY SHARE obtains a shared lock which only
prevents tuple removal and modifications of key fields.  This lock level is
just strong enough to implement RI checks, i.e. it ensures that tuples do not
go away from under a check, without blocking transactions that want to update
the tuple without changing its key.

The conflict table is:

               UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
UPDATE         conflict     conflict         conflict     conflict
NO KEY UPDATE  conflict     conflict         conflict
SHARE          conflict     conflict
KEY SHARE      conflict
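
Because each strength conflicts with every strength from some threshold
upward, the triangular table above can be encoded very compactly.  The sketch
below is not how the server implements it (heapam maps tuple lock modes onto
regular heavyweight lock modes); the enum and helper are purely illustrative:

    #include <stdbool.h>

    /* Tuple lock strengths, weakest to strongest. */
    typedef enum
    {
        LOCK_KEY_SHARE,     /* SELECT FOR KEY SHARE */
        LOCK_SHARE,         /* SELECT FOR SHARE */
        LOCK_NO_KEY_UPDATE, /* SELECT FOR NO KEY UPDATE */
        LOCK_UPDATE         /* SELECT FOR UPDATE */
    } TupleLockStrength;

    /*
     * Two strengths conflict exactly when their combined rank reaches
     * that of the KEY SHARE/UPDATE pair.  This reproduces the table:
     * UPDATE conflicts with everything, NO KEY UPDATE with everything
     * but KEY SHARE, SHARE with the two exclusive strengths, and
     * KEY SHARE only with UPDATE.
     */
    static bool
    tuple_locks_conflict(TupleLockStrength a, TupleLockStrength b)
    {
        return (int) a + (int) b >= LOCK_KEY_SHARE + LOCK_UPDATE;
    }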

When there is a single locker in a tuple, we can just store the locking info
in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength.  There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits.  Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flags bits for each one.
The flags are currently used to store the locking strength of each member
transaction.  (The flags also distinguish a pure locker from an updater.)
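
In the code, the member array is exposed as MultiXactMember, a struct pairing
each Xid with a MultiXactStatus flag value.  A rough reader-side sketch using
the real GetMultiXactIdMembers() from access/multixact.h (note that its exact
signature has varied a little across major versions):

    #include "postgres.h"
    #include "access/multixact.h"

    /* Hypothetical: log each member of a MultiXact and its flags. */
    static void
    show_multixact_members(MultiXactId multi)
    {
        MultiXactMember *members;
        int         nmembers;

        nmembers = GetMultiXactIdMembers(multi, &members, false, false);
        for (int i = 0; i < nmembers; i++)
            elog(LOG, "member xid %u, status %d",
                 members[i].xid, (int) members[i].status);
        if (nmembers > 0)
            pfree(members);
    }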

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions.  This is no longer the case; a
MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish.  It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted.  So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
The lower bound used by vacuum (that is, the value below which all multixacts
are removed) is stored as pg_class.relminmxid for each table; the minimum of
all such values is stored in pg_database.datminmxid.  The minimum across
all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
removes pg_multixact/ segments older than that value once the checkpoint
record has been flushed.

Infomask Bits
-------------

The following infomask bits are applicable:

- HEAP_XMAX_INVALID
  Any tuple with this bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
  regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
  This bit is set when the XMAX is a locker only; that is, if it's a
  multixact, it does not contain an update among its members.  It's set when
  the XMAX is a plain Xid that locked the tuple, as well.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_SHR_LOCK
- HEAP_XMAX_EXCL_LOCK
  These bits indicate the strength of the lock acquired; they are useful when
  the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
  the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
  is set, then one of these *must* be set as well.

  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
  FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

- HEAP_KEYS_UPDATED
  This bit lives in t_infomask2.  If set, indicates that the operation(s) done
  by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE
  that modifies the columns of the key, or a DELETE.  It's set regardless of
  whether the XMAX is a TransactionId or a MultiXactId.

We currently never set HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.
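
Putting these bits together, a reader can classify a tuple's XMAX roughly as
in the sketch below.  The macros are the real ones from access/htup_details.h
(HEAP_XMAX_SHR_LOCK is the combination of the EXCL and KEYSHR bits), but the
function is a hypothetical helper that ignores hint-bit maintenance:

    #include "postgres.h"
    #include "access/htup_details.h"

    /* Hypothetical: describe what a tuple's XMAX means. */
    static const char *
    describe_xmax(HeapTupleHeader tup)
    {
        uint16      infomask = tup->t_infomask;

        if (infomask & HEAP_XMAX_INVALID)
            return "no valid XMAX";
        if (infomask & HEAP_XMAX_IS_MULTI)
            return "MultiXactId; strength is in the member flags";
        if (infomask & HEAP_XMAX_LOCK_ONLY)
        {
            if ((infomask & HEAP_XMAX_SHR_LOCK) == HEAP_XMAX_SHR_LOCK)
                return "share-locked";
            if (infomask & HEAP_XMAX_EXCL_LOCK)
                return (tup->t_infomask2 & HEAP_KEYS_UPDATED)
                    ? "locked FOR UPDATE"
                    : "locked FOR NO KEY UPDATE";
            return "locked FOR KEY SHARE";
        }
        return "updated or deleted by XMAX";
    }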

Locking to write inplace-updated tables
---------------------------------------

If IsInplaceUpdateRelation() returns true for a table, the table is a system
catalog that receives systable_inplace_update_begin() calls.  Preparing a
heap_update() of these tables follows additional locking rules, to ensure we
don't lose the effects of an inplace update.  In particular, consider a moment
when a backend has fetched the old tuple to modify, not yet having called
heap_update().  Another backend's inplace update starting then can't conclude
until the heap_update() places its new tuple in a buffer.  We enforce that
using locktags as follows.  While DDL code is the main audience, the executor
follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
are per-catalog:

pg_class systable_inplace_update_begin() callers: before the call, acquire a
lock on the relation in mode ShareUpdateExclusiveLock or stricter.  If the
update targets a row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX),
that lock must be on the table.  Locking the index rel is not necessary.
(This allows VACUUM to overwrite per-index pg_class while holding a lock on
the table alone.)  systable_inplace_update_begin() acquires and releases
LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for ExclusiveLock, on each
tuple it overwrites.

pg_class heap_update() callers: before copying the tuple to modify, take a
lock on the tuple, a ShareUpdateExclusiveLock on the relation, or a
ShareRowExclusiveLock or stricter on the relation.

SearchSysCacheLocked1() is one convenient way to acquire the tuple lock.
Most heap_update() callers already hold a suitable lock on the relation for
other reasons and can skip the tuple lock.  If you do acquire the tuple
lock, release it immediately after the update.
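
A sketch of a pg_class heap_update() caller following these rules.
LockTuple(), UnlockTuple(), InplaceUpdateTupleLock, heap_copytuple() and
CatalogTupleUpdate() are real (storage/lmgr.h, storage/lockdefs.h,
access/htup_details.h, catalog/indexing.h), while the function itself and its
have_strong_rel_lock flag are hypothetical:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "catalog/indexing.h"
    #include "storage/lmgr.h"

    /* Hypothetical: update one pg_class row under the tuple-lock rule. */
    static void
    update_pg_class_row(Relation pg_class, HeapTuple cachetup,
                        bool have_strong_rel_lock)
    {
        HeapTuple   newtup;

        /* take the tuple lock before copying the tuple to modify */
        if (!have_strong_rel_lock)
            LockTuple(pg_class, &cachetup->t_self, InplaceUpdateTupleLock);

        newtup = heap_copytuple(cachetup);
        /* ... modify newtup's columns here ... */
        CatalogTupleUpdate(pg_class, &newtup->t_self, newtup);

        /* release the tuple lock immediately after the update */
        if (!have_strong_rel_lock)
            UnlockTuple(pg_class, &cachetup->t_self, InplaceUpdateTupleLock);
    }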

pg_database: before copying the tuple to modify, all updaters of pg_database
rows acquire LOCKTAG_TUPLE.  (Few updaters acquire LOCKTAG_OBJECT on the
database OID, so it wasn't worth extending that as a second option.)

Ideally, DDL might want to perform permissions checks before LockTuple(), as
we do with RangeVarGetRelidExtended() callbacks.  We typically don't bother.
LOCKTAG_TUPLE acquirers release it after each row, so the potential
inconvenience is lower.

Reading inplace-updated columns
-------------------------------

Inplace updates create an exception to the rule that tuple data won't change
under a reader holding a pin.  A reader of a heap_fetch() result tuple may
witness a torn read.  Current inplace-updated fields are aligned and are no
wider than four bytes, and current readers don't need consistency across
fields.  Hence, they get by with just fetching each field once.
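
For example, a reader of pg_class following this rule fetches relfrozenxid
with a single aligned four-byte read, as in this hypothetical helper (GETSTRUCT
and Form_pg_class are real, from access/htup_details.h and catalog/pg_class.h;
the function is illustrative only):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "catalog/pg_class.h"

    /* Hypothetical: read an inplace-updated field exactly once. */
    static TransactionId
    get_relfrozenxid_once(HeapTuple reltup)
    {
        /*
         * One aligned four-byte fetch; re-reading the field later could
         * observe a different value, since an inplace updater may change
         * it while we hold only a pin.
         */
        return ((Form_pg_class) GETSTRUCT(reltup))->relfrozenxid;
    }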