
Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reported by Smolkin Grigory.  Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.

Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
Author: Noah Misch
Date:   2024-09-24 15:25:18 -07:00
parent 41e0ba33d5
commit 8590c942c1
13 changed files with 484 additions and 94 deletions

src/backend/access/heap/README.tuplock

@@ -153,3 +153,14 @@ The following infomask bits are applicable:
We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.

Reading inplace-updated columns
-------------------------------
Inplace updates create an exception to the rule that tuple data won't change
under a reader holding a pin. A reader of a heap_fetch() result tuple may
witness a torn read. Current inplace-updated fields are aligned and are no
wider than four bytes, and current readers don't need consistency across
fields. Hence, they get by with just fetching each field once. XXX such a
caller may also read a value that has not reached WAL; see
systable_inplace_update_finish().
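
To make the rule above concrete, here is a minimal illustrative reader (not part of the patch) that fetches an inplace-updated field, pg_class.relfrozenxid, with a single aligned 4-byte read into a local variable and then uses only that copy. The helper name read_relfrozenxid_once() is hypothetical.

#include "postgres.h"

#include "access/htup_details.h"
#include "catalog/pg_class.h"

/*
 * Hypothetical reader following the rule above: one aligned, <= 4-byte fetch
 * of the inplace-updated field, then work only with the local copy.  Do not
 * re-read classForm->relfrozenxid and expect the same answer.
 */
static TransactionId
read_relfrozenxid_once(HeapTuple reltup)
{
    Form_pg_class classForm = (Form_pg_class) GETSTRUCT(reltup);
    TransactionId frozenxid = classForm->relfrozenxid;  /* single read */

    return frozenxid;
}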

src/backend/access/heap/heapam.c

@@ -5991,23 +5991,245 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
}

/*
 * heap_inplace_update - update a tuple "in place" (ie, overwrite it)
 *
 * Overwriting violates both MVCC and transactional safety, so the uses
 * of this function in Postgres are extremely limited.  Nonetheless we
 * find some places to use it.
 *
 * The tuple cannot change size, and therefore it's reasonable to assume
 * that its null bitmap (if any) doesn't change either.  So we just
 * overwrite the data portion of the tuple without touching the null
 * bitmap or any of the header fields.
 *
 * tuple is an in-memory tuple structure containing the data to be written
 * over the target tuple.  Also, tuple->t_self identifies the target tuple.
 *
 * Note that the tuple updated here had better not come directly from the
 * syscache if the relation has a toast relation as this tuple could
 * include toast values that have been expanded, causing a failure here.
 */

/*
 * heap_inplace_lock - protect inplace update from concurrent heap_update()
 *
 * Evaluate whether the tuple's state is compatible with a no-key update.
 * Current transaction rowmarks are fine, as is KEY SHARE from any
 * transaction.  If compatible, return true with the buffer exclusive-locked,
 * and the caller must release that by calling
 * heap_inplace_update_and_unlock(), calling heap_inplace_unlock(), or raising
 * an error.  Otherwise, return false after blocking transactions, if any,
 * have ended.
 *
 * Since this is intended for system catalogs and SERIALIZABLE doesn't cover
 * DDL, this doesn't guarantee any particular predicate locking.
 *
 * One could modify this to return true for tuples with delete in progress.
 * All inplace updaters take a lock that conflicts with DROP.  If explicit
 * "DELETE FROM pg_class" is in progress, we'll wait for it like we would an
 * update.
 *
 * Readers of inplace-updated fields expect changes to those fields are
 * durable.  For example, vac_truncate_clog() reads datfrozenxid from
 * pg_database tuples via catalog snapshots.  A future snapshot must not
 * return a lower datfrozenxid for the same database OID (lower in the
 * FullTransactionIdPrecedes() sense).  We achieve that since no update of a
 * tuple can start while we hold a lock on its buffer.  In cases like
 * BEGIN;GRANT;CREATE INDEX;COMMIT we're inplace-updating a tuple visible only
 * to this transaction.  ROLLBACK then is one case where it's okay to lose
 * inplace updates.  (Restoring relhasindex=false on ROLLBACK is fine, since
 * any concurrent CREATE INDEX would have blocked, then inplace-updated the
 * committed tuple.)
 *
 * In principle, we could avoid waiting by overwriting every tuple in the
 * updated tuple chain.  Reader expectations permit updating a tuple only if
 * it's aborted, is the tail of the chain, or we already updated the tuple
 * referenced in its t_ctid.  Hence, we would need to overwrite the tuples in
 * order from tail to head.  That would imply either (a) mutating all tuples
 * in one critical section or (b) accepting a chance of partial completion.
 * Partial completion of a relfrozenxid update would have the weird
 * consequence that the table's next VACUUM could see the table's relfrozenxid
 * move forward between vacuum_get_cutoffs() and finishing.
 */
bool
heap_inplace_lock(Relation relation,
                  HeapTuple oldtup_ptr, Buffer buffer)
{
    HeapTupleData oldtup = *oldtup_ptr; /* minimize diff vs. heap_update() */
    TM_Result   result;
    bool        ret;

    Assert(BufferIsValid(buffer));

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /*----------
     * Interpret HeapTupleSatisfiesUpdate() like heap_update() does, except:
     *
     * - wait unconditionally
     * - no tuple locks
     * - don't recheck header after wait: simpler to defer to next iteration
     * - don't try to continue even if the updater aborts: likewise
     * - no crosscheck
     */
    result = HeapTupleSatisfiesUpdate(&oldtup, GetCurrentCommandId(false),
                                      buffer);

    if (result == TM_Invisible)
    {
        /* no known way this can happen */
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg_internal("attempted to overwrite invisible tuple")));
    }
    else if (result == TM_SelfModified)
    {
        /*
         * CREATE INDEX might reach this if an expression is silly enough to
         * call e.g. SELECT ... FROM pg_class FOR SHARE.  C code of other SQL
         * statements might get here after a heap_update() of the same row, in
         * the absence of an intervening CommandCounterIncrement().
         */
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("tuple to be updated was already modified by an operation triggered by the current command")));
    }
    else if (result == TM_BeingModified)
    {
        TransactionId xwait;
        uint16      infomask;

        xwait = HeapTupleHeaderGetRawXmax(oldtup.t_data);
        infomask = oldtup.t_data->t_infomask;

        if (infomask & HEAP_XMAX_IS_MULTI)
        {
            LockTupleMode lockmode = LockTupleNoKeyExclusive;
            MultiXactStatus mxact_status = MultiXactStatusNoKeyUpdate;
            int         remain;
            bool        current_is_member;

            if (DoesMultiXactIdConflict((MultiXactId) xwait, infomask,
                                        lockmode, &current_is_member))
            {
                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                ret = false;
                MultiXactIdWait((MultiXactId) xwait, mxact_status, infomask,
                                relation, &oldtup.t_self, XLTW_Update,
                                &remain);
            }
            else
                ret = true;
        }
        else if (TransactionIdIsCurrentTransactionId(xwait))
            ret = true;
        else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
            ret = true;
        else
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            ret = false;
            XactLockTableWait(xwait, relation, &oldtup.t_self,
                              XLTW_Update);
        }
    }
    else
    {
        ret = (result == TM_Ok);
        if (!ret)
        {
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        }
    }

    /*
     * GetCatalogSnapshot() relies on invalidation messages to know when to
     * take a new snapshot.  COMMIT of xwait is responsible for sending the
     * invalidation.  We're not acquiring heavyweight locks sufficient to
     * block if not yet sent, so we must take a new snapshot to ensure a later
     * attempt has a fair chance.  While we don't need this if xwait aborted,
     * don't bother optimizing that.
     */
    if (!ret)
        InvalidateCatalogSnapshot();
    return ret;
}
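
For orientation, here is a minimal sketch (not the in-core caller) of the retry contract documented in heap_inplace_lock()'s header comment: loop until the function returns true with the buffer exclusive-locked, then either write-and-unlock or just unlock. The real loop, which re-runs a catalog scan on each retry, is systable_inplace_update_begin() in genam.c below; the inplace_refetch_fn and inplace_mutate_fn callbacks and the inplace_write_or_skip() name are hypothetical, and the mutator must not change the tuple's length.

#include "postgres.h"

#include "access/heapam.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* hypothetical callback: refetch the target row, returning it with its buffer pinned */
typedef HeapTuple (*inplace_refetch_fn) (Relation rel, Buffer *buf);
/* hypothetical callback: mutate fixed-width fields of a copy; return true if anything changed */
typedef bool (*inplace_mutate_fn) (HeapTuple tup);

static void
inplace_write_or_skip(Relation rel, inplace_refetch_fn refetch,
                      inplace_mutate_fn mutate)
{
    for (;;)
    {
        Buffer      buf;
        HeapTuple   oldtup = refetch(rel, &buf);

        if (heap_inplace_lock(rel, oldtup, buf))
        {
            /* buffer is exclusive-locked; mutate a same-size copy, then write it back */
            HeapTuple   newtup = heap_copytuple(oldtup);

            if (mutate(newtup))
                heap_inplace_update_and_unlock(rel, oldtup, newtup, buf);
            else
                heap_inplace_unlock(rel, oldtup, buf);
            heap_freetuple(newtup);
            ReleaseBuffer(buf);     /* drop the pin taken by refetch() */
            return;
        }

        /*
         * false: heap_inplace_lock() already dropped the buffer lock and
         * waited for the blocking transaction(s) to end; refetch and retry.
         */
        ReleaseBuffer(buf);
    }
}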

/*
 * heap_inplace_update_and_unlock - core of systable_inplace_update_finish
 *
 * The tuple cannot change size, and therefore its header fields and null
 * bitmap (if any) don't change either.
 */
void
heap_inplace_update_and_unlock(Relation relation,
                               HeapTuple oldtup, HeapTuple tuple,
                               Buffer buffer)
{
    HeapTupleHeader htup = oldtup->t_data;
    uint32      oldlen;
    uint32      newlen;

    Assert(ItemPointerEquals(&oldtup->t_self, &tuple->t_self));
    oldlen = oldtup->t_len - htup->t_hoff;
    newlen = tuple->t_len - tuple->t_data->t_hoff;
    if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
        elog(ERROR, "wrong tuple length");

    /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();

    memcpy((char *) htup + htup->t_hoff,
           (char *) tuple->t_data + tuple->t_data->t_hoff,
           newlen);

    /*----------
     * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
     *
     * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
     * ["R" is a VACUUM tbl]
     * D: vac_update_datfrozenxid() -> systable_beginscan(pg_class)
     * D: systable_getnext() returns pg_class tuple of tbl
     * R: memcpy() into pg_class tuple of tbl
     * D: raise pg_database.datfrozenxid, XLogInsert(), finish
     * [crash]
     * [recovery restores datfrozenxid w/o relfrozenxid]
     */

    MarkBufferDirty(buffer);

    /* XLOG stuff */
    if (RelationNeedsWAL(relation))
    {
        xl_heap_inplace xlrec;
        XLogRecPtr  recptr;

        xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);

        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);

        /* inplace updates aren't decoded atm, don't log the origin */

        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);

        PageSetLSN(BufferGetPage(buffer), recptr);
    }

    END_CRIT_SECTION();

    heap_inplace_unlock(relation, oldtup, buffer);

    /*
     * Send out shared cache inval if necessary.  Note that because we only
     * pass the new version of the tuple, this mustn't be used for any
     * operations that could change catcache lookup keys.  But we aren't
     * bothering with index updates either, so that's true a fortiori.
     *
     * XXX ROLLBACK discards the invalidation.  See test inplace-inval.spec.
     */
    if (!IsBootstrapProcessingMode())
        CacheInvalidateHeapTuple(relation, tuple, NULL);
}

/*
 * heap_inplace_unlock - reverse of heap_inplace_lock
 */
void
heap_inplace_unlock(Relation relation,
                    HeapTuple oldtup, Buffer buffer)
{
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}

/*
 * heap_inplace_update - deprecated
 *
 * This exists only to keep modules working in back branches.  Affected
 * modules should migrate to systable_inplace_update_begin().
 */
void
heap_inplace_update(Relation relation, HeapTuple tuple)

src/backend/access/index/genam.c

@@ -24,6 +24,7 @@
#include "access/relscan.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
@@ -748,3 +749,139 @@ systable_endscan_ordered(SysScanDesc sysscan)
    UnregisterSnapshot(sysscan->snapshot);
    pfree(sysscan);
}

/*
 * systable_inplace_update_begin --- update a row "in place" (overwrite it)
 *
 * Overwriting violates both MVCC and transactional safety, so the uses of
 * this function in Postgres are extremely limited.  Nonetheless we find some
 * places to use it.  Standard flow:
 *
 * ... [any slow preparation not requiring oldtup] ...
 * systable_inplace_update_begin([...], &tup, &inplace_state);
 * if (!HeapTupleIsValid(tup))
 *     elog(ERROR, [...]);
 * ... [buffer is exclusive-locked; mutate "tup"] ...
 * if (dirty)
 *     systable_inplace_update_finish(inplace_state, tup);
 * else
 *     systable_inplace_update_cancel(inplace_state);
 *
 * The first several params duplicate the systable_beginscan() param list.
 * "oldtupcopy" is an output parameter, assigned NULL if the key ceases to
 * find a live tuple.  (In PROC_IN_VACUUM, that is a low-probability transient
 * condition.)  If "oldtupcopy" gets non-NULL, you must pass output parameter
 * "state" to systable_inplace_update_finish() or
 * systable_inplace_update_cancel().
 */
void
systable_inplace_update_begin(Relation relation,
                              Oid indexId,
                              bool indexOK,
                              Snapshot snapshot,
                              int nkeys, const ScanKeyData *key,
                              HeapTuple *oldtupcopy,
                              void **state)
{
    ScanKey     mutable_key = palloc(sizeof(ScanKeyData) * nkeys);
    int         retries = 0;
    SysScanDesc scan;
    HeapTuple   oldtup;

    /*
     * For now, we don't allow parallel updates.  Unlike a regular update,
     * this should never create a combo CID, so it might be possible to relax
     * this restriction, but not without more thought and testing.  It's not
     * clear that it would be useful, anyway.
     */
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot update tuples during a parallel operation")));

    /*
     * Accept a snapshot argument, for symmetry, but this function advances
     * its snapshot as needed to reach the tail of the updated tuple chain.
     */
    Assert(snapshot == NULL);

    Assert(IsInplaceUpdateRelation(relation) || !IsSystemRelation(relation));

    /* Loop for an exclusive-locked buffer of a non-updated tuple. */
    for (;;)
    {
        TupleTableSlot *slot;
        BufferHeapTupleTableSlot *bslot;

        CHECK_FOR_INTERRUPTS();

        /*
         * Processes issuing heap_update (e.g. GRANT) at maximum speed could
         * drive us to this error.  A hostile table owner has stronger ways to
         * damage their own table, so that's minor.
         */
        if (retries++ > 10000)
            elog(ERROR, "giving up after too many tries to overwrite row");

        memcpy(mutable_key, key, sizeof(ScanKeyData) * nkeys);
        scan = systable_beginscan(relation, indexId, indexOK, snapshot,
                                  nkeys, mutable_key);
        oldtup = systable_getnext(scan);
        if (!HeapTupleIsValid(oldtup))
        {
            systable_endscan(scan);
            *oldtupcopy = NULL;
            return;
        }

        slot = scan->slot;
        Assert(TTS_IS_BUFFERTUPLE(slot));
        bslot = (BufferHeapTupleTableSlot *) slot;
        if (heap_inplace_lock(scan->heap_rel,
                              bslot->base.tuple, bslot->buffer))
            break;
        systable_endscan(scan);
    };

    *oldtupcopy = heap_copytuple(oldtup);
    *state = scan;
}

/*
 * systable_inplace_update_finish --- second phase of inplace update
 *
 * The tuple cannot change size, and therefore its header fields and null
 * bitmap (if any) don't change either.
 */
void
systable_inplace_update_finish(void *state, HeapTuple tuple)
{
    SysScanDesc scan = (SysScanDesc) state;
    Relation    relation = scan->heap_rel;
    TupleTableSlot *slot = scan->slot;
    BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
    HeapTuple   oldtup = bslot->base.tuple;
    Buffer      buffer = bslot->buffer;

    heap_inplace_update_and_unlock(relation, oldtup, tuple, buffer);
    systable_endscan(scan);
}

/*
 * systable_inplace_update_cancel --- abandon inplace update
 *
 * This is an alternative to making a no-op update.
 */
void
systable_inplace_update_cancel(void *state)
{
    SysScanDesc scan = (SysScanDesc) state;
    Relation    relation = scan->heap_rel;
    TupleTableSlot *slot = scan->slot;
    BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
    HeapTuple   oldtup = bslot->base.tuple;
    Buffer      buffer = bslot->buffer;

    heap_inplace_unlock(relation, oldtup, buffer);
    systable_endscan(scan);
}
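
As a usage illustration only, the sketch below follows the "standard flow" from the systable_inplace_update_begin() header comment to flip pg_class.relhasindex for one relation. It assumes the caller has already opened pg_class with a suitable lock; the function name set_relhasindex_inplace() is hypothetical and not part of this commit, while ClassOidIndexId, Anum_pg_class_oid, and F_OIDEQ are the usual catalog identifiers.

#include "postgres.h"

#include "access/genam.h"
#include "access/htup_details.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"

/*
 * Hypothetical caller: overwrite relhasindex in place for the pg_class row
 * of "reloid".  pg_class_rel must already be open and locked by the caller.
 */
static void
set_relhasindex_inplace(Relation pg_class_rel, Oid reloid, bool hasindex)
{
    ScanKeyData key[1];
    HeapTuple   tup;
    void       *state;

    ScanKeyInit(&key[0],
                Anum_pg_class_oid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(reloid));

    /* snapshot must be NULL; the function advances its own snapshot */
    systable_inplace_update_begin(pg_class_rel, ClassOidIndexId, true,
                                  NULL, 1, key, &tup, &state);
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", reloid);

    if (((Form_pg_class) GETSTRUCT(tup))->relhasindex != hasindex)
    {
        /* buffer is exclusive-locked; mutate the returned copy, then write it back */
        ((Form_pg_class) GETSTRUCT(tup))->relhasindex = hasindex;
        systable_inplace_update_finish(state, tup);
    }
    else
        systable_inplace_update_cancel(state);

    heap_freetuple(tup);
}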