1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-15 03:41:20 +03:00

Fix multiple problems in WAL replay.

Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states.  This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.

The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page.  This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates.  Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another.  Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.

To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time.  This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images.  We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods.  A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.

In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images.  In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4.  They are
still there in the header files in previous branches, but are no longer
used by the code.

In addition, fix some other bugs identified in the course of making these
changes:

spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already.  This would result in permanent index corruption,
not just transient failure of concurrent queries.

Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image.  A plain exclusive lock seems sufficient, so change to
that.

Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.

Back-patch to 9.0, where hot standby was introduced.  Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records.  Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
This commit is contained in:
Tom Lane
2012-11-12 22:05:08 -05:00
parent 9b3ac49e5a
commit 3bbf668de9
10 changed files with 761 additions and 472 deletions

View File

@@ -438,8 +438,9 @@ critical section.)
4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
5. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location. For instance,
5. If the relation requires WAL-logging, build a WAL log record and pass it
to XLogInsert(); then update the page's LSN and TLI using the returned XLOG
location. For instance,
recptr = XLogInsert(rmgr_id, info, rdata);
@@ -466,9 +467,9 @@ which buffers were handled that way --- otherwise they may be misled about
what the XLOG record actually contains. XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit. An example of the trickiness
is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
page and BKP(2) is associated with the destination page --- but if these are
the same page, only BKP(1) would have been set.
is that in a HEAP_UPDATE record, BKP(0) normally is associated with the source
page and BKP(1) is associated with the destination page --- but if these are
the same page, only BKP(0) would have been set.
For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
@@ -497,12 +498,19 @@ incrementally update the page, the rdata array *must* mention the buffer
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is
if (record->xl_info & XLR_BKP_BLOCK_n)
<< do nothing, page was rewritten from logged copy >>;
if (record->xl_info & XLR_BKP_BLOCK(N))
{
/* apply the change from the full-page image */
(void) RestoreBackupBlock(lsn, record, N, false, false);
return;
}
buffer = XLogReadBuffer(rnode, blkno, false);
if (!BufferIsValid(buffer))
<< do nothing, page has been deleted >>;
{
/* page has been deleted, so we need do nothing */
return;
}
page = (Page) BufferGetPage(buffer);
if (XLByteLE(lsn, PageGetLSN(page)))
@@ -520,13 +528,42 @@ The standard replay-routine pattern for this case is
UnlockReleaseBuffer(buffer);
As noted above, for a multi-page update you need to be able to determine
which XLR_BKP_BLOCK_n flag applies to each page. If a WAL record reflects
which XLR_BKP_BLOCK(N) flag applies to each page. If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
pages don't count for the XLR_BKP_BLOCK_n numbering. (XLR_BKP_BLOCK_n is
associated with the n'th distinct buffer ID seen in the "rdata" array, and
pages don't count for the XLR_BKP_BLOCK(N) numbering. (XLR_BKP_BLOCK(N) is
associated with the N'th distinct buffer ID seen in the "rdata" array, and
per the above discussion, fully-rewritable buffers shouldn't be mentioned in
"rdata".)
When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state. If this requires that two
or more buffer locks be held concurrently, the coding pattern shown above
is too simplistic, since it assumes the routine can exit as soon as it's
known the current page requires no modification. Instead, you might have
something like
if (record->xl_info & XLR_BKP_BLOCK(0))
{
/* apply the change from the full-page image */
buffer0 = RestoreBackupBlock(lsn, record, 0, false, true);
}
else
{
buffer0 = XLogReadBuffer(rnode, blkno, false);
if (BufferIsValid(buffer0))
{
... apply the change if not already done ...
MarkBufferDirty(buffer0);
}
}
... similarly apply the changes for remaining pages ...
/* and now we can release the lock on the first page */
if (BufferIsValid(buffer0))
UnlockReleaseBuffer(buffer0);
Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?

View File

@@ -835,8 +835,8 @@ begin:;
* At the exit of this loop, write_len includes the backup block data.
*
* Also set the appropriate info bits to show which buffers were backed
* up. The i'th XLR_SET_BKP_BLOCK bit corresponds to the i'th distinct
* buffer value (ignoring InvalidBuffer) appearing in the rdata chain.
* up. The XLR_BKP_BLOCK(N) bit corresponds to the N'th distinct buffer
* value (ignoring InvalidBuffer) appearing in the rdata chain.
*/
rdt_lastnormal = rdt;
write_len = len;
@@ -848,7 +848,7 @@ begin:;
if (!dtbuf_bkp[i])
continue;
info |= XLR_SET_BKP_BLOCK(i);
info |= XLR_BKP_BLOCK(i);
bkpb = &(dtbuf_xlg[i]);
page = (char *) BufferGetBlock(dtbuf[i]);
@@ -3080,9 +3080,16 @@ CleanupBackupHistory(void)
}
/*
* Restore the backup blocks present in an XLOG record, if any.
* Restore a full-page image from a backup block attached to an XLOG record.
*
* We assume all of the record has been read into memory at *record.
* lsn: LSN of the XLOG record being replayed
* record: the complete XLOG record
* block_index: which backup block to restore (0 .. XLR_MAX_BKP_BLOCKS - 1)
* get_cleanup_lock: TRUE to get a cleanup rather than plain exclusive lock
* keep_buffer: TRUE to return the buffer still locked and pinned
*
* Returns the buffer number containing the page. Note this is not terribly
* useful unless keep_buffer is specified as TRUE.
*
* Note: when a backup block is available in XLOG, we restore it
* unconditionally, even if the page in the database appears newer.
@@ -3093,15 +3100,20 @@ CleanupBackupHistory(void)
* modifications of the page that appear in XLOG, rather than possibly
* ignoring them as already applied, but that's not a huge drawback.
*
* If 'cleanup' is true, a cleanup lock is used when restoring blocks.
* Otherwise, a normal exclusive lock is used. During crash recovery, that's
* just pro forma because there can't be any regular backends in the system,
* but in hot standby mode the distinction is important. The 'cleanup'
* argument applies to all backup blocks in the WAL record, that suffices for
* now.
* If 'get_cleanup_lock' is true, a cleanup lock is obtained on the buffer,
* else a normal exclusive lock is used. During crash recovery, that's just
* pro forma because there can't be any regular backends in the system, but
* in hot standby mode the distinction is important.
*
* If 'keep_buffer' is true, return without releasing the buffer lock and pin;
* then caller is responsible for doing UnlockReleaseBuffer() later. This
* is needed in some cases when replaying XLOG records that touch multiple
* pages, to prevent inconsistent states from being visible to other backends.
* (Again, that's only important in hot standby mode.)
*/
void
RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
Buffer
RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
bool get_cleanup_lock, bool keep_buffer)
{
Buffer buffer;
Page page;
@@ -3109,49 +3121,59 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
char *blk;
int i;
if (!(record->xl_info & XLR_BKP_BLOCK_MASK))
return;
/* Locate requested BkpBlock in the record */
blk = (char *) XLogRecGetData(record) + record->xl_len;
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
if (!(record->xl_info & XLR_BKP_BLOCK(i)))
continue;
memcpy(&bkpb, blk, sizeof(BkpBlock));
blk += sizeof(BkpBlock);
buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
RBM_ZERO);
Assert(BufferIsValid(buffer));
if (cleanup)
LockBufferForCleanup(buffer);
else
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
if (bkpb.hole_length == 0)
if (i == block_index)
{
memcpy((char *) page, blk, BLCKSZ);
}
else
{
memcpy((char *) page, blk, bkpb.hole_offset);
/* must zero-fill the hole */
MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
blk + bkpb.hole_offset,
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
/* Found it, apply the update */
buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
RBM_ZERO);
Assert(BufferIsValid(buffer));
if (get_cleanup_lock)
LockBufferForCleanup(buffer);
else
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
page = (Page) BufferGetPage(buffer);
if (bkpb.hole_length == 0)
{
memcpy((char *) page, blk, BLCKSZ);
}
else
{
memcpy((char *) page, blk, bkpb.hole_offset);
/* must zero-fill the hole */
MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
blk + bkpb.hole_offset,
BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
}
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);
if (!keep_buffer)
UnlockReleaseBuffer(buffer);
return buffer;
}
blk += BLCKSZ - bkpb.hole_length;
}
/* Caller specified a bogus block_index */
elog(ERROR, "failed to restore block_index %d", block_index);
return InvalidBuffer; /* keep compiler quiet */
}
/*
@@ -3193,7 +3215,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
{
uint32 blen;
if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
if (!(record->xl_info & XLR_BKP_BLOCK(i)))
continue;
if (remaining < sizeof(BkpBlock))
@@ -8081,7 +8103,8 @@ xlog_outrec(StringInfo buf, XLogRecord *record)
int i;
appendStringInfo(buf, "prev %X/%X; xid %u",
(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
(uint32) (record->xl_prev >> 32),
(uint32) record->xl_prev,
record->xl_xid);
appendStringInfo(buf, "; len %u",
@@ -8089,8 +8112,8 @@ xlog_outrec(StringInfo buf, XLogRecord *record)
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
if (record->xl_info & XLR_SET_BKP_BLOCK(i))
appendStringInfo(buf, "; bkpb%d", i + 1);
if (record->xl_info & XLR_BKP_BLOCK(i))
appendStringInfo(buf, "; bkpb%d", i);
}
appendStringInfo(buf, ": %s", RmgrTable[record->xl_rmid].rm_name);