Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.
Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.
Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
Every redo routine uses the same idiom to determine what to do to a page:
check if there's a backup block for it, and if not read, the buffer if the
block exists, and check its LSN. Refactor that into a common function,
XLogReadBufferForRedo, making all the redo routines shorter and more
readable.
This has no user-visible effect, and makes no changes to the WAL format.
Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
log_newpage is used by many indexams, in addition to heap, but for
historical reasons it's always been part of the heapam rmgr. Starting with
9.3, we have another WAL record type for logging an image of a page,
XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to
xlog.c, and switch to using the XLOG_FPI record type.
Bump the WAL version number because the code to replay the old HEAP_NEWPAGE
records is removed.
Previously, the code used a node label of zero both for strings that
contain no bytes beyond the inner tuple's prefix, and for cases where an
"allTheSame" inner tuple has to be split to allow a string with a different
next byte to be inserted into it. Failing to distinguish these cases meant
that if a string ending with the current prefix needed to be inserted into
an allTheSame tuple, we got into an infinite loop, because after splitting
the tuple we'd descend into the child allTheSame tuple and then find we
need to split again.
To fix, instead use -1 and -2 as the node labels for these two cases.
This requires widening the node label type from "char" to int2, but
fortunately SPGiST stores all pass-by-value node label types in their
Datum representation, which means that this change is transparently upward
compatible so far as the on-disk representation goes. We continue to
recognize zero as a dummy node label for reading purposes, but will not
attempt to push new index entries down into such a label, so that the loop
won't occur even when dealing with an existing index.
Per report from Teodor Sigaev. Back-patch to 9.2 where the faulty
code was introduced.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.
Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.
Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has". Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64). So let's
start using "z" instead. To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".
Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead. This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them. I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.
It's possible that this change will result in format-string warnings from
pre-C99 compilers. We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.
Andres Freund, with a little additional work by me
The original coding thought this case was impossible, but it can happen
if the bgwriter or checkpointer processes decide to write out an index
page while creation is still proceeding, leading to a bogus "unexpected
spgdoinsert() failure" error. Problem reported by Jonathan S. Katz.
Teodor Sigaev
SP-GiST's original scheme for avoiding deadlocks during concurrent index
insertions doesn't work, as per report from Hailong Li, and there isn't any
evident way to make it work completely. We could possibly lock individual
inner tuples instead of their whole pages, but preliminary experimentation
suggests that the performance penalty would be huge. Instead, if we fail
to get a buffer lock while descending the tree, just restart the tree
descent altogether. We keep the old tuple positioning rules, though, in
hopes of reducing the number of cases where this can happen.
Teodor Sigaev, somewhat edited by Tom Lane
What we have implemented is a radix tree (or a radix trie or a patricia
trie), but the docs and code comments incorrectly called it a "suffix tree".
Alexander Korotkov
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.
WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.
Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.
Checksum used is WAL CRC-32 truncated to 16-bits.
Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.
Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.
Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter. The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway. The clarity of simpler expressions
makes the change worthwhile.
Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).
Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states. This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.
The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page. This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates. Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another. Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.
To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time. This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images. We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods. A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.
In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images. In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are
still there in the header files in previous branches, but are no longer
used by the code.
In addition, fix some other bugs identified in the course of making these
changes:
spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already. This would result in permanent index corruption,
not just transient failure of concurrent queries.
Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.
heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image. A plain exclusive lock seems sufficient, so change to
that.
Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.
Back-patch to 9.0, where hot standby was introduced. Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records. Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
This includes two micro-optimizations to the tight inner loop in descending
the SP-GiST tree: 1. avoid an extra function call to index_getprocinfo when
calling user-defined choose function, and 2. avoid a useless palloc+pfree
when node labels are not used.
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API. With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.
Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers. This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
In yesterday's commit 962e0cc71e839c58fb9125fa85511b8bbb8bdbee, I added the
ResolveRecoveryConflictWithSnapshot call in the wrong place. I correctly
put it before spgRedoVacuumRedirect itself would modify the index page ---
but not before RestoreBkpBlocks, so replay of a record with a full-page
image would modify the page before kicking off any conflicting HS
transactions. Oops.
The correct test for whether a redirection tuple is removable is whether
tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding
failed to protect index searches being done in concurrent transactions that
have no XID. This mirrors the recent fix in btree's page recycling logic
made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2.
Also, WAL-log the newest XID of any removed redirection tuple on an index
page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL
replay. This protects against concurrent Hot Standby transactions possibly
needing to see the redirection tuple(s).
Per my query of 2012-03-12 and subsequent discussion.
The original coding failed miserably for BLCKSZ of 4K or less, as reported
by Josh Kupershmidt. With the present design for text indexes, a given
inner tuple could have up to 256 labels (requiring either 3K or 4K bytes
depending on MAXALIGN), which means that we can't positively guarantee no
failures for smaller blocksizes. But we can at least make it behave sanely
so long as there are few enough labels to fit on a page. Considering that
btree is also more prone to "index tuple too large" failures when BLCKSZ is
small, it's not clear that we should expend more work than this on this
case.
The latter was already the dominant use, and it's preferable because
in C the convention is that intXX means XX bits. Therefore, allowing
mixed use of int2, int4, int8, int16, int32 is obviously confusing.
Remove the typedefs for int2 and int4 for now. They don't seem to be
widely used outside of the PostgreSQL source tree, and the few uses
can probably be cleaned up by the time this ships.
A leaf tuple that we need to delete could get moved as a consequence of an
insertion happening concurrently with the VACUUM scan. If it moves from a
page past the current scan point to a page before, we'll miss it, which is
not acceptable. Hence, when we see a leaf-page REDIRECT that could have
been made since our scan started, chase down the redirection pointer much
as if we were doing a normal index search, and be sure to vacuum every page
it leads to. This fixes the issue because, if the tuple was on page N at
the instant we start our scan, we will surely find it as a consequence of
chasing the redirect from page N, no matter how much it moves around in
between. Problem noted by Takashi Yamamoto.
This patch fixes the other major compatibility-breaking limitation of
SPGiST, that it didn't store anything for null values of the indexed
column, and so could not support whole-index scans or "x IS NULL"
tests. The approach is to create a wholly separate search tree for
the null entries, and use fixed "allTheSame" insertion and search
rules when processing this tree, instead of calling the index opclass
methods. This way the opclass methods do not need to worry about
dealing with nulls.
Catversion bump is for pg_am updates as well as the change in on-disk
format of SPGiST indexes; there are some tweaks in SPGiST WAL records
as well.
Heavily rewritten version of a patch by Oleg Bartunov and Teodor Sigaev.
(The original also stored nulls separately, but it reused GIN code to do
so; which required undesirable compromises in the on-disk format, and
would likely lead to bugs due to the GIN code being required to work in
two very different contexts.)
The original API definition was incapable of supporting whole-index scans
because there was no way to invoke leaf-value reconstruction without
checking any qual conditions. Also, it was inefficient for
multiple-qual-condition scans because value reconstruction got done over
again for each qual condition, and because other internal work in the
consistent functions likewise had to be done for each qual. To fix these
issues, pass the whole scankey array to the opclass consistent functions,
instead of only letting them see one item at a time. (Essentially, the
loop over scankey entries is now inside the consistent functions not
outside them. This makes the consistent functions a bit more complicated,
but not unreasonably so.)
In itself this commit does nothing except save a few cycles in
multiple-qual-condition index scans, since we can't support whole-index
scans on SPGiST indexes until nulls are included in the index. However,
I consider this a must-fix for 9.2 because once we release it will get
very much harder to change the opclass API definition.
On reflection, the original name seems way too generic for a global
symbol. A quick check shows this is the only exported function name
in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the
rest of them seem all right.
Operator classes can specify whether or not they support this; this
preserves the flexibility to use lossy representations within an index.
In passing, move constant data about a given index into the rd_amcache
cache area, instead of doing fresh lookups each time we start an index
operation. This is mainly to try to make sure that spgcanreturn() has
insignificant cost; I still don't have any proof that it matters for
actual index accesses. Also, get rid of useless copying of FmgrInfo
pointers; we can perfectly well use the relcache's versions in-place.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple. However,
the SP-GiST AM forces the issue, so let's fix it.
This patch only installs the new API; no behavior actually changes.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees. As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.
There are a number of areas that could still use improvement, but at this
point the code seems committable.
Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane