I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:
* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)
The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07aef9e508ff047d128a4046cc8bc1e2, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.
Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
Commit 1b86c81d2d fixed the decoding of toasted columns for the rows
contained in one xl_heap_multi_insert record. But that's not actually
enough, because heap_multi_insert() will actually first toast all
passed in rows and then emit several *_multi_insert records; one for
each page it fills with tuples.
Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in
xl_heap_multi_insert->flag denoting that this multi_insert record is
the last emitted by one heap_multi_insert() call. Then use that flag
in decode.c to only set clear_toast_afterwards in the right situation.
Expand the number of rows inserted via COPY in the corresponding
regression test to make sure that more than one heap page is filled
with tuples by one heap_multi_insert() call.
Backpatch to 9.4 like the previous commit.
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time. The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.
Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum. Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.
Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.
While at it, update some comments that hadn't been updated since
multixacts were changed.
Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134f2.
In a50d97625497b7 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.
As reported by Serge Negodyuck in a followup to bug #8673.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.
Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.
Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.
To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.
Backpatch to all supported versions.
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row. The same applies for ranges over composite types, nested
composites, etc. However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites. For example, given a command such
as
UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...
where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column. If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced. A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.
Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().
I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out). The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.
This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple. This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first. With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.
The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before. On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings. Experimentation suggests that common SQL
coding patterns are unaffected either way, though. Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.
In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.
Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled. The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.
This bug has existed since TOAST was invented, so back-patch to all
supported branches.
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.
The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple. But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same. This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.
This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.
To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.
Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates. This should protect
against other bugs that might cause similar bogus situations.
Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced. (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything. Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)
Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com
Bug analysis by Andres Freund.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.
In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.
Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.
This change is completely backwards-compatible, no effect on pg_upgrade.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.
There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.
The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy to WAL logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.
The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also make the repacking" code smarter to avoid
decoding and re-encoding segments unnecessarily.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)
Reviewed by Andres Freund.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.
Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.
Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.
I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.
Reviewed by Peter Geoghegan.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.
Problem
-------
We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):
ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"
We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.
To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.
Solution
--------
This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.
This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.
This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.
Reviewed by Kevin Grittner and Peter Geoghegan.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.
Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
With the GIN "fast scan" feature, GIN can skip items without fetching all
the keys for them, if it can prove that they don't match regardless of
those keys. So far, it has done the proving by calling the boolean
consistent function with all combinations of TRUE/FALSE for the unfetched
keys, but since that's O(n^2), it becomes unfeasible with more than a few
keys. We can avoid calling consistent with all the combinations, if we can
tell the operator class implementation directly which keys are unknown.
This commit includes a triConsistent function for the built-in array and
tsvector opclasses.
Alexander Korotkov, with some changes by me.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
If you have a GIN query like "rare & frequent", we currently fetch all the
items that match either rare or frequent, call the consistent function for
each item, and let the consistent function filter out items that only match
one of the terms. However, if we can deduce that "rare" must be present for
the overall qual to be true, we can scan all the rare items, and for each
rare item, skip over to the next frequent item with the same or greater TID.
That greatly speeds up "rare & frequent" type queries.
To implement that, introduce the concept of a tri-state consistent function,
where the 3rd value is MAYBE, indicating that we don't know if that term is
present. Operator classes only provide a boolean consistent function, so we
simulate the tri-state consistent function by calling the boolean function
several times, with the MAYBE arguments set to all combinations of TRUE and
FALSE. Testing all combinations is only feasible for a small number of MAYBE
arguments, but it is envisioned that we'll provide a way for operator
classes to provide a native tri-state consistent function, which can be much
more efficient. But that is not included in this patch.
We were already using that trick to for lossy pages, calling the consistent
function with the lossy entry set to TRUE and FALSE. Now that we have the
tri-state consistent function, use it for lossy pages too.
Alexander Korotkov, with fair amount of refactoring by me.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts. Slots have some advantages over existing
techniques, as explained in the documentation.
In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.
Andres Freund and Robert Haas
When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.
Heavily modified from Alexander Korotkov's fast scan patch.
If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.
Extracted from Alexander Korotkov's fast scan patch, heavily modified.
This makes it possible to store lwlocks as part of some other data
structure in the main shared memory segment, or in a dynamic shared
memory segment. There is still a main LWLock array and this patch does
not move anything out of it, but it provides necessary infrastructure
for doing that in the future.
This change is likely to increase the size of LWLockPadded on some
platforms, especially 32-bit platforms where it was previously only
16 bytes.
Patch by me. Review by Andres Freund and KaiGai Kohei.
This allows ending recovery as a consistent state has been reached. Without
this, there was no easy way to e.g restore an online backup, without
replaying any extra WAL after the backup ended.
MauMau and me.
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.
To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.
This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.
Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted. (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.) In Hot Standby,
the WAL replay process must likewise lock every leaf page. There were
several bugs in the code for that:
* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error. This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages". To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.
* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about). Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.
* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page. However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.
The first two of these bugs were diagnosed by Andres Freund, the other one
by me. Fixes based on ideas from Heikki Linnakangas and myself.
This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
In pg_multixact/members, relying on modulo-2^32 arithmetic for
wraparound handling doesn't work all that well. Because we don't
explicitely track wraparound of the allocation counter for members, it
is possible that the "live" area exceeds 2^31 entries; trying to remove
SLRU segments that are "old" according to the original logic might lead
to removal of segments still in use. To fix, have the truncation
routine use a tailored SlruScanDirectory callback that keeps track of
the live area in actual use; that way, when the live range exceeds 2^31
entries, the oldest segments still live will not get removed untimely.
This new SlruScanDir callback needs to take care not to remove segments
that are "in the future": if new SLRU segments appear while the
truncation is ongoing, make sure we don't remove them. This requires
examination of shared memory state to recheck for false positives, but
testing suggests that this doesn't cause a problem. The original coding
didn't suffer from this pitfall because segments created when truncation
is running are never considered to be removable.
Per Andres Freund's investigation of bug #8673 reported by Serge
Negodyuck.
Instead of changing the tuple xmin to FrozenTransactionId, the combination
of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never
set together, is now defined as HEAP_XMIN_FROZEN. A variety of previous
proposals to freeze tuples opportunistically before vacuum_freeze_min_age
is reached have foundered on the objection that replacing xmin by
FrozenTransactionId might hinder debugging efforts when things in this
area go awry; this patch is intended to solve that problem by keeping
the XID around (but largely ignoring the value to which it is set).
Third-party code that checks for HEAP_XMIN_INVALID on tuples where
HEAP_XMIN_COMMITTED might be set will be broken by this change. To fix,
use the new accessor macros in htup_details.h rather than consulting the
bits directly. HeapTupleHeaderGetXmin has been modified to return
FrozenTransactionId when the infomask bits indicate that the tuple is
frozen; use HeapTupleHeaderGetRawXmin when you already know that the
tuple isn't marked commited or frozen, or want the raw value anyway.
We currently do this in routines that display the xmin for user consumption,
in tqual.c where it's known to be safe and important for the avoidance of
extra cycles, and in the function-caching code for various procedural
languages, which shouldn't invalidate the cache just because the tuple
gets frozen.
Robert Haas and Andres Freund
If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple. This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).
This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134f it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.
It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows are missing with
some referencing row remaining.
Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
Tuple freezing was broken in connection to MultiXactIds; commit
8e53ae025de9 tried to fix it, but didn't go far enough. As noted by
Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted
update might cause locks in the multi to go ignored by later
transactions. This is because the code depended on a multixact above
their cutoff point not having any lock-only member older than the cutoff
point for Xids, which is easily defeated in READ COMMITTED transactions.
The fix for this involves creating a new MultiXactId when necessary.
But this cannot be done during WAL replay, and moreover multixact
examination requires using CLOG access routines which are not supposed
to be used during WAL replay either; so tuple freezing cannot be done
with the old freeze WAL record. Therefore, separate the freezing
computation from its execution, and change the WAL record to carry all
necessary information. At WAL replay time, it's easy to re-execute
freezing because we don't need to re-compute the new infomask/Xmax
values but just take them from the WAL record.
While at it, restructure the coding to ensure all page changes occur in
a single critical section without much room for failures. The previous
coding wasn't using a critical section, without any explanation as to
why this was acceptable.
In replication scenarios using the 9.3 branch, standby servers must be
upgraded before their master, so that they are prepared to deal with the
new WAL record once the master is upgraded; failure to do so will cause
WAL replay to die with a PANIC message. Later upgrade of the standby
will allow the process to continue where it left off, so there's no
disruption of the data in the standby in any case. Standbys know how to
deal with the old WAL record, so it's okay to keep the master running
the old code for a while.
In master, the old freeze WAL record is gone, for cleanliness' sake;
there's no compatibility concern there.
Backpatch to 9.3, where the original bug was introduced and where the
previous fix was backpatched.
Álvaro Herrera and Andres Freund
WAL records of hint bit updates is useful to tools that want to examine
which pages have been modified. In particular, this is required to make
the pg_rewind tool safe (without checksums).
This can also be used to test how much extra WAL-logging would occur if
you enabled checksums, without actually enabling them (which you can't
currently do without re-initdb'ing).
Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with
further changes by me.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983ef79be4a84fcd0e0ca3b5fcb85dd65. This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all. Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.
Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.
Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.