This patch fixes several related cases in which pg_shdepend entries were
never made, or were lost, for references to roles appearing in the ACLs of
schemas and/or types. While that did no immediate harm, if a referenced
role were later dropped, the drop would be allowed and would leave a
dangling reference in the object's ACL. That still wasn't a big problem
for normal database usage, but it would cause obscure failures in
subsequent dump/reload or pg_upgrade attempts, taking the form of
attempts to grant privileges to all-numeric role names. (I think I've
seen field reports matching that symptom, but can't find any right now.)
Several cases are fixed here:
1. ALTER DOMAIN SET/DROP DEFAULT would lose the dependencies for any
existing ACL entries for the domain. This case is ancient, dating
back as far as we've had pg_shdepend tracking at all.
2. If a default type privilege applies, CREATE TYPE recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for types in 9.2.
3. If a default schema privilege applies, CREATE SCHEMA recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for schemas in v10
(commit ab89e465c).
Another somewhat-related problem is that when creating a relation
rowtype or implicit array type, TypeCreate would apply any available
default type privileges to that type, which we don't really want
since such an object isn't supposed to have privileges of its own.
(You can't, for example, drop such privileges once they've been added
to an array type.)
ab89e465c is also to blame for a race condition in the regression tests:
privileges.sql transiently installed globally-applicable default
privileges on schemas, which sometimes got absorbed into the ACLs of
schemas created by concurrent test scripts. This should have resulted
in failures when privileges.sql tried to drop the role holding such
privileges; but thanks to the bug fixed here, it instead led to dangling
ACLs in the final state of the regression database. We'd managed not to
notice that, but it became obvious in the wake of commit da906766c, which
allowed the race condition to occur in pg_upgrade tests.
To fix, add a function recordDependencyOnNewAcl to encapsulate what
callers of get_user_default_acl need to do; while the original call
sites got that right via ad-hoc code, none of the later-added ones
have. Also change GenerateTypeDependencies to generate these
dependencies, which requires adding the typacl to its parameter list.
(That might be annoying if there are any extensions calling that
function directly; but if there are, they're most likely buggy in the
same way as the core callers were, so they need work anyway.) While
I was at it, I changed GenerateTypeDependencies to accept most of its
parameters in the form of a Form_pg_type pointer, making its parameter
list a bit less unwieldy and mistake-prone.
The test race condition is fixed just by wrapping the addition and
removal of default privileges into a single transaction, so that that
state is never visible externally. We might eventually prefer to
separate out tests of default privileges into a script that runs by
itself, but that would be a bigger change and would make the tests
run slower overall.
Back-patch relevant parts to all supported branches.
Discussion: https://postgr.es/m/15719.1541725287@sss.pgh.pa.us
This commit fixes a set of issues with ON COMMIT actions when used on
partitioned tables and tables with inheritance children:
- Applying ON COMMIT DROP on a partitioned table with partitions or on a
table with inheritance children caused a failure at commit time, with
complains about the children being already dropped as all relations are
dropped one at the same time.
- Applying ON COMMIT DELETE on a partition relying on a partitioned
table which uses ON COMMIT DROP would cause the partition truncation to
fail as the parent is removed first.
The solution to the first problem is to handle the removal of all the
dependencies in one go instead of dropping relations one-by-one, based
on a suggestion from Álvaro Herrera. So instead all the relation OIDs
to remove are gathered and then processed in one round of multiple
deletions.
The solution to the second problem is to reorder the actions, with
truncation happening first and relation drop done after. Even if it
means that a partition could be first truncated, then immediately
dropped if its partitioned table is dropped, this has the merit to keep
the code simple as there is no need to do existence checks on the
relations to drop.
Contrary to a manual TRUNCATE on a partitioned table, ON COMMIT DELETE
does not cascade to its partitions. The ON COMMIT action defined on
each partition gets the priority.
Author: Michael Paquier
Reviewed-by: Amit Langote, Álvaro Herrera, Robert Haas
Discussion: https://postgr.es/m/68f17907-ec98-1192-f99f-8011400517f5@lab.ntt.co.jp
Backpatch-through: 10
Previously it was possible to set client_min_messages to FATAL or PANIC,
which had the effect of suppressing transmission of regular ERROR messages
to the client. Perhaps that seemed like a useful option in the past, but
the trouble with it is that it breaks guarantees that are explicitly made
in our FE/BE protocol spec about how a query cycle can end. While libpq
and psql manage to cope with the omission, that's mostly because they
are not very bright; client libraries that have more semantic knowledge
are likely to get confused. Notably, pgODBC doesn't behave very sanely.
Let's fix this by getting rid of the ability to set client_min_messages
above ERROR.
In HEAD, just remove the FATAL and PANIC options from the set of allowed
enum values for client_min_messages. (This change also affects
trace_recovery_messages, but that's OK since these aren't useful values
for that variable either.)
In the back branches, there was concern that rejecting these values might
break applications that are explicitly setting things that way. I'm
pretty skeptical of that argument, but accommodate it by accepting these
values and then internally setting the variable to ERROR anyway.
In all branches, this allows a couple of tiny simplifications in the
logic in elog.c, so do that.
Also respond to the point that was made that client_min_messages has
exactly nothing to do with the server's logging behavior, and therefore
does not belong in the "When To Log" subsection of the documentation.
The "Statement Behavior" subsection is a better match, so move it there.
Jonah Harris and Tom Lane
Discussion: https://postgr.es/m/7809.1541521180@sss.pgh.pa.us
Discussion: https://postgr.es/m/15479-ef0f4cc2fd995ca2@postgresql.org
The original code to propagate NOT NULL and default expressions
specified when creating a partition was mostly copy-pasted from
typed-tables creation, but not being a great match it contained some
duplicity, inefficiency and bugs.
This commit fixes the bug that NOT NULL constraints declared in the
parent table would not be honored in the partition. One reported issue
that is not fixed is that a DEFAULT declared in the child is not used
when inserting through the parent. That would amount to a behavioral
change that's better not back-patched.
This rewrite makes the code simpler:
1. instead of checking for duplicate column names in its own block,
reuse the original one that already did that;
2. instead of concatenating the list of columns from parent and the one
declared in the partition and scanning the result to (incorrectly)
propagate defaults and not-null constraints, just scan the latter
searching the former for a match, and merging sensibly. This works
because we know the list in the parent is already correct and there can
only be one parent.
This rewrite makes ColumnDef->is_from_parent unused, so it's removed
on branch master; on released branches, it's kept as an unused field in
order not to cause ABI incompatibilities.
This commit also adds a test case for creating partitions with
collations mismatching that on the parent table, something that is
closely related to the code being patched. No code change is introduced
though, since that'd be a behavior change that could break some (broken)
working applications.
Amit Langote wrote a less invasive fix for the original
NOT NULL/defaults bug, but while I kept the tests he added, I ended up
not using his original code. Ashutosh Bapat reviewed Amit's fix. Amit
reviewed mine.
Author: Álvaro Herrera, Amit Langote
Reviewed-by: Ashutosh Bapat, Amit Langote
Reported-by: Jürgen Strobel (bug #15212)
Discussion: https://postgr.es/m/152746742177.1291.9847032632907407358@wrigleys.postgresql.org
The "rb" prefix is used by Ruby, so that our existing code results
in name collisions that break plruby. We discussed ways to prevent
that by adjusting dynamic linker options, but it seems that at best
we'd move the pain to other cases. Renaming to avoid the collision
is the only portable fix anyway. Fortunately, our rbtree code is
not (yet?) widely used --- in core, there's only a single usage
in GIN --- so it seems likely that we can get away with a rename.
I chose to do this basically as s/rb/rbt/g, except for places where
there already was a "t" after "rb". The patch could have been made
smaller by only touching linker-visible symbols, but it would have
resulted in oddly inconsistent-looking code. Better to make it look
like "rbt" was the plan all along.
Back-patch to v10. The rbtree.c code exists back to 9.5, but
rb_iterate() which is the actual immediate source of pain was added
in v10, so it seems like changing the names before that would have
more risk than benefit.
Per report from Pavel Raiskup.
Discussion: https://postgr.es/m/4738198.8KVIIDhgEB@nb.usersys.redhat.com
I removed the item about the pg_stat_statements change from
release-11.sgml, as part of a sweep to delete items already committed
in 11.0; but actually we'd best keep it to ensure that people who've
pg_upgraded their databases will take the requisite action. Also make
said action more visible by making it into its own para. Noted by
Jonathan Katz.
When a partition is created as part of a trigger processing, it is
possible that the partition which just gets created changes the
properties of the table the executor of the ongoing command relies on,
causing a subsequent crash. This has been found possible when for
example using a BEFORE INSERT which creates a new partition for a
partitioned table being inserted to.
Any attempt to do so is blocked when working on a partition, with
regression tests added for both CREATE TABLE PARTITION OF and ALTER
TABLE ATTACH PARTITION.
Reported-by: Dmitry Shalashov
Author: Amit Langote
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/15437-3fe01ee66bd1bae1@postgresql.org
Backpatch-through: 10
Those tables have no physical storage, making this option unusable with
partition trees as at commit time an actual truncation was attempted.
There are still issues with the way ON COMMIT actions are done when
mixing several action types, however this impacts as well inheritance
trees, so this issue will be dealt with later.
Reported-by: Rajkumar Raghuwanshi
Author: Amit Langote
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/CAKcux6mhgcjSiB_egqEAEFgX462QZtncU8QCAJ2HZwM-wWGVew@mail.gmail.com
On Windows, in UTF8 database encoding, what char2wchar() produces is
UTF16 not UTF32, ie, characters above U+FFFF will be represented by
surrogate pairs. t_isdigit() and siblings did not account for this
and failed to provide a large enough result buffer. That in turn
led to bogus "invalid multibyte character for locale" errors, because
contrary to what you might think from char2wchar()'s documentation,
its Windows code path doesn't cope sanely with buffer overflow.
The solution for t_isdigit() and siblings is pretty clear: provide
a 3-wchar_t result buffer not 2.
char2wchar() also needs some work to provide more consistent, and more
accurately documented, buffer overrun behavior. But that's a bigger job
and it doesn't actually have any immediate payoff, so leave it for later.
Per bug #15476 from Kenji Uno, who deserves credit for identifying the
cause of the problem. Back-patch to all active branches.
Discussion: https://postgr.es/m/15476-4314f480acf0f114@postgresql.org
The solution arrived at in commit e74dd00f5 presumes that the compiler
has a suitable default -isysroot setting ... but further experience
shows that in many combinations of macOS version, XCode version, Xcode
command line tools version, and phase of the moon, Apple's compiler
will *not* supply a default -isysroot value.
We could potentially go back to the approach used in commit 68fc227dd,
but I don't have a lot of faith in the reliability or life expectancy of
that either. Let's just revert to the approach already shipped in 11.0,
namely specifying an -isysroot switch globally. As a partial response to
the concerns raised by Jakob Egger, adjust the contents of Makefile.global
to look like
CPPFLAGS = -isysroot $(PG_SYSROOT) ...
PG_SYSROOT = /path/to/sysroot
This allows overriding the sysroot path at build time in a relatively
painless way.
Add documentation to installation.sgml about how to use the PG_SYSROOT
option. I also took the opportunity to document how to work around
macOS's "System Integrity Protection" feature.
As before, back-patch to all supported versions.
Discussion: https://postgr.es/m/20840.1537850987@sss.pgh.pa.us
Previously it was possible to create a slot, change wal_level, and
restart, even if the new wal_level was insufficient for the
slot. That's a problem for both logical and physical slots, because
the necessary WAL records are not generated.
This removes a few tests in newer versions that, somewhat
inexplicably, whether restarting with a too low wal_level worked (a
buggy behaviour!).
Reported-By: Joshua D. Drake
Author: Andres Freund
Discussion: https://postgr.es/m/20181029191304.lbsmhshkyymhw22w@alap3.anarazel.de
Backpatch: 9.4-, where replication slots where introduced
spgendscan neglected to pfree all the memory allocated by spgbeginscan.
It's possible to get away with that in most normal queries, since the
memory is allocated in the executor's per-query context which is about
to get deleted anyway; but it causes severe memory leakage during
creation or filling of large exclusion-constraint indexes.
Also, document that amendscan is supposed to free what ambeginscan
allocates. The docs' lack of clarity on that point probably caused this
bug to begin with. (There is discussion of changing that API spec going
forward, but I don't think it'd be appropriate for the back branches.)
Per report from Bruno Wolff. It's been like this since the beginning,
so back-patch to all active branches.
In HEAD, also fix an independent leak caused by commit 2a6368343
(allocating memory during spgrescan instead of spgbeginscan, which
might be all right if it got cleaned up, but it didn't). And do a bit
of code beautification on that commit, too.
Discussion: https://postgr.es/m/20181024012314.GA27428@wolff.to
This patch absorbs an upstream fix to "zic" for a recently-introduced
bug that made it output data that some 32-bit clients couldn't read.
Given the current source data, the bug only manifests in zones with
leap seconds, which we don't generate, so that there's no actual
change in our installed timezone data files from this. Still, in
case somebody uses our copy of "zic" to do something else, it seems
best to apply the fix promptly.
Also, update the README's notes about converting upstream code to
our conventions.
Modern versions of perl no longer include the current directory in the
perl searchpath, as it's insecure. Instead of adding the current
directory, we get around the problem by adding the directory where the
script lives.
Problem noted by Victor Wagner.
Solution adapted from buildfarm client code.
Backpatch to all live versions.
On Windows this mean that the regression tests can now safely and
successfully run as Administrator, which is useful in situations like
Appveyor. Elsewhere it's a no-op.
Backpatch to 9.5 - this is harder in earlier branches and not worth the
trouble.
Discussion: https://postgr.es/m/650b0c29-9578-8571-b1d2-550d7f89f307@2ndQuadrant.com
PQnotifies() is defined to just process already-read data, not try to read
any more from the socket. (This is a debatable decision, perhaps, but I'm
hesitant to change longstanding library behavior.) The documentation has
long recommended calling PQconsumeInput() before PQnotifies() to ensure
that any already-arrived message would get absorbed and processed.
However, psql did not get that memo, which explains why it's not very
reliable about reporting notifications promptly.
Also, most (not quite all) callers called PQconsumeInput() just once before
a PQnotifies() loop. Taking this recommendation seriously implies that we
should do PQconsumeInput() before each call. This is more important now
that we have "payload" strings in notification messages than it was before;
that increases the probability of having more than one packet's worth
of notify messages. Hence, adjust code as well as documentation examples
to do it like that.
Back-patch to 9.5 to match related server fixes. In principle we could
probably go back further with these changes, but given lack of field
complaints I doubt it's worthwhile.
Discussion: https://postgr.es/m/CAOYf6ec-TmRYjKBXLLaGaB-jrd=mjG1Hzn1a1wufUAR39PQYhw@mail.gmail.com
Commit 4f85fde8e introduced some code that was meant to ensure that we'd
process cancel, die, sinval catchup, and notify interrupts while waiting
for client input. But there was a flaw: it supposed that the process
latch would be set upon arrival at secure_read() if any such interrupt
was pending. In reality, we might well have cleared the process latch
at some earlier point while those flags remained set -- particularly
notifyInterruptPending, which can't be handled as long as we're within
a transaction.
To fix the NOTIFY case, also attempt to process signals (except
ProcDiePending) before trying to read.
Also, if we see that ProcDiePending is set before we read, forcibly set the
process latch to ensure that we will handle that signal promptly if no data
is available. I also made it set the process latch on the way out, in case
there is similar logic elsewhere. (It remains true that we won't service
ProcDiePending here unless we need to wait for input.)
The code for handling ProcDiePending during a write needs those changes,
too.
Also be a little more careful about when to reset whereToSendOutput,
and improve related comments.
Back-patch to 9.5 where this code was added. I'm not entirely convinced
that older branches don't have similar issues, but the complaint at hand
is just about the >= 9.5 code.
Jeff Janes and Tom Lane
Discussion: https://postgr.es/m/CAOYf6ec-TmRYjKBXLLaGaB-jrd=mjG1Hzn1a1wufUAR39PQYhw@mail.gmail.com
About half of this is purely cosmetic changes to reduce the diff between
our code and theirs, like inserting "const" markers where they have them.
The other half is tracking actual code changes in zic.c and localtime.c.
I don't think any of these represent near-term compatibility hazards, but
it seems best to stay up to date.
I also fixed longstanding bugs in our code for producing the
known_abbrevs.txt list, which by chance hadn't been exposed before,
but which resulted in some garbage output after applying the upstream
changes in zic.c. Notably, because upstream removed their old phony
transitions at the Big Bang, it's now necessary to cope with TZif files
containing no DST transition times at all.
DST law changes in Chile, Fiji, and Russia (Volgograd).
Historical corrections for China, Japan, Macau, and North Korea.
Note: like the previous tzdata update, this involves a depressingly
large amount of semantically-meaningless churn in tzdata.zi. That
is a consequence of upstream's data compression method assigning
unstable abbreviations to DST rulesets. I complained about that
to them last time, and this version now uses an assignment method
that pays some heed to not changing abbreviations unnecessarily.
So hopefully, that'll be better going forward.
Mixed-case names for transition tables weren't dumped correctly.
Oversight in commit 8c48375e5, per bug #15440 from Karl Czajkowski.
In passing, I couldn't resist a bit of code beautification.
Back-patch to v10 where this was introduced.
Discussion: https://postgr.es/m/15440-02d1468e94d63d76@postgresql.org
To avoid the sorts of problems complained of by Jakob Egger, it'd be
best if configure didn't emit any references to the sysroot path at all.
In the case of PL/Tcl, we can do that just by keeping our hands off the
TCL_INCLUDE_SPEC string altogether. In the case of PL/Perl, we need to
substitute -iwithsysroot for -I in the compile commands, which is easily
handled if we change to using a configure output variable that includes
the switch not only the directory name. Since PL/Tcl and PL/Python
already do it like that, this seems like good consistency cleanup anyway.
Hence, this replaces the advice given to Perl-related extensions in commit
5e2217131; instead of writing "-I$(perl_archlibexp)/CORE", they should
just write "$(perl_includespec)". (The old way continues to work, but not
on recent macOS.)
It's still the case that configure needs to be aware of the sysroot
path internally, but that's cleaner than what we had before.
As before, back-patch to all supported versions.
Discussion: https://postgr.es/m/20840.1537850987@sss.pgh.pa.us
If the lock wait query failed, isolationtester would report the
PQerrorMessage from some other connection, meaning there would be
no message or an unrelated one. This seems like a pretty unlikely
occurrence, but if it did happen, this bug could make it really
difficult/confusing to figure out what happened. That seems to
justify patching all the way back.
In passing, clean up another place where the "wrong" conn was used
for an error report. That one's not actually buggy because it's
a different alias for the same connection, but it's still confusing
to the reader.
In the IANA timezone code, tzparse() always tries to load the zone
file named by TZDEFRULES ("posixrules"). Previously, we'd hacked
that logic to skip the load in the "lastditch" code path, which we use
only to initialize the default "GMT" zone during GUC initialization.
That's critical for a couple of reasons: since we do not support leap
seconds, we *must not* allow "GMT" to have leap seconds, and since this
case runs before the GUC subsystem is fully alive, we'd really rather
not take the risk of pg_open_tzfile throwing any errors.
However, that still left the code reading TZDEFRULES on every other
call, something we'd noticed to the extent of having added code to cache
the result so it was only done once per process not a lot of times.
Andres Freund complained about the static data space used up for the
cache; but as long as the logic was like this, there was no point in
trying to get rid of that space.
We can improve matters by looking a bit more closely at what the IANA
code actually needs the TZDEFRULES data for. One thing it does is
that if "posixrules" is a leap-second-aware zone, the leap-second
behavior will be absorbed into every POSIX-style zone specification.
However, that's a behavior we'd really prefer to do without, since
for our purposes the end effect is to render every POSIX-style zone
name unsupported. Otherwise, the TZDEFRULES data is used only if
the POSIX zone name specifies DST but doesn't include a transition
date rule (e.g., "EST5EDT" rather than "EST5EDT,M3.2.0,M11.1.0").
That is a minority case for our purposes --- in particular, it
never happens when tzload() invokes tzparse() to interpret a
transition date rule string found in a tzdata zone file.
Hence, if we legislate that we're going to ignore leap-second data
from "posixrules", we can postpone the TZDEFRULES load into the path
where we actually need to substitute for a missing date rule string.
That means it will never happen at all in common scenarios, making it
reasonable to dynamically allocate the cache space when it does happen.
Even when the data is already loaded, this saves some cycles in the
common code path since we avoid a memcpy of 23KB or so. And, IMO at
least, this is a less ugly hack on the IANA logic than what we had
before, since it's not messing with the lastditch-vs-regular code paths.
Back-patch to all supported branches, not so much because this is a
critical change as that I want to keep all our copies of the IANA
timezone code in sync.
Discussion: https://postgr.es/m/20181015200754.7y7zfuzsoux2c4ya@alap3.anarazel.de
Rethink the solution applied in commit 5e2217131 to get PL/Tcl to
build on macOS Mojave. I feared that adding -isysroot globally might
have undesirable consequences, and sure enough Jakob Egger reported
one: it complicates building extensions with a different Xcode version
than was used for the core server. (I find that a risky proposition
in general, but apparently it works most of the time, so we shouldn't
break it if we don't have to.)
We'd already adopted the solution for PL/Perl of inserting the sysroot
path directly into the -I switches used to find Perl's headers, and we
can do the same thing for PL/Tcl by changing the -iwithsysroot switch
that Apple's tclConfig.sh reports. This restricts the risks to PL/Perl
and PL/Tcl themselves and directly-dependent extensions, which is a lot
more pleasing in general than a global -isysroot switch.
Along the way, tighten the test to see if we need to inject the sysroot
path into $perl_includedir, as I'd speculated about upthread but not
gotten round to doing.
As before, back-patch to all supported versions.
Discussion: https://postgr.es/m/20840.1537850987@sss.pgh.pa.us
We created a temp table, then switched to a new session, leaving
the old session to clean up its temp objects in background. If that
took long enough, the eventual attempt to drop the user that owns
the temp table could fail, as exhibited today by sidewinder.
Fix by dropping the temp table explicitly when we're done with it.
It's been like this for quite some time, so back-patch to all
supported branches.
Report: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2018-10-16%2014%3A45%3A00
Reporting only the stderr is unhelpful when the problem is that the
server output we're getting doesn't match what was expected. So we
should report the query output too; and just for good measure, let's
print the query we used and the output we expected.
Back-patch to 9.5 where poll_query_until was introduced.
Discussion: https://postgr.es/m/17913.1539634756@sss.pgh.pa.us
Commit b5febc1d1 added a contrib/btree_gist test case that has been
observed to fail in the buildfarm as a result of background auto-analyze
updating stats and changing the selected plan. Forestall that by
forcibly analyzing in foreground, instead. The new plan choice is
just as good for our purposes, since we really only care that an
index-only plan does not get selected.
Back-patch to 9.5, like the previous patch.
Discussion: https://postgr.es/m/14643.1539629304@sss.pgh.pa.us
localtime.c's "struct state" is a rather large object, ~23KB. We were
statically allocating one for gmtsub() to use to represent the GMT
timezone, even though that function is not at all heavily used and is
never reached in most backends. Let's malloc it on-demand, instead.
This does pose the question of how to handle a malloc failure, but
there's already a well-defined error report convention here, ie
set errno and return NULL.
We have but one caller of pg_gmtime in HEAD, and two in back branches,
neither of which were troubling to check for error. Make them do so.
The possible errors are sufficiently unlikely (out-of-range timestamp,
and now malloc failure) that I think elog() is adequate.
Back-patch to all supported branches to keep our copies of the IANA
timezone code in sync. This particular change is in a stanza that
already differs from upstream, so it's a wash for maintenance purposes
--- but only as long as we keep the branches the same.
Discussion: https://postgr.es/m/20181015200754.7y7zfuzsoux2c4ya@alap3.anarazel.de
ProcessUtility can recurse, and indeed can be driven to infinite
recursion, so it ought to have a check_stack_depth() call. This
covers the reported bug (portal trying to execute itself) and a bunch
of other cases that could perhaps arise somewhere.
Per bug #15428 from Malthe Borch. Back-patch to all supported branches.
Discussion: https://postgr.es/m/15428-b3c2915ec470b033@postgresql.org
On a primary, sets of XLOG_RUNNING_XACTS records are generated on a
periodic basis to allow recovery to build the initial state of
transactions for a hot standby. The set of transaction IDs is created
by scanning all the entries in ProcArray. However it happens that its
logic never counted on the fact that two-phase transactions finishing to
prepare can put ProcArray in a state where there are two entries with
the same transaction ID, one for the initial transaction which gets
cleared when prepare finishes, and a second, dummy, entry to track that
the transaction is still running after prepare finishes. This way
ensures a continuous presence of the transaction so as callers of for
example TransactionIdIsInProgress() are always able to see it as alive.
So, if a XLOG_RUNNING_XACTS takes a standby snapshot while a two-phase
transaction finishes to prepare, the record can finish with duplicated
XIDs, which is a state expected by design. If this record gets applied
on a standby to initial its recovery state, then it would simply fail,
so the odds of facing this failure are very low in practice. It would
be tempting to change the generation of XLOG_RUNNING_XACTS so as
duplicates are removed on the source, but this requires to hold on
ProcArrayLock for longer and this would impact all workloads,
particularly those using heavily two-phase transactions.
XLOG_RUNNING_XACTS is also actually used only to initialize the standby
state at recovery, so instead the solution is taken to discard
duplicates when applying the initial snapshot.
Diagnosed-by: Konstantin Knizhnik
Author: Michael Paquier
Discussion: https://postgr.es/m/0c96b653-4696-d4b4-6b5d-78143175d113@postgrespro.ru
Backpatch-through: 9.3
In the back branches, drop these tables after the regression tests are
done with them. This fixes failures of cross-branch pg_upgrade testing
caused by these types having been removed in v12. We do lose the ability
to test dump/restore behavior with these types in the back branches, but
the actual loss of code coverage seems to be nil given that there's nothing
very special about these types.
Discussion: https://postgr.es/m/20181009192237.34wjp3nmw7oynmmr@alap3.anarazel.de
Repeatedly rewriting a mapped catalog table with VACUUM FULL or
CLUSTER could cause logical decoding to fail with:
ERROR, "could not map filenode \"%s\" to relation OID"
To trigger the problem the rewritten catalog had to have live tuples
with toasted columns.
The problem was triggered as during catalog table rewrites the
heap_insert() check that prevents logical decoding information to be
emitted for system catalogs, failed to treat the new heap's toast table
as a system catalog (because the new heap is not recognized as a
catalog table via RelationIsLogicallyLogged()). The relmapper, in
contrast to the normal catalog contents, does not contain historical
information. After a single rewrite of a mapped table the new relation
is known to the relmapper, but if the table is rewritten twice before
logical decoding occurs, the relfilenode cannot be mapped to a
relation anymore. Which then leads us to error out. This only
happens for toast tables, because the main table contents aren't
re-inserted with heap_insert().
The fix is simple, add a new heap_insert() flag that prevents logical
decoding information from being emitted, and accept during decoding
that there might not be tuple data for toast tables.
Unfortunately that does not fix pre-existing logical decoding
errors. Doing so would require not throwing an error when a filenode
cannot be mapped to a relation during decoding, and that seems too
likely to hide bugs. If it's crucial to fix decoding for an existing
slot, temporarily changing the ERROR in ReorderBufferCommit() to a
WARNING appears to be the best fix.
Author: Andres Freund
Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Backpatch: 9.4-, where logical decoding was introduced