The original purpose of dry-run mode was to be able to print all the
possible permutations from a spec file, but it has become less useful
now that isolation testing's deadlock detection has improved: a step
not wanted by the author can now block indefinitely, whereas originally
such a blocked step would have been detected rather quickly. Per
discussion, let's remove it.
This is a backpatch of 9903338 for 9.6~12. Having it on those branches
is proving useful, as it keeps the code consistent across all supported
versions and improves the output generated by isolationtester.
Author: Michael Paquier
Reviewed-by: Asim Praveen, Melanie Plageman
Discussion: https://postgr.es/m/20190819080820.GG18166@paquier.xyz
Discussion: https://postgr.es/m/794820.1623872009@sss.pgh.pa.us
Backpatch-through: 9.6
isolationtester.c had a hard-wired limit of 3 minutes per test step.
It now emerges that this isn't quite enough for some of the slowest
buildfarm animals. This isn't the first time we've had to raise
this limit (cf. 1db439ad4), so let's make it configurable. This
patch raises the default to 5 minutes, and introduces an environment
variable PGISOLATIONTIMEOUT that can be set if more time is needed,
following the precedent of PGCTLTIMEOUT.
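For illustration, the lookup can be as small as the sketch below; the
environment-variable name and 300-second default match this patch, while
the helper name and surrounding code are only illustrative, not the code
in isolationtester.c.

    #include <stdlib.h>
    #include <stdio.h>

    /*
     * Illustrative helper: return the per-step timeout in seconds, taking
     * PGISOLATIONTIMEOUT if it is set to a positive integer and falling
     * back to the 5-minute default otherwise.
     */
    static int
    step_timeout_secs(void)
    {
        const char *env = getenv("PGISOLATIONTIMEOUT");

        if (env != NULL)
        {
            int         val = atoi(env);

            if (val > 0)
                return val;
        }
        return 300;             /* default: 5 minutes */
    }

    int
    main(void)
    {
        printf("using a %d second timeout per step\n", step_timeout_secs());
        return 0;
    }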
Also, modify isolationtester so that when the timeout is hit,
it explicitly reports having sent a cancel. This makes the regression
failure log considerably more intelligible. (In the worst case, a
timed-out test might actually be reported as "passing" without this
extra output, so arguably this is a bug fix in itself.)
In passing, update the README file, which had apparently not gotten
touched when we added "make check" support here.
Back-patch to 9.6; older versions don't have comparable timeout logic.
Discussion: https://postgr.es/m/22964.1575842935@sss.pgh.pa.us
Back-patch commit b10f40bf0 into older branches. This adds reporting
of NOTIFY messages to isolationtester.c, and extends the async-notify
test to include direct tests of basic NOTIFY functionality.
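The libpq side of such reporting follows the standard PQnotifies()
pattern; a minimal sketch (the real code interleaves this with normal
result processing, and the function name here is just for illustration):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Drain and print any pending NOTIFY messages on one connection. */
    static void
    report_notifies(PGconn *conn, const char *session_name)
    {
        PGnotify   *notify;

        if (!PQconsumeInput(conn))      /* pull pending data off the socket */
            fprintf(stderr, "%s: %s", session_name, PQerrorMessage(conn));

        while ((notify = PQnotifies(conn)) != NULL)
        {
            printf("%s: received NOTIFY \"%s\" with payload \"%s\" from PID %d\n",
                   session_name, notify->relname, notify->extra, notify->be_pid);
            PQfreemem(notify);
        }
    }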
This provides useful infrastructure for testing a bug fix I'm about
to back-patch, and there seems no good reason not to have better tests
of LISTEN/NOTIFY in the back branches. The commit's survived long
enough in HEAD to make it unlikely that it will cause problems.
Back-patch as far as 9.6. isolationtester.c changed too much in 9.6
to make it sane to try to fix older branches this way, and I don't
really want to back-patch those changes too.
Discussion: https://postgr.es/m/31304.1564246011@sss.pgh.pa.us
Back-patch relevant parts of these commits:
30717637c Fix isolationtester race condition for notices sent before blocking
ebd499282 Don't drop NOTICE messages in isolation tests
a28e10e82 Indicate session name in isolationtester notices
This ensures that older versions of the isolationtester will handle
NOTICE/WARNING messages the same way as HEAD and v12 do. While this
isn't fixing any critical problem right now, it seems like a prudent
change to prevent surprises (like we had yesterday...) with
back-patches of future isolation test changes.
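One way to implement this with libpq is a notice receiver; a rough
sketch of tagging each NOTICE/WARNING with the session it arrived on
(names and output format here are illustrative, not the exact ones used):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Notice receiver that prefixes each NOTICE/WARNING with a session name. */
    static void
    session_notice_receiver(void *arg, const PGresult *res)
    {
        const char *session_name = (const char *) arg;
        const char *severity = PQresultErrorField(res, PG_DIAG_SEVERITY);
        const char *message = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);

        printf("%s: %s:  %s\n",
               session_name,
               severity ? severity : "NOTICE",
               message ? message : "(no message available)");
    }

    /* Installed once per connection, right after PQconnectdb() succeeds: */
    /*   PQsetNoticeReceiver(conn, session_notice_receiver, (void *) name); */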
Back-patch as far as 9.6. Due to the significant changes we made in
isolationtester in 9.6, back-patching isolation tests further than
that is going to be risky anyway; besides, this patch doesn't apply
cleanly before that.
Discussion: https://postgr.es/m/E1i7IqC-0000Uc-5H@gemulon.postgresql.org
Slow runs of buildfarm members chipmunk, hornet and mandrill saw the
shorter timeouts expire. The 180s timeout in poll_query_until has been
trouble-free since 2a0f89cd717ce6d49cdc47850577823682167e87 introduced
it two years ago, so use 180s more widely. Back-patch to 9.6, where the
first of these timeouts was introduced.
Reviewed by Michael Paquier.
Discussion: https://postgr.es/m/20181209001601.GC2973271@rfd.leadboat.com
If the lock wait query failed, isolationtester would report the
PQerrorMessage from some other connection, meaning there would be
no message or an unrelated one. This seems like a pretty unlikely
occurrence, but if it did happen, this bug could make it really
difficult/confusing to figure out what happened. That seems to
justify patching all the way back.
In passing, clean up another place where the "wrong" conn was used
for an error report. That one's not actually buggy because it's
a different alias for the same connection, but it's still confusing
to the reader.
Doing this suppresses Coverity warnings and might allow improved
code in some cases. The prospects of that are not so bright as
to warrant back-patching, though.
Michael Paquier, per Coverity
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
Commit 4deb41381 modified isolationtester's query to see whether a
session is blocked to also check for waits occurring in GetSafeSnapshot.
However, it did that in a way that enormously increased the query's
runtime under CLOBBER_CACHE_ALWAYS, causing the buildfarm members
that use that to run about four times slower than before, and in some
cases fail entirely. To fix, push the entire logic into a dedicated
backend function. This should actually reduce the CLOBBER_CACHE_ALWAYS
runtime from what it was previously, though I've not checked that.
In passing, expose a SQL function to check for safe-snapshot blockage,
comparable to pg_blocking_pids. This is more or less free given the
infrastructure built to solve the other problem, so we might as well.
Thomas Munro
Discussion: https://postgr.es/m/20170407165749.pstcakbc637opkax@alap3.anarazel.de
This improves code coverage and lays a foundation for testing
similar issues in a distributed environment.
Author: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
c.h #includes a number of core libc header files, such as <stdio.h>.
There's no point in re-including these after having read postgres.h,
postgres_fe.h, or c.h; so remove code that did so.
While at it, also fix some places that were ignoring our standard pattern
of "include postgres[_fe].h, then system header files, then other Postgres
header files". While there's not any great magic in doing it that way
rather than system headers last, it's silly to have just a few files
deviating from the general pattern. (But I didn't attempt to enforce this
globally, only in files I was touching anyway.)
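For a frontend file, the pattern looks schematically like this (a
skeleton only, not a file from the tree):

    /* 1. The appropriate umbrella header comes first ... */
    #include "postgres_fe.h"        /* "postgres.h" for backend code */

    /* 2. ... then system headers that c.h does not already pull in ... */
    #include <sys/stat.h>
    #include <unistd.h>

    /* 3. ... then other Postgres headers. */
    #include "libpq-fe.h"
    #include "pqexpbuffer.h"

    /*
     * <stdio.h>, <stdlib.h>, <string.h> and friends need not be re-included:
     * c.h has already read them.
     */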
I'd be the first to say that this is mostly compulsive neatnik-ism,
but over time it might save enough compile cycles to be useful.
Where possible, use palloc or pg_malloc instead; otherwise, insert
explicit NULL checks.
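The two shapes of the fix look roughly like the sketch below; pg_malloc
and friends exit on allocation failure, and the stand-in here is local
so the example compiles on its own.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Local stand-in for pg_malloc(): allocate or exit, never return NULL. */
    static void *
    xmalloc(size_t size)
    {
        void       *p = malloc(size);

        if (p == NULL)
        {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
        return p;
    }

    int
    main(void)
    {
        /* Preferred form: the error-exiting wrapper. */
        char       *buf = xmalloc(64);

        /* Otherwise: keep the bare call, but check its result explicitly. */
        char       *copy = strdup("some value");

        if (copy == NULL)
        {
            fprintf(stderr, "out of memory\n");
            return 1;
        }
        snprintf(buf, 64, "copied: %s", copy);
        puts(buf);
        free(copy);
        free(buf);
        return 0;
    }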
Generally speaking, these are places where an actual OOM is quite
unlikely, either because they're in client programs that don't
allocate all that much, or they're very early in process startup
so that we'd likely have had a fork() failure instead. Hence,
no back-patch, even though this is nominally a bug fix.
Michael Paquier, with some adjustments by me
Discussion: <CAB7nPqRu07Ot6iht9i9KRfYLpDaF2ZuUv5y_+72uP23ZAGysRg@mail.gmail.com>
The previous coding here was formally undefined, though it seems to
accidentally work on most platforms in the buildfarm. Caught on some
OpenBSD platforms, whose libc contains an assertion check for
overlapping areas passed to memcpy().
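The usual remedy, and presumably the one applied here, is memmove(),
which is defined for overlapping regions; a small self-contained
demonstration:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        char        buf[] = "step1 step2 step3";

        /*
         * Drop the first element by shifting the remainder down.  Source and
         * destination overlap, so memcpy() would be undefined behavior here;
         * memmove() handles the overlap correctly.
         */
        memmove(buf, buf + 6, strlen(buf + 6) + 1);
        puts(buf);              /* prints "step2 step3" */
        return 0;
    }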
Thomas Munro
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether. (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)
The new function has the following behaviors that are painful or impossible
to get right via pg_locks:
1. Correctly understands which lock modes block which other ones.
2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.
3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.
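As a quick illustration of consuming the function from a client, a
pared-down libpq sketch (connection settings come from the environment,
and error handling is minimal):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(int argc, char **argv)
    {
        const char *target_pid = (argc > 1) ? argv[1] : "12345";
        const char *params[1] = {target_pid};
        PGconn     *conn = PQconnectdb("");     /* use PG* environment defaults */
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* One row per PID blocking the given session; no rows if not blocked. */
        res = PQexecParams(conn,
                           "SELECT unnest(pg_blocking_pids($1::int))",
                           1, NULL, params, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
                printf("blocked by PID %s\n", PQgetvalue(res, i, 0));
        }
        else
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

        PQclear(res);
        PQfinish(conn);
        return 0;
    }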
The motivation for doing this right now is mostly to fix the isolation
tests. Commit 38f8bdcac4982215beb9f65a19debecaf22fd470 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly. But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions. Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
This mostly reverts commit 9c9782f066e0ce5424b8706df2cce147cb78170f.
I left in the parts that rearranged removal of completed waiting steps;
but the idea of not rechecking a step's blocked-ness isn't working.
If we're retrying a step, then we already decided it was blocked on a lock,
and there's no need to recheck that. The original coding of commit
38f8bdcac4982215beb9f65a19debecaf22fd470 resulted in a large number of
is-it-waiting queries when dealing with multiple concurrently-blocked
sessions, which is fairly pointless and also results in test failures in
CLOBBER_CACHE_ALWAYS builds, where the is-it-waiting query is quite slow.
This definition also permits appending pg_sleep() calls to steps where it's
needed to control the order of finish of concurrent steps. Before, that
did not work nicely because we'd decide that a step performing a sleep was
not blocked and hang up waiting for it to finish, rather than noticing the
completion of the concurrent step we're supposed to notice first.
In passing, revise handling of removal of completed waiting steps
to make it a bit less messy.
Fix a few oversights in 38f8bdcac4982215beb9f65a19debecaf22fd470:
don't leak memory in run_permutation(), remember when we've issued
a cancel rather than issuing another one every 10ms,
fix some typos in comments.
This allows testing of deadlock scenarios. Scenarios that would
previously have been considered invalid are now simply taken as a
scenario in which more than one backend will wait.
We used to have externs for getopt() and its API variables scattered
all over the place. Now that we find we're going to need to tweak the
variable declarations for Cygwin, it seems like a good idea to have
just one place to tweak.
In this commit, the variables are declared "#ifndef HAVE_GETOPT_H".
That may or may not work everywhere, but we'll soon find out.
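The consolidated declarations amount to something like the following
sketch (not the header verbatim):

    /*
     * If the platform supplies <getopt.h>, it already declares these;
     * otherwise, provide the externs ourselves, in exactly one place.
     */
    #ifndef HAVE_GETOPT_H
    extern char *optarg;
    extern int  optind;
    extern int  opterr;
    extern int  optopt;
    #endif                          /* !HAVE_GETOPT_H */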
Andres Freund
This ensures that all stdout output is flushed immediately, to match
stderr. This eliminates the need for fflush(stdout) calls sprinkled all
over the place.
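One standard way to get that effect is to disable stdout buffering once
at startup; a sketch (not necessarily the exact call used):

    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Make stdout unbuffered, like stderr, so every line appears
         * immediately without needing fflush(stdout) after each write.
         */
        setvbuf(stdout, NULL, _IONBF, 0);

        printf("this appears right away\n");
        return 0;
    }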
Per Daniel Wood in message 519A79C6.90308@salesforce.com
Add asprintf(), pg_asprintf(), and psprintf() to simplify string
allocation and composition. Replacement implementations taken from
NetBSD.
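For flavor, plain asprintf() already shows the core convenience;
pg_asprintf() and psprintf() add error-on-failure behavior and, in
psprintf's case, return the new string directly. The sketch below uses
glibc's asprintf(), hence the _GNU_SOURCE define.

    #define _GNU_SOURCE             /* for asprintf() on glibc */
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        char       *msg;

        /*
         * asprintf() formats into a freshly allocated buffer of exactly the
         * right size, so the caller never has to guess at buffer lengths.
         */
        if (asprintf(&msg, "permutation %d of %d", 3, 24) < 0)
        {
            fprintf(stderr, "out of memory\n");
            return 1;
        }
        puts(msg);
        free(msg);
        return 0;
    }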
Reviewed-by: Álvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Asif Naeem <anaeem.it@gmail.com>
Previously, isolationtester would forbid returning tuples in
session-specific teardown (but not global teardown), as well as in
global setup. Allow these places to return tuples, too.
Notice and complain about PQcancel() failures. Also, don't dump core if
an error PGresult doesn't contain severity and message subfields, as it
might not if it was generated by libpq itself. (We have a longstanding
TODO item to improve that, but in the meantime isolationtester had better
cope.)
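Both points boil down to defensive libpq usage; a rough sketch (helper
name and messages are illustrative):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Cancel a query, complaining on failure, then report an error result. */
    static void
    cancel_and_report(PGconn *conn, const PGresult *errres)
    {
        PGcancel   *cancel = PQgetCancel(conn);
        char        errbuf[256];

        if (cancel == NULL || !PQcancel(cancel, errbuf, sizeof(errbuf)))
            fprintf(stderr, "could not send cancel request: %s\n",
                    cancel ? errbuf : "out of memory");
        if (cancel)
            PQfreeCancel(cancel);

        /*
         * Error results generated by libpq itself may lack the severity and
         * primary-message fields, so cope with NULLs rather than crashing.
         */
        if (errres != NULL)
        {
            const char *sev = PQresultErrorField(errres, PG_DIAG_SEVERITY);
            const char *msg = PQresultErrorField(errres, PG_DIAG_MESSAGE_PRIMARY);

            fprintf(stderr, "%s: %s\n",
                    sev ? sev : "ERROR",
                    msg ? msg : PQresultErrorMessage(errres));
        }
    }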
I tripped across the latter item while investigating a trouble report on
buildfarm member spoonbill. As for the former, there's no evidence that
PQcancel failure is actually involved in spoonbill's problem, but it still
seems like a bad idea to ignore an error return code.
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This also causes additional WAL-logging (there was previously a single
WAL record for a locked tuple; now there is one for each updated copy
of the tuple that exists).
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, an old misbehavior has been fixed: previously, if a
subtransaction grabbed a stronger tuple lock than the parent
(sub)transaction held on a given tuple and later aborted, the weaker
lock was lost.
Many new spec files were added for the isolation tester framework, to
ensure overall behavior is sane. There's probably room for several more
tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time on it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
1290721684-sup-3951@alvh.no-ip.org
1294953201-sup-2099@alvh.no-ip.org
1320343602-sup-2290@alvh.no-ip.org
1339690386-sup-8927@alvh.no-ip.org
4FE5FF020200002500048A3D@gw.wicourts.gov
4FEAB90A0200002500048B7D@gw.wicourts.gov
Each setup block is run as a single PQexec submission, and some
statements such as VACUUM cannot be combined with others in such a
block.
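The restriction comes straight from libpq/server behavior: all
statements in a single PQexec() string run inside one implicit
transaction block, and VACUUM refuses to run inside a transaction block.
A small sketch (connection settings from the environment):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");     /* rely on PG* environment vars */
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        res = PQexec(conn, "CREATE TABLE IF NOT EXISTS iso_demo (i int)");
        PQclear(res);

        /*
         * Both statements travel in one PQexec() call, so they execute inside
         * a single implicit transaction block; the VACUUM therefore fails
         * with "VACUUM cannot run inside a transaction block".
         */
        res = PQexec(conn, "INSERT INTO iso_demo VALUES (1); VACUUM iso_demo");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "combined submission: %s", PQerrorMessage(conn));
        PQclear(res);

        /* Submitted on its own, the same VACUUM succeeds. */
        res = PQexec(conn, "VACUUM iso_demo");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "standalone VACUUM: %s", PQerrorMessage(conn));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }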
Backpatch to 9.2.
Kevin Grittner and Tom Lane
Much more could be done here, but at least now we have *some* automated
test coverage of that mechanism. In particular this tests the writable-CTE
case reported by Phil Sorber.
In passing, remove isolationtester's arbitrary restriction on the number of
steps in a permutation list. I used this so that a single spec file could
be used to run several related test scenarios, but there are other possible
reasons to want a step series that's not exactly a permutation. Improve
documentation and fix a couple other nits as well.
isolationtester is now able to continue running other permutations when
it detects that one of them is invalid, which is useful during initial
development of spec files.
Author: Alexander Shulgin
A permutation that specifies more steps than are defined causes
isolationtester to crash, so avoid that. Using fewer steps than defined
should probably not be a problem, but no spec currently does that.
I broke it in a previous commit because I neglected to install the
necessary incantations to have getopt() work on Windows.
Per red blots in buildfarm.
This mode prints out the permutations that would be run by the given
spec file, in the same format used by the permutation lines in spec
files. This helps in building new spec files.
Author: Alexander Shulgin, with some tweaks by me
We now report errors raised by the just-unblocked and the unblocking
transactions identically; this should fix the relatively common
buildfarm failures in which an animal reports the error in the "wrong"
session.
The new test exercises all permutations of the order of begin, prepare,
and commit of three concurrent transactions that have conflicts between
them.
The test runs for quite a long time, and the expected output file is
huge, but it caught some serious bugs during development, so it seems
worthwhile to keep. The test uses prepared transactions, so it fails if the
server has max_prepared_transactions=0. Because of that, it's marked as
"ignore" in the schedule file.
Dan Ports
Noah Misch diagnosed the buildfarm problems in the isolation tests
partly as a failure to differentiate backends properly; the old code was
using backend IDs, which is not good enough because a new backend might
reuse an already-used ID. Use PIDs instead.
Also, the code was purposely careless about other concurrent activity,
because none is expected; and indeed, in the vast majority of cases it
causes no trouble. However, autovacuum can be observed to hold locks on
tables for long enough to cause sporadic failures. The new code
accounts for that by ignoring locks held by processes not explicitly
declared in our spec file.
Author: Noah Misch
This enables us to test that blocking commands (such as foreign key
checks that conflict with some other lock) act as intended. The set of
tests that this adds is pretty minimal, but can easily be extended by
adding new specs.
The intention is that this will serve as a basis for ensuring that
further tweaks of locking implementation preserve (or improve) existing
behavior.
Author: Noah Misch