Doing so doesn't seem to be within the purpose of the per user
connection limits, and has particularly unfortunate effects in
conjunction with parallel queries.
Backpatch to 9.6 where parallel queries were introduced.
David Rowley, reviewed by Robert Haas and Albe Laurenz.
Commit 4aec49899e reorganized the order
of operations here so that we no longer increment the number of "extra
waits" before locking the semaphore, but it did not change the
starting value of extraWaits from 0 to -1 to compensate. In the worst
case, this could leak a semaphore count, but that seems to be unlikely
in practice.
Discussion: http://postgr.es/m/CAA4eK1JyVqXiMba+-a589Rk0pyHsyKkGxeumVKjU6Y74hdrVLQ@mail.gmail.com
Amit Kapila, per an off-list report by Dilip Kumar. Reviewed by me.
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche. So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached. For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset. This is arguably more useful,
and certainly a lot cheaper. Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.
The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".
The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query. There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously. So if we didn't get rid of this mechanism we'd
have to make it even more complicated. By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.
Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
Previously, the "sem" field of PGPROC varied in size depending on which
kernel semaphore API we were using. That was okay as long as there was
only one likely choice per platform, but in the wake of commit ecb0d20a9,
that assumption seems rather shaky. It doesn't seem out of the question
anymore that an extension compiled against one API choice might be loaded
into a postmaster built with another choice. Moreover, this prevents any
possibility of selecting the semaphore API at postmaster startup, which
might be something we want to do in future.
Hence, change PGPROC.sem to be PGSemaphore (i.e. a pointer) for all Unix
semaphore APIs, and turn the pointed-to data into an opaque struct whose
contents are only known within the responsible modules.
For the SysV and unnamed-POSIX APIs, the pointed-to data has to be
allocated elsewhere in shared memory, which takes a little bit of
rejiggering of the InitShmemAllocation code sequence. (I invented a
ShmemAllocUnlocked() function to make that a little cleaner than it used
to be. That function is not meant for any uses other than the ones it
has now, but it beats having InitShmemAllocation() know explicitly about
allocation of space for semaphores and spinlocks.) This change means an
extra indirection to access the semaphore data, but since we only touch
that when blocking or awakening a process, there shouldn't be any
meaningful performance penalty. Moreover, at least for the unnamed-POSIX
case on Linux, the sem_t type is quite a bit wider than a pointer, so this
reduces sizeof(PGPROC) which seems like a good thing.
For the named-POSIX API, there's effectively no change: the PGPROC.sem
field was and still is a pointer to something returned by sem_open() in
the postmaster's memory space. Document and check the pre-existing
limitation that this case can't work in EXEC_BACKEND mode.
It did not seem worth unifying the Windows semaphore ABI with the Unix
cases, since there's no likelihood of needing ABI compatibility much less
runtime switching across those cases. However, we can simplify the Windows
code a bit if we define PGSemaphore as being directly a HANDLE, rather than
pointer to HANDLE, so let's do that while we're here. (This also ends up
being no change in what's physically stored in PGPROC.sem. We're just
moving the HANDLE fetch from callees to callers.)
It would take a bunch of additional code shuffling to get to the point of
actually choosing a semaphore API at postmaster start, but the effects
of that would now be localized in the port/XXX_sema.c files, so it seems
like fit material for a separate patch. The need for it is unproven as
yet, anyhow, whereas the ABI risk to extensions seems real enough.
Discussion: https://postgr.es/m/4029.1481413370@sss.pgh.pa.us
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.
Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used. Just fill with dummy
values if we're not going to use them.
This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether. (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)
The new function has the following behaviors that are painful or impossible
to get right via pg_locks:
1. Correctly understands which lock modes block which other ones.
2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.
3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.
The motivation for doing this right now is mostly to fix the isolation
tests. Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly. But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions. Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
Commit 0e141c0fbb introduced a new
facility to reduce ProcArrayLock contention by clearing several XIDs
from the ProcArray under a single lock acquisition. The names
initially chosen were deemed not to be very good choices, so commit
4aec49899e renamed them. But now it
seems like we still didn't get it right. A pending patch wants to
add similar infrastructure for batching CLOG updates, so the names
need to be clear enough to allow a new set of structure members with
a related purpose.
Amit Kapila
Previously, each PGPROC's backendLock was part of the main tranche,
and the PGPROC just contained a pointer. Now, the actual LWLock is
part of the PGPROC.
As with previous, similar patches, this makes it significantly easier
to identify these lwlocks in LWLOCK_STATS or Trace_lwlocks output
and improves modularity.
Author: Ildus Kurbangaliev
Reviewed-by: Amit Kapila, Robert Haas
Prior to commit 0709b7ee72, access to
variables within a spinlock-protected critical section had to be done
through a volatile pointer, but that should no longer be necessary.
Michael Paquier
Post-commit review by Andres Freund discovered a couple of concurrency
bugs in the original patch: specifically, if the leader cleared a
follower's XID before it reached PGSemaphoreLock, the semaphore would be
left in the wrong state; and if another process did PGSemaphoreUnlock
for some unrelated reason, we might resume execution before the fact
that our XID was cleared was globally visible.
Also, improve the wording of some comments, rename nextClearXidElem
to firstClearXidElem in PROC_HDR for clarity, and drop some volatile
qualifiers that aren't necessary.
Amit Kapila, reviewed and slightly revised by me.
When a write transaction commits, it must clear its XID advertised via
the ProcArray, which requires that we hold ProcArrayLock in exclusive
mode in order to prevent concurrent processes running GetSnapshotData
from seeing inconsistent results. When many processes try to commit
at once, ProcArrayLock must change hands repeatedly, with each
concurrent process trying to commit waking up to acquire the lock in
turn. To make things more efficient, when more than one backend is
trying to commit a write transaction at the same time, have just one
of them acquire ProcArrayLock in exclusive mode and clear the XIDs of
all processes in the group. Benchmarking reveals that this is much
more efficient at very high client counts.
Amit Kapila, heavily revised by me, with some review also from Pavan
Deolasee.
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.
Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.
Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
This does four basic things. First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers. Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes. Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.
Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
Back in ed0b409, PGPROC was split and moved to static variables in
procarray.c, with procs in ProcArrayStruct replaced by an array of
integers representing process numbers (pgprocnos), with -1 indicating a
dead process which has yet to be removed. Access to procArray is
generally done under ProcArrayLock and therefore most code does not have
to concern itself with -1 entries.
However, MinimumActiveBackends intentionally does not take
ProcArrayLock, which means it has to be extra careful when accessing
procArray. Prior to ed0b409, this was handled by checking for a NULL
in the pointer array, but that check was no longer valid after the
split. Coverity pointed out that the check could never happen and so
it was removed in 5592eba. That didn't make anything worse, but it
didn't fix the issue either.
The correct fix is to check for pgprocno == -1 and skip over that entry
if it is encountered.
Back-patch to 9.2, since there can be attempts to access the arrays
prior to their start otherwise. Note that the changes prior to 9.4 will
look a bit different due to the change in 5592eba.
Note that MinimumActiveBackends only returns a bool for heuristic
purposes and any pre-array accesses are strictly read-only and so there
is no security implication and the lack of fields complaints indicates
it's very unlikely to run into issues due to this.
Pointed out by Noah.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.
Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.
Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables. The output format is controlled by a
so-called "output plugin"; an example is included. To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.
Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.
Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
Additional non-security issues/improvements spotted by Coverity.
In backend/libpq, no sense trying to protect against port->hba being
NULL after we've already dereferenced it in the switch() statement.
Prevent against possible overflow due to 32bit arithmitic in
basebackup throttling (not yet released, so no security concern).
Remove nonsensical check of array pointer against NULL in procarray.c,
looks to be a holdover from 9.1 and earlier when there were pointers
being used but now it's just an array.
Remove pointer check-against-NULL in tsearch/spell.c as we had already
dereferenced it above (in the strcmp()).
Remove dead code from adt/orderedsetaggs.c, isnull is checked
immediately after each tuplesort_getdatum() call and if true we return,
so no point checking it again down at the bottom.
Remove recently added minor error-condition memory leak in pg_regress.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts. Slots have some advantages over existing
techniques, as explained in the documentation.
In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.
Andres Freund and Robert Haas
These bugs can cause data loss on standbys started with hot_standby=on at
the moment they start to accept read only queries, by marking committed
transactions as uncommited. The likelihood of such corruptions is small
unless the primary has a high transaction rate.
5a031a5556 fixed bugs in HS's startup logic
by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state
was reached, missing the fact that both clog and subtrans are written to
before that. This only failed to fail in common cases because the usage
of ExtendCLOG in procarray.c was superflous since clog extensions are
actually WAL logged.
f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing
extensions of pg_subtrans due to the former commit's changes - which are
not WAL logged - by performing the extensions when switching to a state
> STANDBY_INITIALIZED and not performing xid assignments before that -
again missing the fact that ExtendCLOG is unneccessary - but screwed up
twice: Once because latestObservedXid wasn't updated anymore in that
state due to the earlier commit and once by having an off-by-one error in
the loop performing extensions. This means that whenever a
CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed
between the start of the checkpoint recovery started from and the first
xl_running_xact record old transactions commit bits in pg_clog could be
overwritten if they started and committed in that window.
Fix this mess by not performing ExtendCLOG() in HS at all anymore since
it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.
Analysis and patch by Andres Freund. Reported by Christophe Pettus.
Backpatch down to 9.0, like the previous commit that caused this.
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.
Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
Since commit f21bb9cfb5, this function
ignores the caller-provided length and loops until it finds a
terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the
previous loop control logic. In passing, revert the addition of an
unused variable by the same commit, presumably a debugging relic.
Rename PGXACT->inCommit flag into delayChkpt flag,
and generalise comments to allow use in other situations,
such as the forthcoming potential use in checksum patch.
Replace wait loop to look for VXIDs with delayChkpt set.
No user visible changes, not behaviour changes at present.
Simon Riggs, reviewed and rebased by Jeff Davis
Previously we stored all xids mixed together.
Now we store top-level xids first, followed
by all subxids. Also skip logging any subxids
if the snapshot is suboverflowed, since there
are potentially large numbers of them and they
are not useful in that case anyway. Has value
in the envisaged design for decoding of WAL.
No planned effect on Hot Standby.
Andres Freund, reviewed by me
Move discussion of why our algorithm for taking snapshots in recovery
to a more appropriate location in the function, and delete incorrect
mention of taking a lock.