Prior to commit 7882c3b0b9, it was
possible to use LWLocks within DSM segments, but that commit broke
this use case by switching from a doubly linked list to a circular
linked list. Switch back, using a new bit of general infrastructure
for maintaining lists of PGPROCs.
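A sketch of the idea behind such infrastructure: knit the list together
out of PGPROC array indexes (pgprocno) rather than pointers, so links
stay valid even when a DSM segment is mapped at different addresses in
different processes. (Field names in the spirit of the new proclist.h;
the sentinel convention is assumed.)

    typedef struct proclist_node
    {
        int         next;   /* pgprocno of next PGPROC, or sentinel */
        int         prev;   /* pgprocno of prev PGPROC, or sentinel */
    } proclist_node;

    typedef struct proclist_head
    {
        int         head;   /* pgprocno of first element */
        int         tail;   /* pgprocno of last element */
    } proclist_head;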
Thomas Munro, reviewed by me.
This coding pattern creates a race condition, because if an interesting
interrupt happens after we've checked InterruptPending but before we reset
our latch, the latch-setting done by the signal handler would get lost,
and then we might block at WaitLatch in the next iteration without ever
noticing the interrupt condition. You can put the CHECK_FOR_INTERRUPTS
before WaitLatch or after ResetLatch, but not between them.
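Schematically (signatures simplified):

    /* Unsafe: CHECK_FOR_INTERRUPTS between WaitLatch and ResetLatch */
    for (;;)
    {
        WaitLatch(MyLatch, WL_LATCH_SET, 0);
        CHECK_FOR_INTERRUPTS();   /* window opens after this check... */
        ResetLatch(MyLatch);      /* ...and a wakeup is lost here */
    }

    /* Safe: reset the latch first, then check for interrupts */
    for (;;)
    {
        WaitLatch(MyLatch, WL_LATCH_SET, 0);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
        /* ... do work ... */
    }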
Aside from fixing the bugs, add some explanatory comments to latch.h
to perhaps forestall the next person from making the same mistake.
In HEAD, also replace gather_readnext's direct call of
HandleParallelMessages with CHECK_FOR_INTERRUPTS. It does not seem clean
or useful for this one caller to bypass ProcessInterrupts and go straight
to HandleParallelMessages; not least because that fails to consider the
InterruptPending flag, resulting in useless work both here
(if InterruptPending isn't set) and in the next CHECK_FOR_INTERRUPTS call
(if it is).
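For reference, CHECK_FOR_INTERRUPTS() is roughly the following
(non-Windows form), which is why bypassing it also skips the
InterruptPending test:

    #define CHECK_FOR_INTERRUPTS() \
    do { \
        if (InterruptPending) \
            ProcessInterrupts(); \
    } while (0)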
This thinko seems to have been introduced in the initial coding of
storage/ipc/shm_mq.c (commit ec9037df2), and then blindly copied into all
the subsequent parallel-query support logic. Back-patch relevant hunks
to 9.4 to extirpate the error everywhere.
Discussion: <1661.1469996911@sss.pgh.pa.us>
Examination of the results from anole and gharial suggests that we're
only managing to track the size of one of the two stacks on IA64 machines.
Some googling gave the answer: on HPUX11, the register stack is reported
as a page type I don't see in pstat.h on my HPUX10 box. Let's try
testing for that.
After a look at preliminary results from commit 88cf37d2a8,
I realized it'd be a good idea to spew out the maximum depth measurement
seen by check_stack_depth. So add some quick-n-dirty code to do that.
Like the previous commit, this will be reverted once we've gathered
a set of buildfarm runs with it.
This patch is meant to gather information from the buildfarm members, and
will be reverted in a day or so. The idea is to try to find out the
high-water stack consumption while running the regression tests,
particularly on IA64 which is suspected to use much more stack than other
architectures. On machines with pmap, we can use that; but the IA64 farm
members are running HPUX, so also include some bespoke code for HPUX.
(I've tested the latter on HPUX 10/HPPA; not entirely sure it will work
on HPUX 11/IA64, but we'll soon find out.)
Discussion: <CAM-w4HMwwcwaVvYcAH0_FGtG5GeXdYVRfvG81pXnSJWHnCfosQ@mail.gmail.com>
Since indexes are created without valid LSNs, an index created
while a snapshot older than old_snapshot_threshold existed could
cause queries to return incorrect results when those old snapshots
were used, if any relevant rows had been subject to early pruning
before the index was built. Prevent usage of a newly created index
until all such snapshots are released, for relations where this can
happen.
Questions about the interaction of "snapshot too old" with index
creation were initially raised by Andres Freund.
Reviewed by Robert Haas.
Transmit the leader's temp-namespace state to workers. This is important
because without it, the workers do not really have the same search path
as the leader. For example, there is no good reason (and no extant code
either) to prevent a worker from executing a temp function that the
leader created previously; but as things stood it would fail to find the
temp function, and then either fail or execute the wrong function entirely.
We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.
Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session. While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway. We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables. (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)
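A sketch of that substitution (assuming ParallelMasterBackendId is the
mechanism carrying the leader's BackendId into workers):

    /* in a worker, temp relations belong to the leader's backend */
    #define BackendIdForTempRelations() \
        (ParallelMasterBackendId == InvalidBackendId ? \
         MyBackendId : ParallelMasterBackendId)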
Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open(). This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.
Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
Prior to this patch, it was occasionally possible, after shm_mq_sendv
had previously returned SHM_MQ_DETACHED, for a later shm_mq_sendv
operation to fail an assertion instead of just again returning
SHM_MQ_DETACHED. From the shm_mq code's point of view, it was
expecting to be called again with the same arguments, since the
previous operation had only partially completed. However, a caller
who isn't using non-blocking mode won't be prepared to repeat the call
with the same arguments, and this code shouldn't expect that they
will. Repair in such a way that we'll be OK whether the next call
uses the same arguments or not.
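In other words, the contract callers can rely on is sticky detachment;
a hypothetical caller:

    res = shm_mq_sendv(mqh, iov, iovcnt, nowait);
    if (res == SHM_MQ_DETACHED)   /* every later send repeats this */
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("lost connection to parallel worker")));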
Found by Andreas Seltenreich. Analysis and sketch of fix by Amit
Kapila. Patch by me, reviewed by Amit Kapila.
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I refrained from fixing some of the
most common ones, like "thru", for that reason. The rest don't seem
numerous enough to cause problems.
Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
Use MAXALIGN size/alignment to guarantee that later uses of memory are
aligned correctly. E.g. epoll_event might need 8 byte alignment on some
platforms, but earlier allocations like WaitEventSet and WaitEvent might
not be sized to guarantee that when purely using sizeof().
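A sketch of the sizing pattern, in the spirit of CreateWaitEventSet()
(details assumed):

    /* MAXALIGN each chunk so whatever follows it is aligned too */
    sz = MAXALIGN(sizeof(WaitEventSet));
    sz += MAXALIGN(sizeof(WaitEvent) * nevents);
    #if defined(WAIT_USE_EPOLL)
    sz += MAXALIGN(sizeof(struct epoll_event) * nevents);
    #endif
    set = (WaitEventSet *) MemoryContextAllocZero(context, sz);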
Found by myself while testing on a Sun Ultra 5 (Sparc IIi) with some
editorializing by Andres Freund.
In passing, fix a couple of typos in the area.
BRIN was relying on the ability to remove a tuple from an index page,
then put another tuple in the same line pointer. But PageAddItem
refuses to add a tuple beyond the first free item past the last used
item, and in particular, it rejects an attempt to add an item to an
empty page anywhere other than the first line pointer. PageAddItem
issues a WARNING and indicates to the caller that it failed, which in
turn causes the BRIN calling code to issue a PANIC, so the whole
sequence looks like this:
WARNING: specified item offset is too large
PANIC: failed to add BRIN tuple
To fix, create a new function PageAddItemExtended which is like
PageAddItem except that the two boolean arguments become a flags bitmap;
the "overwrite" and "is_heap" boolean flags in PageAddItem become
PAI_OVERWRITE and PAI_IS_HEAP flags in the new function, and a new flag
PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN.
PageAddItem() retains its original signature, for compatibility with
third-party modules (other callers in core code are not modified,
either).
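A sketch of the resulting interface (flag values illustrative):

    /* flags bitmap replacing PageAddItem's two booleans */
    #define PAI_OVERWRITE        (1 << 0)   /* was "overwrite" */
    #define PAI_IS_HEAP          (1 << 1)   /* was "is_heap" */
    #define PAI_ALLOW_FAR_OFFSET (1 << 2)   /* new, for BRIN */

    /* compatibility wrapper keeping the original signature */
    OffsetNumber
    PageAddItem(Page page, Item item, Size size,
                OffsetNumber offsetNumber, bool overwrite, bool is_heap)
    {
        return PageAddItemExtended(page, item, size, offsetNumber,
                                   (overwrite ? PAI_OVERWRITE : 0) |
                                   (is_heap ? PAI_IS_HEAP : 0));
    }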
Also, in the belt-and-suspenders spirit, I added a new sanity check in
brinGetTupleForHeapBlock to raise an error if a TID found in the revmap
is not marked as live by the page header. This causes it to react with
"ERROR: corrupted BRIN index" to the bug at hand, rather than a hard
crash.
Backpatch to 9.5.
Bug reported by Andreas Seltenreich as detected by his handy sqlsmith
fuzzer.
Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
Commit 1aba62ec moved the range check of that option from guc.c into
bufmgr.c, but introduced a bug by changing a >= 0.0 to > 0.0, which made
the value 0 no longer accepted. Put it back.
Reported by Jeff Janes, diagnosed by Tom Lane
Unfortunately the segment size checks from 72a98a6395 had the negative
side-effect of breaking a corner case in mdsync(): when processing an
fsync request for a truncated-away segment, mdsync() could fail with
"could not fsync file" (if the previous segment was < RELSEG_SIZE) because
_mdfd_getseg() now wouldn't return the relevant segment anymore.
The cleanest fix seems to be to allow the caller of _mdfd_getseg() to
specify whether checks for RELSEG_SIZE are performed. To allow doing so,
change the ExtensionBehavior enum into a bitmask. Besides allowing for
the addition of EXTENSION_DONT_CHECK_SIZE, this makes for a nicer
implementation of EXTENSION_REALLY_RETURN_NULL.
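A sketch of the reworked flags (bit values illustrative; EXTENSION_FAIL
and EXTENSION_CREATE assumed as the pre-existing modes):

    #define EXTENSION_FAIL               (1 << 0) /* error if missing */
    #define EXTENSION_RETURN_NULL        (1 << 1) /* return NULL instead */
    #define EXTENSION_REALLY_RETURN_NULL (1 << 2) /* ... never creating */
    #define EXTENSION_CREATE             (1 << 3) /* create segment */
    /* independent option, OR-able with the above */
    #define EXTENSION_DONT_CHECK_SIZE    (1 << 4)

    /* e.g. mdsync() can now suppress the size check: */
    seg = _mdfd_getseg(reln, forknum, blkno, false,
                       EXTENSION_RETURN_NULL | EXTENSION_DONT_CHECK_SIZE);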
Besides mdsync(), the only callsite that should change behaviour due to
this is mdprefetch(), which now doesn't create segments anymore, even in
recovery. Given the uses of mdprefetch() that seems better.
Reported-By: Thom Brown
Discussion: CAA-aLv72QazLvPdKZYpVn4a_Eh+i4_cxuB03k+iCuZM_xjc+6Q@mail.gmail.com
Before this commit _mdfd_getseg(), in contrast to mdnblocks(), did not
verify whether all segments leading up to the to-be-opened one were
RELSEG_SIZE sized. That is e.g. not the case after truncating a
relation, because later segments just get truncated to zero length, not
removed.
Once a "non-existent" segment has been opened in a session, mdnblocks()
will return wrong results, causing errors like "could not read block %u
in file" when accessing blocks. Closing the session, or the later
arrival of relevant invalidation messages, would "fix" the problem.
That, so far, was mostly harmless, because most segment accesses are
only done after an mdnblocks() call. But since 428b1d6b29 we try to
open segments that might have been deleted, to trigger kernel writeback
from a backend's queue of recent writes.
To fix, check segment sizes in _mdfd_getseg() when opening previously
unopened segments. In practice this shouldn't imply a lot of additional
lseek() calls, because mdnblocks() will most of the time already have
opened all relevant segments.
This commit also fixes a second problem, namely that
_mdfd_getseg(EXTENSION_RETURN_NULL) extends files during recovery, which is not
desirable for the mdwriteback() case. Add EXTENSION_REALLY_RETURN_NULL,
which does not behave that way, and use it.
Reported-By: Thom Brown
Author: Andres Freund, Abhijit Menon-Sen
Reviewed-By: Robert Haas, Fabien Coelho
Discussion: CAA-aLv6Dp_ZsV-44QA-2zgkqWKQq=GedBX2dRSrWpxqovXK=Pg@mail.gmail.com
Fixes: 428b1d6b29
So far, when a transaction with pending invalidations, but without an
assigned xid, committed, we simply ignored those invalidation
messages. That's problematic, because those are actually sent for a
reason.
Known symptoms of this include that existing sessions on a hot-standby
replica sometimes fail to notice new concurrently built indexes and
visibility map updates.
The solution is to WAL log such invalidations in transactions without an
xid. We considered alternatively force-assigning an xid, but that'd be
problematic for vacuum, which might be run in systems with few xids.
Important: This adds a new WAL record, but as the patch has to be
back-patched, we can't bump the WAL page magic. This means that standbys
have to be updated before primaries; otherwise
"PANIC: standby_redo: unknown op code 32" errors can be encountered.
Reported-By: Васильев Дмитрий, Masahiko Sawada
Discussion: CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
    CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
That commit increased all shared memory allocations to the next higher
multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation
started on a cache line boundary. It also failed to remove a couple
other pieces of now-useless code.
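The distinction, as a purely illustrative sketch against the shmem
allocator's free-offset bookkeeping:

    /* rounding each request up to a cache-line multiple ... */
    size = CACHELINEALIGN(size);
    /* ... only helps if the allocation also starts cache-aligned */
    newStart = CACHELINEALIGN(shmemseghdr->freeoffset);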
BUFFERALIGN() is perhaps obsolete at this point, and likely should be
removed at some point, too, but that seems like it can be left to a
future cleanup.
Mistakes all pointed out by Andres Freund. The patch is mine, with
a few extra assertions which I adopted from his version of this fix.
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
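A sketch of the split (shape assumed): the cheap disqualifying tests
are inlined, so only snapshots that survive them pay for the
out-of-line check.

    static inline void
    TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
    {
        Assert(relation != NULL);

        /* cheap inlined tests; most calls bail out here */
        if (old_snapshot_threshold >= 0
            && snapshot != NULL
            && snapshot->satisfies == HeapTupleSatisfiesMVCC
            && !XLogRecPtrIsInvalid(snapshot->lsn)
            && PageGetLSN(page) > snapshot->lsn)
            TestForOldSnapshot_impl(snapshot, relation); /* slow path */
    }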
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
Coverity complained that oldPartitionLock was possibly dereferenced after
having been set to NULL. That actually can't happen, because we'd only use
it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is
justified in complaining, because at line 1275 we actually overwrite
oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to
whether to release the oldPartitionLock. Thus, the code would be incorrect
if someone else had changed the buffer's BM_TAG_VALID flag meanwhile.
That should not happen, since we hold pin on the buffer throughout this
sequence, but it's starting to look like a rather shaky chain of logic.
And there's no need for such assumptions, because we can simply replace
the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL),
which has identical results and makes it plain to all comers that we don't
dereference a null pointer. A small side benefit is that the range of
liveness of oldFlags is greatly reduced, possibly allowing the compiler
to save a register.
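Concretely, the change amounts to this (sketch):

    /* before: trusts oldFlags to still describe the buffer */
    if (oldFlags & BM_TAG_VALID)
        LWLockRelease(oldPartitionLock);

    /* after: test the pointer we're about to dereference */
    if (oldPartitionLock != NULL)
        LWLockRelease(oldPartitionLock);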
This is just cleanup, not an actual bug fix, so there seems no need
for a back-patch.
We've had repeated troubles over the years with failures to initialize
spinlocks correctly; see 6b93fcd14 for a recent example. Most of the time,
on most platforms, such oversights can escape notice because all-zeroes is
the expected initial content of an slock_t variable. The only platform
we have where the initialized state of an slock_t isn't zeroes is HPPA,
and that's practically gone in the wild. To make it easier to catch such
errors without needing one of those machines, adjust the --disable-spinlocks
code so that zero is not a valid value for an slock_t in such builds.
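One way to do that, assuming --disable-spinlocks emulates spinlocks
with a pool of semaphores: hand out semaphore numbers starting at 1,
so a zeroed (never-initialized) slock_t is immediately invalid.

    void
    s_init_lock_sema(volatile slock_t *lock, bool nested)
    {
        static int  counter = 0;

        /* values are now 1..NUM_SPINLOCK_SEMAPHORES, never 0 */
        *lock = ((++counter) % NUM_SPINLOCK_SEMAPHORES) + 1;
    }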
In passing, remove a bunch of unnecessary #include's from spin.c;
commit daa7527afc removed all the intermodule coupling that
made them necessary.
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
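The hiding pattern, sketched with a simplified signature:

    #ifndef FRONTEND
    /* frontend programs include bufmgr.h but can't resolve backend
     * symbols, so keep the inline function out of their sight */
    static inline Page
    BufferGetPage(Buffer buffer)
    {
        return (Page) BufferGetBlock(buffer);
    }
    #endif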
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c, s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
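A sketch of the resulting shape (field list assumed from the
surrounding discussion):

    /* runtime initialization sidesteps C89's restriction on
     * non-constant expressions in automatic struct initializers */
    static inline void
    init_spin_delay(SpinDelayStatus *status,
                    const char *file, int line, const char *func)
    {
        status->spins = 0;
        status->delays = 0;
        status->cur_delay = 0;
        status->file = file;
        status->line = line;
        status->func = func;
    }

    /* callers capture their own location */
    #define init_local_spin_delay(status) \
        init_spin_delay(status, __FILE__, __LINE__, PG_FUNCNAME_MACRO)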
Discussion: 4369.1460435533@sss.pgh.pa.us
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal for such an initializer
to refer to non-constant expressions, and the ptr argument was one. This,
as reported by Tom, led to a failure on buildfarm animal pademelon.
The pointer, especially on systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32, which just read/write from
memory, to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As such failures are actually useful
for catching this type of error, continue to omit - intentionally this
time - initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers' state to BufferDesc's description.
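The corrected pattern, roughly (local buffers are never accessed
concurrently, so unlocked reads and writes suffice):

    /* sketch of MarkLocalBufferDirty's state update */
    uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);

    if (!(buf_state & BM_DIRTY))
        pgBufferUsage.local_blks_dirtied++;
    buf_state |= BM_DIRTY;
    pg_atomic_write_u32(&bufHdr->state, buf_state);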
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
Commit 428b1d6b29 introduced the use of
msync() for flushing dirty data from the kernel's file buffers. Several
portability issues were overlooked, though:
* Not all implementations of mmap() think that nbytes == 0 means "map
the whole file". To fix, use lseek() to find out the true length.
Fix callers of pg_flush_data to be aware that nbytes == 0 may result
in trashing the file's seek position (see the sketch after this list).
* Not all implementations of mmap() will accept partial-page mmap
requests. To fix, round down the length request to whatever sysconf()
says the page size is. (I think this is OK from a portability standpoint,
because sysconf() is required by SUS v2, and we aren't trying to compile
this part on Windows anyway. Buildfarm should let us know if not.)
* On 32-bit machines, the file size might exceed the available free
address space, or even exceed what will fit in size_t. Check for
the latter explicitly to avoid passing a false request size to mmap().
If mmap fails, silently fall through to the next implementation method,
rather than bleating to the postmaster log and giving up.
* mmap'ing directories fails on some platforms, and even if it works,
msync'ing the directory is quite unlikely to help, as for that matter are
the other flush implementations. In pre_sync_fname(), just skip flush
attempts on directories.
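A sketch of the first fix (error handling elided):

    /* nbytes == 0 means "the whole file", but mmap() can't be trusted
     * to read a zero length that way; measure explicitly.  Note this
     * clobbers the file's seek position. */
    if (nbytes == 0)
        nbytes = lseek(fd, 0, SEEK_END);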
In passing, copy-edit the comments a bit.
Stas Kelvich and myself
On a big NUMA machine with 1000 connections in saturation load
there was a performance regression due to spinlock contention, for
acquiring values which were never used. Just fill with dummy
values if we're not going to use them.
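Schematically, assuming the values in question are the snapshot's LSN
and timestamp, which only the enabled feature ever consults:

    if (old_snapshot_threshold < 0)
    {
        /* feature disabled: dummy values, never read */
        snapshot->lsn = InvalidXLogRecPtr;
        snapshot->whenTaken = 0;
    }
    else
    {
        /* both of these take spinlocks internally */
        snapshot->lsn = GetXLogInsertRecPtr();
        snapshot->whenTaken = GetSnapshotCurrentTimestamp();
    }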
This patch has not been benchmarked yet on a big NUMA machine, but
it seems like a good idea on general principle, and it seemed to
prevent an apparent 2.2% regression on a single-socket i7 box
running 200 connections at saturation load.
I was initially concerned that some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST was passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.
Per gripes from Álvaro Herrera and Tom Lane
Previously we used a spinlock, in addition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.
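Roughly (the real code spins politely via the s_lock.c delay logic
rather than busy-waiting):

    /* sketch: the wait-list "lock" is just a bit in ->state */
    static void
    LWLockWaitListLock(LWLock *lock)
    {
        for (;;)
        {
            uint32 old_state = pg_atomic_fetch_or_u32(&lock->state,
                                                      LW_FLAG_LOCKED);

            if (!(old_state & LW_FLAG_LOCKED))
                return;         /* bit was clear: we own the lock */
            /* someone else holds it; wait for the bit to clear */
            while (pg_atomic_read_u32(&lock->state) & LW_FLAG_LOCKED)
                ;
        }
    }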
Due to 6150a1b0, BufferDescs on platforms (like PPC) with > 1 byte
spinlocks grew beyond 64 bytes. As 64 bytes is the size we pad
allocated BufferDescs to, this can increase false sharing, causing
performance problems in turn. Together with the previous commit this
reduces the size to <= 64 bytes on all common platforms.
Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com
    20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows manipulating them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily be implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There are some additional operations which could be done
without the lock but aren't in this patch; the most important places are
covered.
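A sketch of the resulting pin fast path (usage-count handling elided;
names per this commit):

    uint32 old_state = pg_atomic_read_u32(&buf->state);

    for (;;)
    {
        if (old_state & BM_LOCKED)      /* header "spinlock" bit set */
        {
            old_state = WaitBufHdrUnlocked(buf);
            continue;
        }
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_state,
                                           old_state + BUF_REFCOUNT_ONE))
            break;                      /* pinned, lock-free */
        /* CAS failure refreshed old_state; just retry */
    }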
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It already has two
users, and more are coming up; there's a follow-up patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes of age a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
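For reference, the GUC might be set like this (values illustrative):

    # postgresql.conf
    old_snapshot_threshold = -1   # default: feature disabled
    #old_snapshot_threshold = 0   # testing only
    #old_snapshot_threshold = 60  # snapshots older than 60 minutes may
                                  # draw "snapshot too old" errors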
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate. Experimentation shows that extending the relation
multiple blocks at a time improves scalability.
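The heuristic is along these lines (constants cited from memory, treat
as illustrative):

    /* extend in proportion to the number of waiters, with a cap */
    lockWaiters = RelationExtensionLockWaiterCount(relation);
    extraBlocks = Min(512, lockWaiters * 20);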
Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
Experimentation shows this only costs about 6kB, which seems well
worth it given the major performance effects that can be caused
by insufficient alignment, especially on larger systems.
Discussion: 14166.1458924422@sss.pgh.pa.us
If archive_timeout > 0 we should avoid logging XLOG_RUNNING_XACTS if idle.
Bug 13685 reported by Laurence Rowe, investigated in detail by Michael Paquier,
though this is not his proposed fix.
Discussion: 20151016203031.3019.72930@wrigleys.postgresql.org
Simple non-invasive patch to allow later backpatch to 9.4 and 9.5