Add a "copy" argument to make it optional to receive a copy of caller
tuple that is safe to use following a subsequent manipulating of
tuplesort's state. This is a performance optimization. Most existing
tuplesort_gettupleslot() callers are made to opt out of copying.
Existing callers that happen to rely on the validity of tuple memory
beyond subsequent manipulations of the tuplesort request their own
copy.
This brings tuplesort_gettupleslot() in line with
tuplestore_gettupleslot(). In the future, a "copy"
tuplesort_getdatum() argument may be added, that similarly allows
callers to opt out of receiving their own copy of tuple.
In passing, clarify assumptions that callers of other tuplesort fetch
routines may make about tuple memory validity, per gripe from Tom
Lane.
Author: Peter Geoghegan
Discussion: CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
Since hash indexes typically have very few overflow pages, adding a
new splitpoint essentially doubles the on-disk size of the index,
which can lead to large and abrupt increases in disk usage (and
perhaps long delays on occasion). To mitigate this problem to some
degree, divide larger splitpoints into four equal phases. This means
that, for example, instead of growing from 4GB to 8GB all at once, a
hash index will now grow from 4GB to 5GB to 6GB to 7GB to 8GB, which
is perhaps still not as smooth as we'd like but certainly an
improvement.
This changes the on-disk format of the metapage, so bump HASH_VERSION
from 2 to 3. This will force a REINDEX of all existing hash indexes,
but that's probably a good idea anyway. First, hash indexes from
pre-10 versions of PostgreSQL could easily be corrupted, and we don't
want to confuse corruption carried over from an older release with any
corruption caused despite the new write-ahead logging in v10. Second,
it will let us remove some backward-compatibility code added by commit
293e24e507.
Mithun Cy, reviewed by Amit Kapila, Jesper Pedersen and me. Regression
test outputs updated by me.
Discussion: http://postgr.es/m/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYty0jCf-pa+m+vYUJ716+AxM7nv_syvyanyf5O-L_i2A@mail.gmail.com
A QueryEnvironment concept is added, which allows new types of
objects to be passed into queries from parsing on through
execution. At this point, the only thing implemented is a
collection of EphemeralNamedRelation objects -- relations which
can be referenced by name in queries, but do not exist in the
catalogs. The only type of ENR implemented is NamedTuplestore, but
provision is made to add more types fairly easily.
An ENR can carry its own TupleDesc or reference a relation in the
catalogs by relid.
Although these features can be used without SPI, convenience
functions are added to SPI so that ENRs can easily be used by code
run through SPI.
The initial use of all this is going to be transition tables in
AFTER triggers, but that will be added to each PL as a separate
commit.
An incidental effect of this patch is to produce a more informative
error message if an attempt is made to modify the contents of a CTE
from a referencing DML statement. No tests previously covered that
possibility, so one is added.
Kevin Grittner and Thomas Munro
Reviewed by Heikki Linnakangas, David Fetter, and Thomas Munro
with valuable comments and suggestions from many others
The "Simplify tape block format" commit ignored the rule that blocks
returned by ltsGetFreeBlock() must be written out in the same order, at
least in the first write pass. To fix, relax that requirement, by making
ltsWriteBlock() to detect if it's about to create a "hole" in the
underlying BufFile, and fill it with zeros instead.
Reported, analysed, and reviewed by Peter Geoghegan.
Discussion: https://www.postgresql.org/message-id/CAM3SWZRWdNtkhiG0GyiX_1mUAypiK3dV6-6542pYe2iEL-foTA@mail.gmail.com
No more indirect blocks. The blocks form a linked list instead.
This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
Reviewed by Peter Geoghegan and Robert Haas.
Discussion: https://www.postgresql.org/message-id/34678beb-938e-646e-db9f-a7def5c44ada%40iki.fi
We allegedly allocated all remaining memory for the read buffers of the
sort tapes, but we allocated the merge heap only after that. That means
that the allocation of the merge heap was guaranteed to go over the memory
limit. Fix by allocating the merge heap first. This makes little difference
in practice, because the merge heap is tiny, but let's tidy.
While we're at it, add a safeguard for the case that we are already over
the limit when allocating the read buffers. That shouldn't happen, but
better safe than sorry.
The memory accounting error was reported off-list by Peter Geoghegan.
Have tuplesort_gettupleslot() copy the contents of its current table slot
as needed. This is based on an approach taken by tuplestore_gettupleslot().
In the future, tuplesort_gettupleslot() may also be taught to avoid copying
the tuple where caller can determine that that is safe (the
tuplestore_gettupleslot() interface already offers this option to callers).
Patch by Peter Geoghegan. Fixes bug #14344, reported by Regina Obe.
Report: <20160929035538.20224.39628@wrigleys.postgresql.org>
Backpatch-through: 9.6
Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.
This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.
Reviewed by Peter Geoghegan
Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
LogicalTapeRewind() should not allocate large read buffer, if the tape
is completely empty. The calling code relies on that, for its
calculation of how much memory to allocate for the read buffers. That
lead to massive overallocation of memory, if maxTapes was high, but
only a few tapes were actually used.
Reported by Tomas Vondra
Discussion: <7303da46-daf7-9c68-3cc1-9f83235cf37e@2ndquadrant.com>
Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.
Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we
do an external sort.
Reviewed by Peter Geoghegan and Claudio Freire.
Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>
In external sort's merge phase, we maintain a binary heap holding the next
tuple from each input tape. On each step, the topmost tuple is returned,
and replaced with the next tuple from the same tape. We were doing the
replacement by deleting the top node in one operation, and inserting the
next tuple after that. However, you can do a "replace-top" operation more
efficiently, in one "sift-up". A deletion will always walk the heap from
top to bottom, but in a replacement, we can stop as soon as we find the
right place for the new tuple. This is particularly helpful, if the tapes
are not in completely random order, so that the next tuple from a tape is
likely to land near the top of the heap.
Peter Geoghegan, reviewed by Claudio Freire, with some editing by me.
Discussion: <CAM3SWZRhBhiknTF_=NjDSnNZ11hx=U_SEYwbc5vd=x7M4mMiCw@mail.gmail.com>
Negative availMemLessRefund would be problematic. It's not entirely
clear whether the case can be hit in the code as it stands, but this
seems like good future-proofing in any case. While we're at it,
insist that the value be not merely positive but not tiny, so as to
avoid doing a lot of repalloc work for little gain.
Peter Geoghegan
Discussion: <CAM3SWZRVkuUB68DbAkgw=532gW0f+fofKueAMsY7hVYi68MuYQ@mail.gmail.com>
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
Due to an error in the abbreviated key abort logic, the most recently
processed SortTuple could be incorrectly marked NULL, resulting in an
incorrect final sort order.
In the worst case, this could result in a corrupt btree index, which
would need to be rebuild using REINDEX. However, abbrevation doesn't
abort very often, not all data types use it, and only one tuple would
end up in the wrong place, so the practical impact of this mistake may
be somewhat limited.
Report and patch by Peter Geoghegan.
Otherwise, when we abandon incremental memory accounting and use
batch allocation for the final merge pass, we might crash. This
has been broken since 0011c0091e.
Peter Geoghegan, tested by Noah Misch
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
I'm committing these changes separately so that it's clear what is
Peter's original work versus what I changed. This is a followup to
commit 0011c0091e, and these changes
are all by me.
Introduce a new memory context which stores tuple data, and reset it
at the end of each merge pass; this helps avoid memory fragmentation
and, consequently, overallocation. Also, for the final merge patch,
eliminate memory context chunk header overhead entirely by allocating
all of the memory used for buffering tuples during the merge in a
single chunk. Since this modestly increases the number of tuples we
can store, grow the memtuples array a bit so that we're less likely to
run short of slots there.
Peter Geoghegan. Review and testing of patches in this series by
Jeff Janes, Greg Stark, Mithun Cy, and me.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.
Peter Geoghegan, reviewd by Andreas Karlsson and by me.
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
So far we have worked around the fact that some very old compilers do
not support 'inline' functions by only using inline functions
conditionally (or not at all). Since such compilers are very rare by
now, we have decided to rely on inline functions from 9.6 onwards.
To avoid breaking these old compilers inline is defined away when not
supported. That'll cause "function x defined but not used" type of
warnings, but since nobody develops on such compilers anymore that's
ok.
This change in policy will allow us to more easily employ inline
functions.
I chose to remove code previously conditional on PG_USE_INLINE as it
seemed confusing to have code dependent on a define that's always
defined.
Blacklisting of compilers, like in c53f73879f, now has to be done
differently. A platform template can define PG_FORCE_DISABLE_INLINE to
force inline to be defined empty.
Discussion: 20150701161447.GB30708@awork2.anarazel.de
The tuplesort/tuplestore memory management logic assumed that the chunk
allocation overhead for its memtuples array could not increase when
increasing the array size. This is and always was true for tuplesort,
but we (I, I think) blindly copied that logic into tuplestore.c without
noticing that the assumption failed to hold for the much smaller array
elements used by tuplestore. Given rather small work_mem, this could
result in an improper complaint about "unexpected out-of-memory situation",
as reported by Brent DeSpain in bug #13530.
The easiest way to fix this is just to increase tuplestore's initial
array size so that the assumption holds. Rather than relying on magic
constants, though, let's export a #define from aset.c that represents
the safe allocation threshold, and make tuplestore's calculation depend
on that.
Do the same in tuplesort.c to keep the logic looking parallel, even though
tuplesort.c isn't actually at risk at present. This will keep us from
breaking it if we ever muck with the allocation parameters in aset.c.
Back-patch to all supported versions. The error message doesn't occur
pre-9.3, not so much because the problem can't happen as because the
pre-9.3 tuplestore code neglected to check for it. (The chance of
trouble is a great deal larger as of 9.3, though, due to changes in the
array-size-increasing strategy.) However, allowing LACKMEM() to become
true unexpectedly could still result in less-than-desirable behavior,
so let's patch it all the way back.
It's standard for quicksort implementations, after having partitioned the
input into two subgroups, to recurse to process the smaller partition and
then handle the larger partition by iterating. This method guarantees
that no more than log2(N) levels of recursion can be needed. However,
Bentley and McIlroy argued that checking to see which partition is smaller
isn't worth the cycles, and so their code doesn't do that but just always
recurses on the left partition. In most cases that's fine; but with
worst-case input we might need O(N) levels of recursion, and that means
that qsort could be driven to stack overflow. Such an overflow seems to
be the only explanation for today's report from Yiqing Jin of a SIGSEGV
in med3_tuple while creating an index of a couple billion entries with a
very large maintenance_work_mem setting. Therefore, let's spend the few
additional cycles and lines of code needed to choose the smaller partition
for recursion.
Also, fix up the qsort code so that it properly uses size_t not int for
some intermediate values representing numbers of items. This would only
be a live risk when sorting more than INT_MAX bytes (in qsort/qsort_arg)
or tuples (in qsort_tuple), which I believe would never happen with any
caller in the current core code --- but perhaps it could happen with
call sites in third-party modules? In any case, this is trouble waiting
to happen, and the corrected code is probably if anything shorter and
faster than before, since it removes sign-extension steps that had to
happen when converting between int and size_t.
In passing, move a couple of CHECK_FOR_INTERRUPTS() calls so that it's
not necessary to preserve the value of "r" across them, and prettify
the output of gen_qsort_tuple.pl a little.
Back-patch to all supported branches. The odds of hitting this issue
are probably higher in 9.4 and up than before, due to the new ability
to allocate sort workspaces exceeding 1GB, but there's no good reason
to believe that it's impossible to crash older branches this way.