corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
identical to tuplestore_puttuple(), except it operates on arrays of
Datums + nulls rather than a fully-formed HeapTuple. In several places
that use the tuplestore API, this means we can avoid creating a
HeapTuple altogether, saving a copy.
oprofile shows that a nontrivial amount of time is being spent in
repeated calls to index_getprocinfo, which really only needs to be
called once. So do that, and inline _hash_datum2hashkey to make it
work.
bucket number, so as to ensure locality of access to the index during the
insertion step. Without this, building an index significantly larger than
available RAM takes a very long time because of thrashing. On the other
hand, sorting is just useless overhead when the index does fit in RAM.
We choose to sort when the initial index size exceeds effective_cache_size.
This is a revised version of work by Tom Raney and Shreya Bhargava.
than dividing them into 1GB segments as has been our longtime practice. This
requires working support for large files in the operating system; at least for
the time being, it won't be the default.
Zdenek Kotala
regardless of the number of tuples involved, it's incorrect to skip it
when memtupcount = 1; the number of cycles saved is minuscule anyway.
An alternative solution would be to pull the state changes out to the
call site in tuplesort_performsort, but keeping them near the corresponding
changes in make_bounded_heap seems marginally cleaner. Noticed by
Greg Stark.
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.) Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
tablespace(s) in which to store temp tables and temporary files. This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created). Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.
Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
is using mark/restore but not rewind or backward-scan capability. Insert a
materialize plan node between a mergejoin and its inner child if the inner
child is a sort that is expected to spill to disk. The materialize shields
the sort from the need to do mark/restore and thereby allows it to perform
its final merge pass on-the-fly; while the materialize itself is normally
cheap since it won't spill to disk unless the number of tuples with equal
key values exceeds work_mem.
Greg Stark, with some kibitzing from Tom Lane.
need be returned. We keep a heap of the current best N tuples and sift-up
new tuples into it as we scan the input. For M input tuples this means
only about M*log(N) comparisons instead of M*log(M), not to mention a lot
less workspace when N is small --- avoiding spill-to-disk for large M
is actually the most attractive thing about it. Patch includes planner
and executor support for invoking this facility in ORDER BY ... LIMIT
queries. Greg Stark, with some editorialization by moi.
which comparison operators to use for plan nodes involving tuple comparison
(Agg, Group, Unique, SetOp). Formerly the executor looked up the default
equality operator for the datatype, which was really pretty shaky, since it's
possible that the data being fed to the node is sorted according to some
nondefault operator class that could have an incompatible idea of equality.
The planner knows what it has sorted by and therefore can provide the right
equality operator to use. Also, this change moves a couple of catalog lookups
out of the executor and into the planner, which should help startup time for
pre-planned queries by some small amount. Modify the planner to remove some
other cavalier assumptions about always being able to use the default
operators. Also add "nulls first/last" info to the Plan node for a mergejoin
--- neither the executor nor the planner can cope yet, but at least the API is
in place.
per-column options for btree indexes. The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options. The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.
Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass. This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
cases. Operator classes now exist within "operator families". While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.
This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later. Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default. I owe some more documentation work, too. But that can all
be done in smaller pieces once this infrastructure is in place.
repeatedly. Now that we don't have to worry about memory leaks from
glibc's qsort, we can safely put CHECK_FOR_INTERRUPTS into the tuplesort
comparators, as was requested a couple months ago. Also, get rid of
non-reentrancy and an extra level of function call in tuplesort.c by
providing a variant qsort_arg() API that passes an extra void * argument
through to the comparison routine. (We might want to use that in other
places too, I didn't look yet.)
per-tuple space overhead for sorts in memory. I chose to replace the
previous patch that tried to write out the bare minimum amount of data
when sorting on disk; instead, just dump the MinimalTuples as-is. This
wastes 3 to 10 bytes per tuple depending on architecture and null-bitmap
length, but the simplification in the writetup/readtup routines seems
worth it.
tuples with less header overhead than a regular HeapTuple, per my
recent proposal. Teach TupleTableSlot code how to deal with these.
As proof of concept, change tuplestore.c to store MinimalTuples instead
of HeapTuples. Future patches will expand the concept to other places
where it is useful.
and transaction visibility fields of tuples being sorted. These are
always uninteresting in a tuple being sorted (if the fields were actually
selected, they'd have been pulled out into user columns beforehand).
This saves about 24 bytes per row being sorted, which is a useful savings
for any but the widest of sort rows. Per recent discussion.
case where we run low on array slots before we run low on memory is much
more probable than I had thought, and so it's important to treat each
tape fairly in that case. To fix this, track per-tape slot allocations
just like we track per-tape space allocation. Also, in the FINALMERGE
code path avoid scanning all the input tapes when we really only need to
read from one. This should fix poor behavior with very large work_mem
as exhibited by Stefan Kaltenbrunner.
I didn't do anything about putting an upper bound on the number of tapes,
but maybe we should still consider that.
performance issue during regular merge passes not only the 'final merge'
case. The original design contemplated that there would never be more
than about one free block per 'tape', hence no need for an efficient
method of keeping the free blocks sorted. But given the later addition
of merge preread behavior in tuplesort.c, there is likely to be about
work_mem worth of free blocks, which is not so small ... and for that
matter the number of tapes isn't necessarily small anymore either. So
we'd better get rid of the assumption entirely. Instead, I'm assuming
that the usage pattern will involve alternation between merge preread
and writing of a new run. This makes it reasonable to just add blocks
to the list without sorting during successive ltsReleaseBlock calls,
and then do a qsort() when we start getting ltsGetFreeBlock() calls.
Experimentation seems to confirm that there aren't many qsort calls
relative to the number of ltsReleaseBlock/ltsGetFreeBlock calls.
we are doing the final merge pass on-the-fly, and not writing the data
back onto a 'tape', the number of free blocks in the tape set will become
large, leading to a lot of time wasted in ltsReleaseBlock(). There is
really no need to track the free blocks anymore in this state, so add a
simple shutoff switch. Per report from Stefan Kaltenbrunner.
In particular, ensure that enlargement of the memtuples[] array doesn't
fall foul of MaxAllocSize when work_mem is very large, and don't bother
enlarging it if that would force an immediate switch into 'tape' mode anyway.
we'll go over to disk-based sort if we reach that limit.
This fixes Stefan Kaltenbrunner's observation that sorting can suffer an
'invalid memory alloc request size' failure when sort_mem is set large
enough. It's unfortunately not so easy to fix in 8.1 ...
each tuple, as per my proposal of several days ago. Also, clean up
sort memory management by keeping all working data in a separate memory
context, and refine the handling of low-memory conditions.
allocates the control data. The per-tape buffers are allocated only
on first use. This saves memory in situations where tuplesort.c
overestimates the number of tapes needed (ie, there are fewer runs
than tapes). Also, this makes legitimate the coding in inittapes()
that includes tape buffer space in the maximum-memory calculation:
when inittapes runs, we've already expended the whole allowed memory
on tuple storage, and so we'd better not allocate all the tape buffers
until we've flushed some tuples out of memory.
with fixed merge order (fixed number of "tapes") was based on obsolete
assumptions, namely that tape drives are expensive. Since our "tapes"
are really just a couple of buffers, we can have a lot of them given
adequate workspace. This allows reduction of the number of merge passes
with consequent savings of I/O during large sorts.
Simon Riggs with some rework by Tom Lane
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
which is neither needed by nor related to that header. Remove the bogus
inclusion and instead include the header in those C files that actually
need it. Also fix unnecessary inclusions and bad inclusion order in
tsearch2 files.