This also involves renaming src/include/utils/hashutils.h, which
becomes src/include/common/hashfn.h. Perhaps an argument can be
made for keeping the hashutils.h name, but it seemed more
consistent to make it match the name of the file, and also more
descriptive of what is actually going on here.
Patch by me, reviewed by Suraj Kharage and Mark Dilger. Off-list
advice on how not to break the Windows build from Davinder Singh
and Amit Kapila.
Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
Commit 4eaea3db introduced TupleHashTableHash(), but the signature
didn't match the other exposed functions. Separate it into internal
and external versions. The external version hides the details behind
an API more consistent with the other external functions, and the
internal version is still suitable for simplehash.
Expose two new entry points: one for only calculating the hash value
of a tuple, and another for looking up a hash entry when the hash
value is already known. This will be useful for disk-based Hash
Aggregation to avoid recomputing the hash value for the same tuple
after saving and restoring it from disk.
Discussion: https://postgr.es/m/37091115219dd522fd9ed67333ee8ed1b7e09443.camel%40j-davis.com
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.
In the passing, removed a couple of duplicate inclusions.
Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
In the course of 5567d12ce03, 356687bd8 and 317ffdfeaac, I changed
BuildTupleHashTable[Ext]'s call to ExecBuildGroupingEqual to not pass
in the parent node, but NULL. Which in turn prevents the tuple
equality comparator from being JIT compiled. While that fixes
bug #15486, it is not actually necessary after all of the above commits,
as we don't re-build the comparator when using the new
BuildTupleHashTableExt() interface (as the content of the hashtable
are reset, but the TupleHashTable itself is not).
Therefore re-allow jit compilation for callers that use
BuildTupleHashTableExt with a separate context for "metadata" and
content.
As in the previous commit, there's ongoing work to make this easier to
test to prevent such regressions in the future, but that
infrastructure is not going to be backpatchable.
The performance impact of not JIT compiling hashtable equality
comparators can be substantial e.g. for aggregation queries that
aggregate a lot of input rows to few output rows (when there are a lot
of output groups, there will be fewer comparisons).
Author: Andres Freund
Discussion: https://postgr.es/m/20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de
Backpatch: 11, just as 5567d12ce03
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.
access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.
Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.
To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h. (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)
Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql
This has the advantage that the comparator expression, the table's
slot, etc do not have to be rebuilt. Additionally the simplehash.h
hashtable within the tuple hashtable now keeps its previous size and
doesn't need to be reallocated. That both reduces allocator overhead,
and improves performance in cases where the input estimation was off
by a significant factor.
To avoid an API/ABI break, the new parameter is exposed via the new
BuildTupleHashTableExt(), and BuildTupleHashTable() now is a wrapper
around the former, that continues to allocate the table itself in the
tablecxt.
Using this fixes performance issues discovered in the two bugs
referenced. This commit however has not converted the callers, that's
done in a separate commit.
Bug: #15592#15486
Reported-By: Jakub Janeček, Dmitry Marakasov
Author: Andres Freund
Discussion:
https://postgr.es/m/15486-05850f065da42931@postgresql.orghttps://postgr.es/m/20190114180423.ywhdg2iagzvh43we@alap3.anarazel.de
Backpatch: 11, this is a prerequisite for other fixes
In bf6c614a2f2c5 I added a expr context to evaluate the grouping
expression. Unfortunately the code I added initialized them while in
the calling context, rather the table context. Additionally, I used
CreateExprContext() rather than CreateStandaloneExprContext(), which
creates the econtext in the estate's query context.
Fix that by using CreateStandaloneExprContext when in the table's
tablecxt. As we rely on the memory being freed by a memory context
reset that means that the econtext's shutdown callbacks aren't being
called, but that seems ok as the expressions are tightly controlled
due to ExecBuildGroupingEqual().
Bug: #15592
Reported-By: Dmitry Marakasov
Author: Andres Freund
Discussion: https://postgr.es/m/20190114222838.h6r3fuyxjxkykf6t@alap3.anarazel.de
Backpatch: 11, where I broke this in bf6c614a2f2c5
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
This has a performance benefit on own, although not hugely so. The
primary benefit is that it will allow for to JIT tuple deforming and
comparator invocations.
Large parts of this were previously committed (773aec7aa), but the
commit contained an omission around cross-type comparisons and was
thus reverted.
Author: Andres Freund
Discussion: https://postgr.es/m/20171129080934.amqqkke2zjtekd4t@alap3.anarazel.de
This reverts commit 773aec7aa98abd38d6d9435913bb8e14e392c274.
There's an unresolved issue in the reverted commit: It only creates
one comparator function, but in for the nodeSubplan.c case we need
more (c.f. FindTupleHashEntry vs LookupTupleHashEntry calls in
nodeSubplan.c).
This isn't too difficult to fix, but it's not entirely trivial
either. The fact that the issue only causes breakage on 32bit systems
shows that the current test coverage isn't that great. To avoid
turning half the buildfarm red till those two issues are addressed,
revert.
The changes in b81b5a96f424531b97cdd1dba97d9d1b9c9d372e did not fully
address the issue, because the bit-mixing of the IV into the final
hash-key didn't prevent clustering in the input-data survive in the
output data.
This didn't cause a lot of problems because of the additional growth
conditions added d4c62a6b623d6eef88218158e9fa3cf974c6c7e5. But as we
want to rein those in due to explosive growth in some edges, this
needs to be fixed.
Author: Andres Freund
Discussion: https://postgr.es/m/20171127185700.1470.20362@wrigleys.postgresql.org
Backpatch: 10, where simplehash was introduced
Even if we don't emit definitions for SH_ALLOCATE and SH_FREE, we
still need prototypes. The user can't define them before including
simplehash.h because SH_TYPE isn't available yet.
For the allocator to be able to access private_data, it needs to
become an argument to SH_CREATE. Previously we relied on callers
to set that after returning from SH_CREATE, but SH_CREATE calls
SH_ALLOCATE before returning.
Dilip Kumar, reviewed by me.
This is infrastructure for a pending patch to allow parallel bitmap
heap scans.
Dilip Kumar, reviewed (in earlier versions) by Andres Freund and
(more recently) by me. Some further renaming by me, also.
Commit 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5 introduced the use
of a new type of hash table with linear reprobing for hash aggregates.
Such a hash table behaves very poorly if keys are inserted in hash
order, which does in fact happen in the case where a query use a
Finalize HashAggregate node fed (via Gather) by a Partial
HashAggregate node. In fact, queries with this type of plan tend
to run effectively forever.
Fix that by seeding the hash value differently in each worker
(and in the leader, if it participates).
Andres Freund and Robert Haas
The more efficient hashtable speeds up hash-aggregations with more than
a few hundred groups significantly. Improvements of over 120% have been
measured.
Due to the the different hash table queries that not fully
determined (e.g. GROUP BY without ORDER BY) may change their result
order.
The conversion is largely straight-forward, except that, due to the
static element types of simplehash.h type hashes, the additional data
some users store in elements (e.g. the per-group working data for hash
aggregaters) is now stored in TupleHashEntryData->additional. The
meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
has been changed to only be about the additionally stored size. That
size is only used for the initial sizing of the hash-table.
Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
The planner can sometimes compute very large values for numGroups, and in
cases where we have no alternative to building a hashtable, such a value
will get fed directly to BuildTupleHashTable as its nbuckets parameter.
There were two ways in which that could go bad. First, BuildTupleHashTable
declared the parameter as "int" but most callers were passing "long"s,
so on 64-bit machines undetected overflow could occur leading to a bogus
negative value. The obvious fix for that is to change the parameter to
"long", which is what I've done in HEAD. In the back branches that seems a
bit risky, though, since third-party code might be calling this function.
So for them, just put in a kluge to treat negative inputs as INT_MAX.
Second, hash_create can go nuts with extremely large requested table sizes
(notably, my_log2 becomes an infinite loop for inputs larger than
LONG_MAX/2). What seems most appropriate to avoid that is to bound the
initial table size request to work_mem.
This fixes bug #6035 reported by Daniel Schreiber. Although the reported
case only occurs back to 8.4 since it involves WITH RECURSIVE, I think
it's a good idea to install the defenses in all supported branches.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
that aren't turned into true joins). Since this is the last missing bit of
infrastructure, go ahead and fill out the hash integer_ops and float_ops
opfamilies with cross-type operators. The operator family project is now
DONE ... er, except for documentation ...
Hashing for aggregation purposes still needs work, so it's not time to
mark any cross-type operators as hashable for general use, but these cases
work if the operators are so marked by hand in the system catalogs.
which comparison operators to use for plan nodes involving tuple comparison
(Agg, Group, Unique, SetOp). Formerly the executor looked up the default
equality operator for the datatype, which was really pretty shaky, since it's
possible that the data being fed to the node is sorted according to some
nondefault operator class that could have an incompatible idea of equality.
The planner knows what it has sorted by and therefore can provide the right
equality operator to use. Also, this change moves a couple of catalog lookups
out of the executor and into the planner, which should help startup time for
pre-planned queries by some small amount. Modify the planner to remove some
other cavalier assumptions about always being able to use the default
operators. Also add "nulls first/last" info to the Plan node for a mergejoin
--- neither the executor nor the planner can cope yet, but at least the API is
in place.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.