During dumptuples() the call to writetuple() would pfree any non-null
tuple. This was quite wasteful as this happens just before we perform a
reset of the context which stores all of those tuples.
It seems to make sense to do a bit of a code refactor to make this work,
so here we just get rid of the writetuple function and adjust the WRITETUP
macro to call the state's writetup function. The WRITETUP usage in
mergeonerun() always has state->slabAllocatorUsed == true, so writetuple()
would never free the tuple or do any memory accounting. The only call
path that needs memory accounting done is in dumptuples(), so let's just
do it manually there.
In passing, let's get rid of the state->memtupcount-- code that counts the
memtupcount down to 0 one tuple at a time inside the loop. That seems to
be a rather inefficient way to set memtupcount to 0, so let's just zero it
after the loop instead.
Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvqZXoDCyrfCzZJR0-xH+7_q+GgitcQiYXUjRani7h4j8Q@mail.gmail.com
This commit puts the implementation of Tuple sort variants into the separate
file tuplesortvariants.c. That gives better separation of the code and
serves well as the demonstration that Tuple sort variant can be defined outside
of tuplesort.c.
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
The new TuplesortPublic data structure contains the definition of
sort-variant-specific interface methods and the part of Tuple sort operation
state required by their implementations. This will let define Tuple sort
variants without knowledge of Tuplesortstate, that is without knowledge
of generic sort implementation guts.
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
This commit puts some generic work away from sort-variant-specific function.
In particular, tuplesort_put*() now doesn't need to decrease available memory
and switch to sort context before calling puttuple_common(). writetup()
doesn't need to free SortTuple.tuple and increase available memory.
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
Abbreviation code is very similar along tuplesort_put*() functions. This
commit unifies that code and puts it into puttuple_common(). tuplesort_put*()
functions differs in the abbreviation condition, so it has been added as an
argument to the puttuple_common() function.
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
This commit is the preparation to move abbreviation logic into
puttuple_common(). The new removeabbrev function turns datum1 representation
of SortTuple's from the abbreviated key to the first column value. Therefore,
it encapsulates the differential part of abbreviation handling code in
tuplesort_put*() functions, making these functions similar.
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
It's currently unclear how do we split functionality between
Tuplesortstate.copytup() function and tuplesort_put*() functions.
For instance, copytup_index() and copytup_datum() return error while
tuplesort_putindextuplevalues() and tuplesort_putdatum() do their work.
This commit removes Tuplesortstate.copytup() altogether, putting the
corresponding code into tuplesort_put*().
Discussion: https://postgr.es/m/CAPpHfdvjix0Ahx-H3Jp1M2R%2B_74P-zKnGGygx4OWr%3DbUQ8BNdw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Pavel Borisov, Maxim Orlov, Matthias van de Meent
Reviewed-by: Andres Freund, John Naylor
40af10b57 changed things so we make use of a generation memory context for
storing tuples to be sorted by tuplesort.c. That change does not play
nicely with the changes made in 9f03ca915 (back in 2014). That commit
changed things so that index_form_tuple() is called while switched into
the tuplestore's tuplecontext. In order to fetch the tuple from the index,
index_form_tuple() must do various memory allocations which are unrelated
to the storage of the final returned tuple. Although all of these
allocations are pfree'd, the fact that we now use a generation context
means that the memory for these pfree'd allocations won't be used again by
any other allocation due to generation.c's lack of freelists. This could
result in sorts used for building indexes exceeding maintenance_work_mem
by a very large amount.
Here we fix it so we no longer allocate anything apart from the tuple
itself into the generation context by adding a new version of
index_form_tuple named index_form_tuple_context, which can be called to
specify the MemoryContext to allocate the tuple into.
Discussion: https://postgr.es/m/CAApHDvrHQkiFRHiGiAS-LMOvJN-eK-s762=tVzBz8ZqUea-a_A@mail.gmail.com
Backpatch-through: 15, where 40af10b57 was added.
697492434 added 3 new quicksort specialization functions for common
datatypes.
That commit was not very consistent in how it would determine if we're
compiling for 32-bit or 64-bit machines. It would sometimes use
USE_FLOAT8_BYVAL and at other times check if SIZEOF_DATUM == 8. This
could cause theoretical problems due to the way USE_FLOAT8_BYVAL is now
defined based on SIZEOF_VOID_P >= 8. If pointers for some reason were
ever larger than 8-bytes then we'd end up doing 32-bit comparisons
mistakenly. Let's just always check SIZEOF_DATUM >= 8.
It also seems that ssup_datum_signed_cmp is just never used on 32-bit
builds, so let's just ifdef that out to make sure we never accidentally
use that comparison function on such machines. This also allows us to
ifdef out 1 of the 3 new specialization quicksort functions in 32-bit
builds which seems to shrink down the binary by over 4KB on my machine.
In passing, also add the missing DatumGetInt32() / DatumGetInt64() macros
in the comparison functions.
Discussion: https://postgr.es/m/CAApHDvqcQExRhtRa9hJrJB_5egs3SUfOcutP3m+3HO8A+fZTPA@mail.gmail.com
Reviewed-by: John Naylor
697492434 added 3 new qsort specialization functions aimed to improve the
performance of sorting many of the common pass-by-value data types when
they're the leading or only sort key.
Unfortunately, that has caused a performance regression when sorting
datasets where many of the values being compared were equal. What was
happening here was that we were falling back to the standard sort
comparison function to handle tiebreaks. When the two given Datums
compared equally we would incur both the overhead of an indirect function
call to the standard comparer to perform the tiebreak and also the
standard comparer function would go and compare the leading key needlessly
all over again.
Here improve the situation in the 3 new comparison functions. We now
return 0 directly when the two Datums compare equally and we're performing
a 1-key sort.
Here we don't do anything to help the multi-key sort case where the
leading key uses one of the sort specializations functions. On testing
this case, even when the leading key's values are all equal, there
appeared to be no performance regression. Let's leave it up to future
work to optimize that case so that the tiebreak function no longer
re-compares the leading key over again.
Another possible fix for this would have been to add 3 additional sort
specialization functions to handle single-key sorts for these
pass-by-value types. The reason we didn't do that here is that we may
deem some other sort specialization to be more useful than single-key
sorts. It may be impractical to have sort specialization functions for
every single combination of what may be useful and it was already decided
that further analysis into which ones are the most useful would be delayed
until the v16 cycle. Let's not let this regression force our hand into
trying to make that decision for v15.
Author: David Rowley
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CA+hUKGJRbzaAOUtBUcjF5hLtaSHnJUqXmtiaLEoi53zeWSizeA@mail.gmail.com
CLUSTER sort won't use the datum1 SortTuple field when clustering
against an index whose leading key is an expression. This makes it
unsafe to use the abbreviated keys optimization, which was missed by the
logic that sets up SortSupport state. Affected tuplesorts output tuples
in a completely bogus order as a result (the wrong SortSupport based
comparator was used for the leading attribute).
This issue is similar to the bug fixed on the master branch by recent
commit cc58eecc5d. But it's a far older issue, that dates back to the
introduction of the abbreviated keys optimization by commit 4ea51cdfe8.
Backpatch to all supported versions.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKG+bA+bmwD36_oDxAoLrCwZjVtST2fqe=b4=qZcmU7u89A@mail.gmail.com
Backpatch: 10-
The general usage pattern when we store tuples in tuplesort.c is that
we store a series of tuples one by one then either perform a sort or spill
them to disk. In the common case, there is no pfreeing of already stored
tuples. For the common case since we do not individually pfree tuples, we
have very little need for aset.c memory allocation behavior which
maintains freelists and always rounds allocation sizes up to the next
power of 2 size.
Here we conditionally use generation.c contexts for storing tuples in
tuplesort.c when the sort will never be bounded. Unfortunately, the
memory context to store tuples is already created by the time any calls
would be made to tuplesort_set_bound(), so here we add a new sort option
that allows callers to specify if they're going to need a bounded sort or
not. We'll use a standard aset.c allocator when this sort option is not
set.
Extension authors must ensure that the TUPLESORT_ALLOWBOUNDED flag is
used when calling tuplesort_begin_* for any sorts that make a call to
tuplesort_set_bound().
Author: David Rowley
Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf++8JJ0paw+03dk+W25tQEcNQ@mail.gmail.com
This replaces the bool flag for randomAccess. An upcoming patch requires
adding another option, so instead of breaking the API for that, then
breaking it again one day if we add more options, let's just break it
once. Any boolean options we add in the future will just make use of an
unused bit in the flags.
Any extensions making use of tuplesorts will need to update their code
to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess.
TUPLESORT_NONE can be used for a set of empty options.
Author: David Rowley
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
When dispatching sort operations to specialized variants, commit
69749243 failed to handle the case where CLUSTER-sort decides not to
initialize datum1 and isnull1. Fix by hoisting that decision up a level
and advertising whether datum1 can be relied on, in the Tuplesortstate
object.
Per reports from UBsan and Valgrind build farm animals, while running
the cluster.sql test.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAFBsxsF1TeK5Fic0M%2BTSJXzbKsY6aBqJGNj6ptURuB09ZF6k_w%40mail.gmail.com
Previously, the specialized tuplesort routine inlined handling for
reverse-sort and NULLs-ordering but called the datum comparator via a
pointer in the SortSupport struct parameter. Testing has showed that we
can get a useful performance gain by specializing datum comparison for
the different representations of abbreviated keys -- signed and unsigned
64-bit integers and signed 32-bit integers. Almost all abbreviatable data
types will benefit -- the only exception for now is numeric, since the
datum comparison is more complex. The performance gain depends on data
type and input distribution, but often falls in the range of 10-20% faster.
Thomas Munro
Reviewed by Peter Geoghegan, review and performance testing by me
Discussion:
https://www.postgresql.org/message-id/CA%2BhUKGKKYttZZk-JMRQSVak%3DCXSJ5fiwtirFf%3Dn%3DPAbumvn1Ww%40mail.gmail.com
The SQL standard has been ambiguous about whether null values in
unique constraints should be considered equal or not. Different
implementations have different behaviors. In the SQL:202x draft, this
has been formalized by making this implementation-defined and adding
an option on unique constraint definitions UNIQUE [ NULLS [NOT]
DISTINCT ] to choose a behavior explicitly.
This patch adds this option to PostgreSQL. The default behavior
remains UNIQUE NULLS DISTINCT. Making this happen in the btree code
is pretty easy; most of the patch is just to carry the flag around to
all the places that need it.
The CREATE UNIQUE INDEX syntax extension is not from the standard,
it's my own invention.
I named all the internal flags, catalog columns, etc. in the negative
("nulls not distinct") so that the default PostgreSQL behavior is the
default if the flag is false.
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/84e5ee1b-387e-9a54-c326-9082674bde78@enterprisedb.com
In selectnewtape(), use 'nOutputTapes' rather than 'nOutputRuns' in the
check for whether to start a new tape or to append a new run to an
existing tape. Until 'maxTapes' is reached, nOutputTapes is always equal
to nOutputRuns, so it doesn't change the logic, but it seems more logical
to compare # of tapes with # of tapes. Also, currently maxTapes is never
modified after the merging begins, but written this way, the code would
still work if it was. (Although the nOutputRuns == nOutputTapes assertion
would need to be removed and using nOutputRuns % nOutputTapes to
distribute the runs evenly across the tapes wouldn't do a good job
anymore).
Similarly in mergeruns(), change to USEMEM(state->tape_buffer_mem) to
account for the memory used for tape buffers. It's equal to availMem
currently, but tape_buffer_mem is more direct and future-proof. For
example, if we changed the logic to only allocate half of the remaining
memory to tape buffers, USEMEM(state->tape_buffer_mem) would still be
correct.
Coverity complained about these. Hopefully this patch helps it to
understand the logic better. Thanks to Tom Lane for initial analysis.
The code for initializing the tapes on each merge iteration was skipped
in a parallel worker. I put the !WORKER(state) check in wrong place while
rebasing the patch.
That caused failures in the index build in 'multiple-row-versions'
isolation test, in multiple buildfarm members. On my laptop it was easier
to reproduce by building an index on a larger table, so that you got a
parallel sort more reliably.
The previous commit 65014000b3 changed the variable passed to elog
from an int64 to a size_t variable, but neglected to change the modifier
in the format string accordingly.
Per failure on buildfarm member lapwing.
The advantage of polyphase merge is that it can reuse the input tapes as
output tapes efficiently, but that is irrelevant on modern hardware, when
we can easily emulate any number of tape drives. The number of input tapes
we can/should use during merging is limited by work_mem, but output tapes
that we are not currently writing to only cost a little bit of memory, so
there is no need to skimp on them.
This makes sorts that need multiple merge passes faster.
Discussion: https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi
Reviewed-by: Peter Geoghegan, Zhihong Yu, John Naylor
All the tape functions, like LogicalTapeRead and LogicalTapeWrite, now
take a LogicalTape as argument, instead of LogicalTapeSet+tape number.
You can create any number of LogicalTapes in a single LogicalTapeSet, and
you don't need to decide the number upfront, when you create the tape set.
This makes the tape management in hash agg spilling in nodeAgg.c simpler.
Discussion: https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58d83%40iki.fi
Reviewed-by: Peter Geoghegan, Zhihong Yu, John Naylor
41469253e went to the trouble of removing a theoretical bug from
free_sort_tuple by checking if the tuple was NULL before freeing it. Let's
make this a little more robust by also setting the tuple to NULL so that
should we be called again we won't end up doing a pfree on the already
pfree'd tuple. Per advice from Tom Lane.
Discussion: https://postgr.es/m/3188192.1626136953@sss.pgh.pa.us
Backpatch-through: 9.6, same as 41469253e
This fixes a theoretical bug in tuplesort.c which, if a bounded sort was
used in combination with a byval Datum sort (tuplesort_begin_datum), when
switching the sort to a bounded heap in make_bounded_heap(), we'd call
free_sort_tuple(). The problem was that when sorting Datums of a byval
type, the tuple is NULL and free_sort_tuple() would free the memory for it
regardless of that. This would result in a crash.
Here we fix that simply by adding a check to see if the tuple is NULL
before trying to disassociate and free any memory belonging to it.
The reason this bug is only theoretical is that nowhere in the current
code base do we do tuplesort_set_bound() when performing a Datum sort.
However, let's backpatch a fix for this as if any extension uses the code
in this way then it's likely to cause problems.
Author: Ronan Dunklau
Discussion: https://postgr.es/m/CAApHDvpdoqNC5FjDb3KUTSMs5dg6f+XxH4Bg_dVcLi8UYAG3EQ@mail.gmail.com
Backpatch-through: 9.6, oldest supported version
This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by
one. The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.
The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.
As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.
Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
Since the introduction of different slot types, in 1a0586de36, we
create a virtual slot in tuplesort_begin_cluster(). While that looks
right, it unfortunately doesn't actually work, as ExecStoreHeapTuple()
is used to store tuples in the slot. Unfortunately no regression tests
for CLUSTER on expression indexes existed so far.
Fix the slot type, and add bare bones tests for CLUSTER on expression
indexes.
Reported-By: Justin Pryzby
Author: Andres Freund
Discussion: https://postgr.es/m/20191011210320.GS10470@telsasoft.com
Backpatch: 12, like 1a0586de36
Rename the "tupindex" field from tuplesort.c's SortTuple struct to
"srctape", since it can only ever be used to store a source/input tape
number when merging external sort runs. This has been the case since
commit 8b304b8b72, which removed replacement selection sort from
tuplesort.c.
READTUP() routines do not and cannot use the resettable "tuplecontext"
memory context, since it is deleted when merging begins. Update an
obsolete comment that claimed otherwise. This was an oversight in
commit e94568ecc1.
In passing, fix an unrelated tuplesort typo.
Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique. This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".
Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially in a multi-column index. Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now. A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.
Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved). Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3. contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes. These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.
A later patch will enhance the logic used by nbtree to pick a split
point. This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at. Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.
The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits. The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised. However, there should be a compatibility note in the v12
release notes.
Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
Use dedicated struct to represent nbtree insertion scan keys. Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions. This is based on a suggestion by Andrey
Lepikhov.
Streamline how unique index insertions cache binary search progress.
Cache the state of in-progress binary searches within _bt_check_unique()
for later instead of having callers avoid repeating the binary search in
an ad-hoc manner. This makes it easy to add a new optimization:
_bt_check_unique() now falls out of its loop immediately in the common
case where it's already clear that there couldn't possibly be a
duplicate.
The new _bt_check_unique() scheme makes it a lot easier to manage cached
binary search effort afterwards, from within _bt_findinsertloc(). This
is needed for the upcoming patch to make nbtree tuples unique by
treating heap TID as a final tiebreaker column. Unique key binary
searches need to restore lower and upper bounds. They cannot simply
continue to use the >= lower bound as the offset to insert at, because
the heap TID tiebreaker column must be used in comparisons for the
restored binary search (unlike the original _bt_check_unique() binary
search, where scankey's heap TID column must be omitted).
Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
... as well as its implementation from backend/access/hash/hashfunc.c to
backend/utils/hash/hashfn.c.
access/hash is the place for the hash index AM, not really appropriate
for generic facilities, which is what hash_any is; having things the old
way meant that anything using hash_any had to include the AM's include
file, pointlessly polluting its namespace with unrelated, unnecessary
cruft.
Also move the HTEqual strategy number to access/stratnum.h from
access/hash.h.
To avoid breaking third-party extension code, add an #include
"utils/hashutils.h" to access/hash.h. (An easily removed line by
committers who enjoy their asbestos suits to protect them from angry
extension authors.)
Discussion: https://postgr.es/m/201901251935.ser5e4h6djt2@alvherre.pgsql