mirror of
https://github.com/postgres/postgres.git
synced 2025-05-28 05:21:27 +03:00
444 lines
17 KiB
Plaintext
444 lines
17 KiB
Plaintext
<!-- doc/src/sgml/btree.sgml -->
|
|
|
|
<chapter id="btree">
|
|
<title>B-Tree Indexes</title>
|
|
|
|
<indexterm>
|
|
<primary>index</primary>
|
|
<secondary>B-Tree</secondary>
|
|
</indexterm>
|
|
|
|
<sect1 id="btree-intro">
|
|
<title>Introduction</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> includes an implementation of the
|
|
standard <acronym>btree</acronym> (multi-way balanced tree) index data
|
|
structure. Any data type that can be sorted into a well-defined linear
|
|
order can be indexed by a btree index. The only limitation is that an
|
|
index entry cannot exceed approximately one-third of a page (after TOAST
|
|
compression, if applicable).
|
|
</para>
|
|
|
|
<para>
|
|
Because each btree operator class imposes a sort order on its data type,
|
|
btree operator classes (or, really, operator families) have come to be
|
|
used as <productname>PostgreSQL</productname>'s general representation
|
|
and understanding of sorting semantics. Therefore, they've acquired
|
|
some features that go beyond what would be needed just to support btree
|
|
indexes, and parts of the system that are quite distant from the
|
|
btree AM make use of them.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-behavior">
|
|
<title>Behavior of B-Tree Operator Classes</title>
|
|
|
|
<para>
|
|
As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator
|
|
class must provide five comparison operators,
|
|
<literal><</literal>,
|
|
<literal><=</literal>,
|
|
<literal>=</literal>,
|
|
<literal>>=</literal> and
|
|
<literal>></literal>.
|
|
One might expect that <literal><></literal> should also be part of
|
|
the operator class, but it is not, because it would almost never be
|
|
useful to use a <literal><></literal> WHERE clause in an index
|
|
search. (For some purposes, the planner treats <literal><></literal>
|
|
as associated with a btree operator class; but it finds that operator via
|
|
the <literal>=</literal> operator's negator link, rather than
|
|
from <structname>pg_amop</structname>.)
|
|
</para>
|
|
|
|
<para>
|
|
When several data types share near-identical sorting semantics, their
|
|
operator classes can be grouped into an operator family. Doing so is
|
|
advantageous because it allows the planner to make deductions about
|
|
cross-type comparisons. Each operator class within the family should
|
|
contain the single-type operators (and associated support functions)
|
|
for its input data type, while cross-type comparison operators and
|
|
support functions are <quote>loose</quote> in the family. It is
|
|
recommendable that a complete set of cross-type operators be included
|
|
in the family, thus ensuring that the planner can represent any
|
|
comparison conditions that it deduces from transitivity.
|
|
</para>
|
|
|
|
<para>
|
|
There are some basic assumptions that a btree operator family must
|
|
satisfy:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
An <literal>=</literal> operator must be an equivalence relation; that
|
|
is, for all non-null values <replaceable>A</replaceable>,
|
|
<replaceable>B</replaceable>, <replaceable>C</replaceable> of the
|
|
data type:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>A</replaceable> is true
|
|
(<firstterm>reflexive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>B</replaceable>,
|
|
then <replaceable>B</replaceable> <literal>=</literal>
|
|
<replaceable>A</replaceable>
|
|
(<firstterm>symmetric law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>B</replaceable> and <replaceable>B</replaceable>
|
|
<literal>=</literal> <replaceable>C</replaceable>,
|
|
then <replaceable>A</replaceable> <literal>=</literal>
|
|
<replaceable>C</replaceable>
|
|
(<firstterm>transitive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
A <literal><</literal> operator must be a strong ordering relation;
|
|
that is, for all non-null values <replaceable>A</replaceable>,
|
|
<replaceable>B</replaceable>, <replaceable>C</replaceable>:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>A</replaceable> is false
|
|
(<firstterm>irreflexive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>B</replaceable>
|
|
and <replaceable>B</replaceable> <literal><</literal>
|
|
<replaceable>C</replaceable>,
|
|
then <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>C</replaceable>
|
|
(<firstterm>transitive law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Furthermore, the ordering is total; that is, for all non-null
|
|
values <replaceable>A</replaceable>, <replaceable>B</replaceable>:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
exactly one of <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>B</replaceable>, <replaceable>A</replaceable>
|
|
<literal>=</literal> <replaceable>B</replaceable>, and
|
|
<replaceable>B</replaceable> <literal><</literal>
|
|
<replaceable>A</replaceable> is true
|
|
(<firstterm>trichotomy law</firstterm>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
(The trichotomy law justifies the definition of the comparison support
|
|
function, of course.)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
The other three operators are defined in terms of <literal>=</literal>
|
|
and <literal><</literal> in the obvious way, and must act consistently
|
|
with them.
|
|
</para>
|
|
|
|
<para>
|
|
For an operator family supporting multiple data types, the above laws must
|
|
hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>,
|
|
<replaceable>C</replaceable> are taken from any data types in the family.
|
|
The transitive laws are the trickiest to ensure, as in cross-type
|
|
situations they represent statements that the behaviors of two or three
|
|
different operators are consistent.
|
|
As an example, it would not work to put <type>float8</type>
|
|
and <type>numeric</type> into the same operator family, at least not with
|
|
the current semantics that <type>numeric</type> values are converted
|
|
to <type>float8</type> for comparison to a <type>float8</type>. Because
|
|
of the limited accuracy of <type>float8</type>, this means there are
|
|
distinct <type>numeric</type> values that will compare equal to the
|
|
same <type>float8</type> value, and thus the transitive law would fail.
|
|
</para>
|
|
|
|
<para>
|
|
Another requirement for a multiple-data-type family is that any implicit
|
|
or binary-coercion casts that are defined between data types included in
|
|
the operator family must not change the associated sort ordering.
|
|
</para>
|
|
|
|
<para>
|
|
It should be fairly clear why a btree index requires these laws to hold
|
|
within a single data type: without them there is no ordering to arrange
|
|
the keys with. Also, index searches using a comparison key of a
|
|
different data type require comparisons to behave sanely across two
|
|
data types. The extensions to three or more data types within a family
|
|
are not strictly required by the btree index mechanism itself, but the
|
|
planner relies on them for optimization purposes.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-support-funcs">
|
|
<title>B-Tree Support Functions</title>
|
|
|
|
<para>
|
|
As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
|
|
one required and two optional support functions.
|
|
</para>
|
|
|
|
<para>
|
|
For each combination of data types that a btree operator family provides
|
|
comparison operators for, it must provide a comparison support function,
|
|
registered in <structname>pg_amproc</structname> with support function
|
|
number 1 and
|
|
<structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield>
|
|
equal to the left and right data types for the comparison (i.e., the
|
|
same data types that the matching operators are registered with
|
|
in <structname>pg_amop</structname>).
|
|
The comparison function must take two non-null values
|
|
<replaceable>A</replaceable> and <replaceable>B</replaceable> and
|
|
return an <type>int32</type> value that
|
|
is <literal><</literal> <literal>0</literal>, <literal>0</literal>,
|
|
or <literal>></literal> <literal>0</literal>
|
|
when <replaceable>A</replaceable> <literal><</literal>
|
|
<replaceable>B</replaceable>, <replaceable>A</replaceable>
|
|
<literal>=</literal> <replaceable>B</replaceable>,
|
|
or <replaceable>A</replaceable> <literal>></literal>
|
|
<replaceable>B</replaceable>, respectively.
|
|
A null result is disallowed: all values of the data type must be comparable.
|
|
See <filename>src/backend/access/nbtree/nbtcompare.c</filename> for
|
|
examples.
|
|
</para>
|
|
|
|
<para>
|
|
If the compared values are of a collatable data type, the appropriate
|
|
collation OID will be passed to the comparison support function, using
|
|
the standard <function>PG_GET_COLLATION()</function> mechanism.
|
|
</para>
|
|
|
|
<para>
|
|
Optionally, a btree operator family may provide <firstterm>sort
|
|
support</firstterm> function(s), registered under support function number
|
|
2. These functions allow implementing comparisons for sorting purposes
|
|
in a more efficient way than naively calling the comparison support
|
|
function. The APIs involved in this are defined in
|
|
<filename>src/include/utils/sortsupport.h</filename>.
|
|
</para>
|
|
|
|
<indexterm>
|
|
<primary>in_range support functions</primary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>support functions</primary>
|
|
<secondary>in_range</secondary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Optionally, a btree operator family may
|
|
provide <firstterm>in_range</firstterm> support function(s), registered
|
|
under support function number 3. These are not used during btree index
|
|
operations; rather, they extend the semantics of the operator family so
|
|
that it can support window clauses containing
|
|
the <literal>RANGE</literal> <replaceable>offset</replaceable>
|
|
<literal>PRECEDING</literal>
|
|
and <literal>RANGE</literal> <replaceable>offset</replaceable>
|
|
<literal>FOLLOWING</literal> frame bound types (see
|
|
<xref linkend="syntax-window-functions"/>). Fundamentally, the extra
|
|
information provided is how to add or subtract
|
|
an <replaceable>offset</replaceable> value in a way that is compatible
|
|
with the family's data ordering.
|
|
</para>
|
|
|
|
<para>
|
|
An <function>in_range</function> function must have the signature
|
|
<synopsis>
|
|
in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool)
|
|
returns bool
|
|
</synopsis>
|
|
<replaceable>val</replaceable> and <replaceable>base</replaceable> must be
|
|
of the same type, which is one of the types supported by the operator
|
|
family (i.e., a type for which it provides an ordering).
|
|
However, <replaceable>offset</replaceable> could be of a different type,
|
|
which might be one otherwise unsupported by the family. An example is
|
|
that the built-in <literal>time_ops</literal> family provides
|
|
an <function>in_range</function> function that
|
|
has <replaceable>offset</replaceable> of type <type>interval</type>.
|
|
A family can provide <function>in_range</function> functions for any of
|
|
its supported types and one or more <replaceable>offset</replaceable>
|
|
types. Each <function>in_range</function> function should be entered
|
|
in <structname>pg_amproc</structname>
|
|
with <structfield>amproclefttype</structfield> equal to <type>type1</type>
|
|
and <structfield>amprocrighttype</structfield> equal to <type>type2</type>.
|
|
</para>
|
|
|
|
<para>
|
|
The essential semantics of an <function>in_range</function> function
|
|
depend on the two Boolean flag parameters. It should add or
|
|
subtract <replaceable>base</replaceable>
|
|
and <replaceable>offset</replaceable>, then
|
|
compare <replaceable>val</replaceable> to the result, as follows:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
if <literal>!</literal><replaceable>sub</replaceable> and
|
|
<literal>!</literal><replaceable>less</replaceable>,
|
|
return <replaceable>val</replaceable> <literal>>=</literal>
|
|
(<replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <literal>!</literal><replaceable>sub</replaceable>
|
|
and <replaceable>less</replaceable>,
|
|
return <replaceable>val</replaceable> <literal><=</literal>
|
|
(<replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>sub</replaceable>
|
|
and <literal>!</literal><replaceable>less</replaceable>,
|
|
return <replaceable>val</replaceable> <literal>>=</literal>
|
|
(<replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
if <replaceable>sub</replaceable> and <replaceable>less</replaceable>,
|
|
return <replaceable>val</replaceable> <literal><=</literal>
|
|
(<replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable>)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
Before doing so, the function should check the sign
|
|
of <replaceable>offset</replaceable>: if it is less than zero, raise
|
|
error <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal> (22013)
|
|
with error text like <quote>invalid preceding or following size in window
|
|
function</quote>. (This is required by the SQL standard, although
|
|
nonstandard operator families might perhaps choose to ignore this
|
|
restriction, since there seems to be little semantic necessity for it.)
|
|
This requirement is delegated to the <function>in_range</function>
|
|
function so that the core code needn't understand what <quote>less than
|
|
zero</quote> means for a particular data type.
|
|
</para>
|
|
|
|
<para>
|
|
An additional expectation is that <function>in_range</function> functions
|
|
should, if practical, avoid throwing an error
|
|
if <replaceable>base</replaceable> <literal>+</literal>
|
|
<replaceable>offset</replaceable>
|
|
or <replaceable>base</replaceable> <literal>-</literal>
|
|
<replaceable>offset</replaceable> would overflow.
|
|
The correct comparison result can be determined even if that value would
|
|
be out of the data type's range. Note that if the data type includes
|
|
concepts such as <quote>infinity</quote> or <quote>NaN</quote>, extra care
|
|
may be needed to ensure that <function>in_range</function>'s results agree
|
|
with the normal sort order of the operator family.
|
|
</para>
|
|
|
|
<para>
|
|
The results of the <function>in_range</function> function must be
|
|
consistent with the sort ordering imposed by the operator family.
|
|
To be precise, given any fixed values of <replaceable>offset</replaceable>
|
|
and <replaceable>sub</replaceable>, then:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with <replaceable>less</replaceable> =
|
|
true is true for some <replaceable>val1</replaceable>
|
|
and <replaceable>base</replaceable>, it must be true for
|
|
every <replaceable>val2</replaceable> <literal><=</literal>
|
|
<replaceable>val1</replaceable> with the
|
|
same <replaceable>base</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with <replaceable>less</replaceable> =
|
|
true is false for some <replaceable>val1</replaceable>
|
|
and <replaceable>base</replaceable>, it must be false for
|
|
every <replaceable>val2</replaceable> <literal>>=</literal>
|
|
<replaceable>val1</replaceable> with the
|
|
same <replaceable>base</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with <replaceable>less</replaceable> =
|
|
true is true for some <replaceable>val</replaceable>
|
|
and <replaceable>base1</replaceable>, it must be true for
|
|
every <replaceable>base2</replaceable> <literal>>=</literal>
|
|
<replaceable>base1</replaceable> with the
|
|
same <replaceable>val</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
If <function>in_range</function> with <replaceable>less</replaceable> =
|
|
true is false for some <replaceable>val</replaceable>
|
|
and <replaceable>base1</replaceable>, it must be false for
|
|
every <replaceable>base2</replaceable> <literal><=</literal>
|
|
<replaceable>base1</replaceable> with the
|
|
same <replaceable>val</replaceable>.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
Analogous statements with inverted conditions hold
|
|
when <replaceable>less</replaceable> = false.
|
|
</para>
|
|
|
|
<para>
|
|
If the type being ordered (<type>type1</type>) is collatable,
|
|
the appropriate collation OID will be passed to
|
|
the <function>in_range</function> function, using the standard
|
|
PG_GET_COLLATION() mechanism.
|
|
</para>
|
|
|
|
<para>
|
|
<function>in_range</function> functions need not handle NULL inputs, and
|
|
typically will be marked strict.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="btree-implementation">
|
|
<title>Implementation</title>
|
|
|
|
<para>
|
|
An introduction to the btree index implementation can be found in
|
|
<filename>src/backend/access/nbtree/README</filename>.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|