mirror of https://github.com/postgres/postgres.git synced 2025-07-30 11:03:19 +03:00

Push index operator lossiness determination down to GIST/GIN opclass
"consistent" functions, and remove pg_amop.opreqcheck, as per recent
discussion.  The main immediate benefit of this is that we no longer need
8.3's ugly hack of requiring @@@ rather than @@ to test weight-using tsquery
searches on GIN indexes.  In future it should be possible to optimize some
other queries better than is done now, by detecting at runtime whether the
index match is exact or not.

Tom Lane, after an idea of Heikki's, and with some help from Teodor.
Tom Lane
2008-04-14 17:05:34 +00:00
parent 10be77c173
commit 9b5c8d45f6
68 changed files with 1023 additions and 785 deletions

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.164 2008/04/10 22:25:25 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/catalogs.sgml,v 2.165 2008/04/14 17:05:32 tgl Exp $ -->
<!--
Documentation of the system catalogs, directed toward PostgreSQL developers
-->
@ -606,13 +606,6 @@
<entry>Operator strategy number</entry>
</row>
<row>
<entry><structfield>amopreqcheck</structfield></entry>
<entry><type>bool</type></entry>
<entry></entry>
<entry>Index hit must be rechecked</entry>
</row>
<row>
<entry><structfield>amopopr</structfield></entry>
<entry><type>oid</type></entry>

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.429 2008/04/10 13:34:33 alvherre Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.430 2008/04/14 17:05:32 tgl Exp $ -->
<chapter id="functions">
<title>Functions and Operators</title>
@ -7738,7 +7738,7 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
</row>
<row>
<entry> <literal>@@@</literal> </entry>
<entry>same as <literal>@@</>, but see <xref linkend="textsearch-indexes"></entry>
<entry>deprecated synonym for <literal>@@</></entry>
<entry><literal>to_tsvector('fat cats ate rats') @@@ to_tsquery('cat &amp; rat')</literal></entry>
<entry><literal>t</literal></entry>
</row>

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.13 2007/11/16 03:23:07 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
<chapter id="GIN">
<title>GIN Indexes</title>
@ -111,12 +111,12 @@
</varlistentry>
<varlistentry>
<term>bool consistent(bool check[], StrategyNumber n, Datum query)</term>
<term>bool consistent(bool check[], StrategyNumber n, Datum query, bool *recheck)</term>
<listitem>
<para>
Returns TRUE if the indexed value satisfies the query operator with
strategy number <literal>n</> (or would satisfy, if the operator is
marked RECHECK in the operator class). The <literal>check</> array has
strategy number <literal>n</> (or might satisfy, if the recheck
indication is returned). The <literal>check</> array has
the same length as the number of keys previously returned by
<function>extractQuery</> for this query. Each element of the
<literal>check</> array is TRUE if the indexed value contains the
@ -124,6 +124,9 @@
<function>extractQuery</> result array is present in the indexed value.
The original <literal>query</> datum (not the extracted key array!) is
passed in case the <function>consistent</> method needs to consult it.
On success, <literal>*recheck</> should be set to TRUE if the heap
tuple needs to be rechecked against the query operator, or FALSE if
the index test is exact.
</para>
</listitem>
</varlistentry>
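
The revised GIN <function>consistent</> contract described in this hunk can be illustrated with a small simulation. This is only a sketch in Python, not the PostgreSQL C API: the names `QueryTerm`, `extract_query`, and the tuple return value are assumptions made for illustration, with the tuple's second element standing in for `*recheck`. It assumes an AND-only query and an index that stores lexemes without weight labels.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QueryTerm:
    """One term of a hypothetical AND-only text query."""
    lexeme: str
    weight: Optional[str] = None  # None means "any weight is acceptable"


def extract_query(query: List[QueryTerm]) -> List[str]:
    """Stand-in for extractQuery: the index holds lexemes only, not weights."""
    return [term.lexeme for term in query]


def consistent(check: List[bool], query: List[QueryTerm]) -> Tuple[bool, bool]:
    """Return (matches, recheck); the second item plays the role of *recheck.

    check[i] is True when key i from extract_query() is present in the
    indexed value.
    """
    matches = all(check)
    # Weights are not stored in the index, so a weighted term can only be
    # confirmed by looking at the heap row: report the match as inexact.
    recheck = matches and any(term.weight is not None for term in query)
    return matches, recheck
```

With this shape, an unweighted query is answered exactly from the index, while the same operator applied to a weighted query asks for a recheck, which is the case that previously required a separate <literal>@@@</> operator.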

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/gist.sgml,v 1.29 2007/11/13 23:36:26 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/gist.sgml,v 1.30 2008/04/14 17:05:32 tgl Exp $ -->
<chapter id="GiST">
<title>GiST Indexes</title>
@ -103,7 +103,10 @@
Given a predicate <literal>p</literal> on a tree page, and a user
query, <literal>q</literal>, this method will return false if it is
certain that both <literal>p</literal> and <literal>q</literal> cannot
be true for a given data item.
be true for a given data item. For a true result, a
<literal>recheck</> flag must also be returned; this indicates whether
the predicate implies the query (<literal>recheck</> = false) or
not (<literal>recheck</> = true).
</para>
</listitem>
</varlistentry>

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.25 2008/04/13 19:18:13 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.26 2008/04/14 17:05:32 tgl Exp $ -->
<chapter id="indexam">
<title>Index Access Method Interface Definition</title>
@ -183,7 +183,7 @@ aminsert (Relation indexRelation,
parameter. See <xref linkend="index-unique-checks"> for details.
The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE
result does not denote an error condition, but is used for cases such
as an index AM refusing to index a NULL.)
as an index method refusing to index a NULL.)
</para>
<para>
@ -430,13 +430,13 @@ amrestrpos (IndexScanDesc scan);
</para>
<para>
The operator family can indicate that the index is <firstterm>lossy</> for a
particular operator; this implies that the index scan will return all the
entries that pass the scan key, plus possibly additional entries that do
not. The core system's index-scan machinery will then apply that operator
again to the heap tuple to verify whether or not it really should be
selected. For non-lossy operators, the index scan must return exactly the
set of matching entries, as there is no recheck.
The access method can report that the index is <firstterm>lossy</>, or
requires rechecks, for a particular query. This implies that the index
scan will return all the entries that pass the scan key, plus possibly
additional entries that do not. The core system's index-scan machinery
will then apply the index conditions again to the heap tuple to verify
whether or not it really should be selected. If the recheck option is not
specified, the index scan must return exactly the set of matching entries.
</para>
<para>
@ -849,7 +849,7 @@ amcostestimate (PlannerInfo *root,
<para>
The indexSelectivity should be set to the estimated fraction of the parent
table rows that will be retrieved during the index scan. In the case
of a lossy index, this will typically be higher than the fraction of
of a lossy query, this will typically be higher than the fraction of
rows that actually pass the given qual conditions.
</para>
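
The recheck behavior described in this hunk can be sketched as a loop over index hits. This is a simplified model in Python rather than the actual executor code: `index_scan`, the `(tid, recheck)` hit pairs, and the dictionary used as a heap are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def index_scan(
    index_hits: Iterable[Tuple[int, bool]],
    heap: Dict[int, int],
    qual: Callable[[int], bool],
) -> List[int]:
    """Collect heap rows for index hits, re-applying qual only when asked.

    Each hit is (tid, recheck). A hit with recheck=False is trusted as an
    exact match; a hit with recheck=True is tested against the heap row.
    """
    results = []
    for tid, recheck in index_hits:
        row = heap[tid]
        if recheck and not qual(row):
            continue  # false positive produced by a lossy index match
        results.append(row)
    return results
```

In this model the cost of the recheck is paid per hit rather than per operator, so an access method that can prove a match exact avoids evaluating the condition a second time.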

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_opfamily.sgml,v 1.3 2007/02/14 04:30:26 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/ref/alter_opfamily.sgml,v 1.4 2008/04/14 17:05:32 tgl Exp $
PostgreSQL documentation
-->
@ -21,7 +21,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="parameter">index_method</replaceable> ADD
{ OPERATOR <replaceable class="parameter">strategy_number</replaceable> <replaceable class="parameter">operator_name</replaceable> ( <replaceable class="parameter">op_type</replaceable>, <replaceable class="parameter">op_type</replaceable> ) [ RECHECK ]
{ OPERATOR <replaceable class="parameter">strategy_number</replaceable> <replaceable class="parameter">operator_name</replaceable> ( <replaceable class="parameter">op_type</replaceable>, <replaceable class="parameter">op_type</replaceable> )
| FUNCTION <replaceable class="parameter">support_number</replaceable> [ ( <replaceable class="parameter">op_type</replaceable> [ , <replaceable class="parameter">op_type</replaceable> ] ) ] <replaceable class="parameter">funcname</replaceable> ( <replaceable class="parameter">argument_type</replaceable> [, ...] )
} [, ... ]
ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="parameter">index_method</replaceable> DROP
@ -154,18 +154,6 @@ ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="
</listitem>
</varlistentry>
<varlistentry>
<term><literal>RECHECK</></term>
<listitem>
<para>
If present, the index is <quote>lossy</> for this operator, and
so the rows retrieved using the index must be rechecked to
verify that they actually satisfy the qualification clause
involving this operator.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable class="parameter">support_number</replaceable></term>
<listitem>
@ -247,6 +235,14 @@ ALTER OPERATOR FAMILY <replaceable>name</replaceable> USING <replaceable class="
is likely to be inlined into the calling query, which will prevent
the optimizer from recognizing that the query matches an index.
</para>
<para>
Before <productname>PostgreSQL</productname> 8.4, the <literal>OPERATOR</>
clause could include a <literal>RECHECK</> option. This is no longer
supported because whether an index operator is <quote>lossy</> is now
determined on-the-fly at runtime. This allows efficient handling of
cases where an operator might or might not be lossy.
</para>
</refsect1>
<refsect1>

@ -1,5 +1,5 @@
<!--
$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.21 2007/12/03 23:49:51 tgl Exp $
$PostgreSQL: pgsql/doc/src/sgml/ref/create_opclass.sgml,v 1.22 2008/04/14 17:05:32 tgl Exp $
PostgreSQL documentation
-->
@ -22,7 +22,7 @@ PostgreSQL documentation
<synopsis>
CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAULT ] FOR TYPE <replaceable class="parameter">data_type</replaceable>
USING <replaceable class="parameter">index_method</replaceable> [ FAMILY <replaceable class="parameter">family_name</replaceable> ] AS
{ OPERATOR <replaceable class="parameter">strategy_number</replaceable> <replaceable class="parameter">operator_name</replaceable> [ ( <replaceable class="parameter">op_type</replaceable>, <replaceable class="parameter">op_type</replaceable> ) ] [ RECHECK ]
{ OPERATOR <replaceable class="parameter">strategy_number</replaceable> <replaceable class="parameter">operator_name</replaceable> [ ( <replaceable class="parameter">op_type</replaceable>, <replaceable class="parameter">op_type</replaceable> ) ]
| FUNCTION <replaceable class="parameter">support_number</replaceable> [ ( <replaceable class="parameter">op_type</replaceable> [ , <replaceable class="parameter">op_type</replaceable> ] ) ] <replaceable class="parameter">funcname</replaceable> ( <replaceable class="parameter">argument_type</replaceable> [, ...] )
| STORAGE <replaceable class="parameter">storage_type</replaceable>
} [, ... ]
@ -179,18 +179,6 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
</listitem>
</varlistentry>
<varlistentry>
<term><literal>RECHECK</></term>
<listitem>
<para>
If present, the index is <quote>lossy</> for this operator, and
so the rows retrieved using the index must be rechecked to
verify that they actually satisfy the qualification clause
involving this operator.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable class="parameter">support_number</replaceable></term>
<listitem>
@ -256,6 +244,14 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
is likely to be inlined into the calling query, which will prevent
the optimizer from recognizing that the query matches an index.
</para>
<para>
Before <productname>PostgreSQL</productname> 8.4, the <literal>OPERATOR</>
clause could include a <literal>RECHECK</> option. This is no longer
supported because whether an index operator is <quote>lossy</> is now
determined on-the-fly at runtime. This allows efficient handling of
cases where an operator might or might not be lossy.
</para>
</refsect1>
<refsect1>
@ -271,12 +267,12 @@ CREATE OPERATOR CLASS <replaceable class="parameter">name</replaceable> [ DEFAUL
CREATE OPERATOR CLASS gist__int_ops
DEFAULT FOR TYPE _int4 USING gist AS
OPERATOR 3 &amp;&amp;,
OPERATOR 6 = RECHECK,
OPERATOR 6 = (anyarray, anyarray),
OPERATOR 7 @&gt;,
OPERATOR 8 &lt;@,
OPERATOR 20 @@ (_int4, query_int),
FUNCTION 1 g_int_consistent (internal, _int4, int4),
FUNCTION 2 g_int_union (bytea, internal),
FUNCTION 1 g_int_consistent (internal, _int4, int, oid, internal),
FUNCTION 2 g_int_union (internal, internal),
FUNCTION 3 g_int_compress (internal),
FUNCTION 4 g_int_decompress (internal),
FUNCTION 5 g_int_penalty (internal, internal, internal),

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.42 2008/03/10 03:01:28 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
<chapter id="textsearch">
<title id="textsearch-title">Full Text Search</title>
@ -3142,19 +3142,7 @@ SELECT plainto_tsquery('supernovae stars');
A GiST index is <firstterm>lossy</firstterm>, meaning that the index
may produce false matches, and it is necessary
to check the actual table row to eliminate such false matches.
<productname>PostgreSQL</productname> does this automatically; for
example, in the query plan below, the <literal>Filter:</literal>
line indicates the index output will be rechecked:
<programlisting>
EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using textsearch_gidx on apod (cost=0.00..12.29 rows=2 width=1469)
Index Cond: (textsearch @@ '''supernova'''::tsquery)
Filter: (textsearch @@ '''supernova'''::tsquery)
</programlisting>
(<productname>PostgreSQL</productname> does this automatically when needed.)
GiST indexes are lossy because each document is represented in the
index by a fixed-length signature. The signature is generated by hashing
each word into a random bit in an n-bit string, with all these bits OR-ed
@ -3174,57 +3162,11 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
</para>
<para>
GIN indexes are not lossy but their performance depends logarithmically on
the number of unique words.
</para>
<para>
Actually, GIN indexes store only the words (lexemes) of <type>tsvector</>
values, and not their weight labels. Thus, while a GIN index can be
considered non-lossy for a query that does not specify weights, it is
lossy for one that does. Thus a table row recheck is needed when using
a query that involves weights. Unfortunately, in the current design of
<productname>PostgreSQL</>, whether a recheck is needed is a static
property of a particular operator, and not something that can be enabled
or disabled on-the-fly depending on the values given to the operator.
To deal with this situation without imposing the overhead of rechecks
on queries that do not need them, the following approach has been
adopted:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
The standard text match operator <literal>@@</> is marked as non-lossy
for GIN indexes.
</para>
</listitem>
<listitem>
<para>
An additional match operator <literal>@@@</> is provided, and marked
as lossy for GIN indexes. This operator behaves exactly like
<literal>@@</> otherwise.
</para>
</listitem>
<listitem>
<para>
When a GIN index search is initiated with the <literal>@@</> operator,
the index support code will throw an error if the query specifies any
weights. This protects against giving wrong answers due to failure
to recheck the weights.
</para>
</listitem>
</itemizedlist>
<para>
In short, you must use <literal>@@@</> rather than <literal>@@</> to
perform GIN index searches on queries that involve weight restrictions.
For queries that do not have weight restrictions, either operator will
work, but <literal>@@</> will be faster.
This awkwardness will probably be addressed in a future release of
<productname>PostgreSQL</>.
GIN indexes are not lossy for standard queries, but their performance
depends logarithmically on the number of unique words.
(However, GIN indexes store only the words (lexemes) of <type>tsvector</>
values, and not their weight labels. Thus a table row recheck is needed
when using a query that involves weights.)
</para>
<para>

@ -1,4 +1,4 @@
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.61 2007/12/02 04:36:40 tgl Exp $ -->
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
<sect1 id="xindex">
<title>Interfacing Extensions To Indexes</title>
@ -913,26 +913,31 @@ ALTER OPERATOR FAMILY integer_ops USING btree ADD
<para>
Normally, declaring an operator as a member of an operator class
(or family) means
that the index method can retrieve exactly the set of rows
(or family) means that the index method can retrieve exactly the set of rows
that satisfy a <literal>WHERE</> condition using the operator. For example:
<programlisting>
SELECT * FROM table WHERE integer_column &lt; 4;
</programlisting>
can be satisfied exactly by a B-tree index on the integer column.
But there are cases where an index is useful as an inexact guide to
the matching rows. For example, if a GiST index stores only
bounding boxes for objects, then it cannot exactly satisfy a <literal>WHERE</>
the matching rows. For example, if a GiST index stores only bounding boxes
for geometric objects, then it cannot exactly satisfy a <literal>WHERE</>
condition that tests overlap between nonrectangular objects such as
polygons. Yet we could use the index to find objects whose bounding
box overlaps the bounding box of the target object, and then do the
exact overlap test only on the objects found by the index. If this
scenario applies, the index is said to be <quote>lossy</> for the
operator, and we add <literal>RECHECK</> to the <literal>OPERATOR</> clause
in the <command>CREATE OPERATOR CLASS</> command.
<literal>RECHECK</> is valid if the index is guaranteed to return
all the required rows, plus perhaps some additional rows, which
can be eliminated by performing the original operator invocation.
operator. Lossy index searches are implemented by having the index
method return a <firstterm>recheck</> flag when a row might or might
not really satisfy the query condition. The core system will then
test the original query condition on the retrieved row to see whether
it should be returned as a valid match. This approach works if
the index is guaranteed to return all the required rows, plus perhaps
some additional rows, which can be eliminated by performing the original
operator invocation. The index methods that support lossy searches
(currently, GiST and GIN) allow the support functions of individual
operator classes to set the recheck flag, and so this is essentially an
operator-class feature.
</para>
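
The bounding-box scenario in this paragraph can be made concrete with a short simulation. It is a sketch under stated assumptions, written in Python rather than as a real GiST operator class: circles stand in for the nonrectangular objects, the index is a plain list of `(row_id, bounding_box)` pairs, and `Circle`, `consistent`, and `search` are hypothetical names. The support function always sets the recheck flag here because a bounding-box overlap never proves that the circles themselves overlap.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass(frozen=True)
class Circle:
    x: float
    y: float
    r: float

    def bbox(self) -> Box:
        return (self.x - self.r, self.y - self.r, self.x + self.r, self.y + self.r)


def boxes_overlap(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circles_overlap(a: Circle, b: Circle) -> bool:
    """The exact operator that the index can only approximate."""
    return math.hypot(a.x - b.x, a.y - b.y) <= a.r + b.r


def consistent(key_box: Box, query: Circle) -> Tuple[bool, bool]:
    """Return (matches, recheck) for the overlap operator."""
    if not boxes_overlap(key_box, query.bbox()):
        return False, False  # certain miss: the row cannot match
    return True, True  # possible match: the exact test is still required


def search(index: List[Tuple[int, Box]], heap: Dict[int, Circle], query: Circle) -> List[int]:
    """Use the index as an inexact guide, then recheck flagged rows."""
    found = []
    for row_id, key_box in index:
        matches, recheck = consistent(key_box, query)
        if not matches:
            continue
        if recheck and not circles_overlap(heap[row_id], query):
            continue
        found.append(row_id)
    return found
```

The approach is correct only because the bounding-box test never rejects a row that truly matches; it may admit extra rows, and those are removed by applying the original operator.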
<para>