mirror of
https://github.com/postgres/postgres.git
synced 2025-07-28 23:42:10 +03:00
Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility. Teodor Sigaev and Oleg Bartunov
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->
|
||||
|
||||
<chapter id="datatype">
|
||||
<title id="datatype-title">Data Types</title>
|
||||
@ -3298,18 +3298,17 @@ SELECT * FROM test;
|
||||
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
|
||||
tsvector
|
||||
----------------------------------------------------
|
||||
'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
|
||||
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
|
||||
</programlisting>
|
||||
|
||||
(As the example shows, the sorting is first by length and then
|
||||
alphabetically, but that detail is seldom important.) To represent
|
||||
To represent
|
||||
lexemes containing whitespace or punctuation, surround them with quotes:
|
||||
|
||||
<programlisting>
|
||||
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
|
||||
tsvector
|
||||
-------------------------------------------
|
||||
'the' ' ' 'lexeme' 'spaces' 'contains'
|
||||
' ' 'contains' 'lexeme' 'spaces' 'the'
|
||||
</programlisting>
|
||||
|
||||
(We use dollar-quoted string literals in this example and the next one,
|
||||
@ -3320,7 +3319,7 @@ SELECT $$the lexeme ' ' contains spaces$$::tsvector;
|
||||
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
|
||||
tsvector
|
||||
------------------------------------------------
|
||||
'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
|
||||
'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
|
||||
</programlisting>
|
||||
|
||||
Optionally, integer <firstterm>position(s)</>
|
||||
@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
|
||||
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
|
||||
tsvector
|
||||
-------------------------------------------------------------------------------
|
||||
'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
|
||||
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
|
||||
</programlisting>
|
||||
|
||||
A position normally indicates the source word's location in the
|
||||
@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
|
||||
select 'The Fat Rats'::tsvector;
|
||||
tsvector
|
||||
--------------------
|
||||
'Fat' 'The' 'Rats'
|
||||
'Fat' 'Rats' 'The'
|
||||
</programlisting>
|
||||
|
||||
For most English-text-searching applications the above words would
|
||||
@ -3439,6 +3438,19 @@ SELECT 'fat:ab & cat'::tsquery;
|
||||
</programlisting>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
|
||||
to specify prefix matching:
|
||||
<programlisting>
|
||||
SELECT 'super:*'::tsquery;
|
||||
tsquery
|
||||
-----------
|
||||
'super':*
|
||||
</programlisting>
|
||||
This query will match any word in a <type>tsvector</> that begins
|
||||
with <quote>super</>.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Quoting rules for lexemes are the same as described above for
|
||||
lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
|
||||
|
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->
|
||||
|
||||
<chapter id="GIN">
|
||||
<title>GIN Indexes</title>
|
||||
@ -52,15 +52,15 @@
|
||||
</para>
|
||||
|
||||
<para>
|
||||
All it takes to get a <acronym>GIN</acronym> access method working
|
||||
is to implement four user-defined methods, which define the behavior of
|
||||
All it takes to get a <acronym>GIN</acronym> access method working is to
|
||||
implement four (or five) user-defined methods, which define the behavior of
|
||||
keys in the tree and the relationships between keys, indexed values,
|
||||
and indexable queries. In short, <acronym>GIN</acronym> combines
|
||||
extensibility with generality, code reuse, and a clean interface.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The four methods that an index operator class for
|
||||
The four methods that an operator class for
|
||||
<acronym>GIN</acronym> must provide are:
|
||||
</para>
|
||||
|
||||
@ -77,7 +77,7 @@
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Datum* extractValue(Datum inputValue, int32 *nkeys)</term>
|
||||
<term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Returns an array of keys given a value to be indexed. The
|
||||
@ -87,8 +87,8 @@
|
||||
</varlistentry>
|
||||
|
||||
<varlistentry>
|
||||
<term>Datum* extractQuery(Datum query, int32 *nkeys,
|
||||
StrategyNumber n)</term>
|
||||
<term>Datum *extractQuery(Datum query, int32 *nkeys,
|
||||
StrategyNumber n, bool **pmatch)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Returns an array of keys given a value to be queried; that is,
|
||||
@ -100,13 +100,22 @@
|
||||
to consult <literal>n</> to determine the data type of
|
||||
<literal>query</> and the key values that need to be extracted.
|
||||
The number of returned keys must be stored into <literal>*nkeys</>.
|
||||
If number of keys is equal to zero then <function>extractQuery</>
|
||||
should store 0 or -1 into <literal>*nkeys</>. 0 means that any
|
||||
row matches the <literal>query</> and sequence scan should be
|
||||
produced. -1 means nothing can satisfy <literal>query</>.
|
||||
Choice of value should be based on semantics meaning of operation with
|
||||
given strategy number.
|
||||
If the query contains no keys then <function>extractQuery</>
|
||||
should store 0 or -1 into <literal>*nkeys</>, depending on the
|
||||
semantics of the operator. 0 means that every
|
||||
value matches the <literal>query</> and a sequential scan should be
|
||||
produced. -1 means nothing can match the <literal>query</>.
|
||||
<literal>pmatch</> is an output argument for use when partial match
|
||||
is supported. To use it, <function>extractQuery</> must allocate
|
||||
an array of <literal>*nkeys</> booleans and store its address at
|
||||
<literal>*pmatch</>. Each element of the array should be set to TRUE
|
||||
if the corresponding key requires partial match, FALSE if not.
|
||||
If <literal>*pmatch</> is set to NULL then GIN assumes partial match
|
||||
is not required. The variable is initialized to NULL before the call,
|
||||
so this argument can simply be ignored by operator classes that do
|
||||
not support partial match.
|
||||
</para>
|
||||
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
@ -133,6 +142,39 @@
|
||||
|
||||
</variablelist>
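The <literal>pmatch</> contract described for <function>extractQuery</> above can be sketched in plain C. This is a minimal illustration, not the PostgreSQL implementation: it assumes C strings in place of <literal>Datum</>, <function>malloc</> in place of the backend allocator, a made-up convention that a query term ending in <literal>*</> requests partial match, and that an empty query matches nothing (-1). The name <function>sketch_extract_query</> is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for extractQuery.  Plain C strings replace Datum,
 * and a trailing '*' on a term is an assumed marker for partial match.
 * The caller is expected to set *pmatch to NULL before the call.
 */
char **
sketch_extract_query(const char *const *terms, int nterms,
                     int *nkeys, bool **pmatch)
{
    assert(nkeys != NULL && pmatch != NULL);

    if (nterms <= 0)
    {
        /* Assumption for this sketch: an empty query matches nothing. */
        *nkeys = -1;
        return NULL;
    }

    char  **keys = malloc((size_t) nterms * sizeof(char *));
    bool   *flags = malloc((size_t) nterms * sizeof(bool));
    bool    any_partial = false;

    if (keys == NULL || flags == NULL)
        abort();

    for (int i = 0; i < nterms; i++)
    {
        size_t  len = strlen(terms[i]);
        bool    partial = (len > 0 && terms[i][len - 1] == '*');

        if (partial)
            len--;              /* drop the '*' marker from the key */

        keys[i] = malloc(len + 1);
        if (keys[i] == NULL)
            abort();
        memcpy(keys[i], terms[i], len);
        keys[i][len] = '\0';

        flags[i] = partial;     /* one flag per returned key */
        any_partial = any_partial || partial;
    }

    *nkeys = nterms;
    if (any_partial)
        *pmatch = flags;        /* array of *nkeys booleans */
    else
        free(flags);            /* leave *pmatch as the caller's NULL */

    return keys;
}
```

When no term requests partial match, the sketch leaves <literal>*pmatch</> untouched, which mirrors the statement above that operator classes without partial-match support can ignore the argument.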
|
||||
|
||||
<para>
|
||||
Optionally, an operator class for
|
||||
<acronym>GIN</acronym> can supply a fifth method:
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
||||
<varlistentry>
|
||||
<term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Compare a partial-match query to an index key. Returns an integer
|
||||
whose sign indicates the result: less than zero means the index key
|
||||
does not match the query, but the index scan should continue; zero
|
||||
means that the index key does match the query; greater than zero
|
||||
indicates that the index scan should stop because no more matches
|
||||
are possible. The strategy number <literal>n</> of the operator
|
||||
that generated the partial match query is provided, in case its
|
||||
semantics are needed to determine when to end the scan.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
</variablelist>
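The three-way result of <function>comparePartial</> described above can be illustrated with a prefix comparison. This is only a sketch under stated assumptions: keys are NUL-terminated C strings ordered bytewise (as by <function>strcmp</>), the strategy number is omitted, and the name <function>sketch_compare_partial</> is hypothetical rather than a function in the PostgreSQL source.

```c
#include <string.h>

/*
 * Hypothetical prefix-style comparePartial.
 *
 * Returns 0 if `key` begins with `partial_key` (a match), a negative value
 * if `key` sorts before the prefix range (no match, keep scanning), and a
 * positive value if `key` sorts after every possible match (stop the scan).
 * Assumes keys are ordered bytewise, as by strcmp().
 */
int
sketch_compare_partial(const char *partial_key, const char *key)
{
    size_t  plen = strlen(partial_key);
    int     cmp = strncmp(key, partial_key, plen);

    if (cmp == 0)
        return 0;               /* key starts with the prefix */

    return (cmp > 0) ? 1 : -1;  /* past the range, or still before it */
}
```

Because all keys sharing a prefix are contiguous in bytewise order, the first key that compares greater than the prefix proves that no later key can match, which is what makes the early stop safe in this sketch.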
|
||||
|
||||
<para>
|
||||
To support <quote>partial match</> queries, an operator class must
|
||||
provide the <function>comparePartial</> method, and its
|
||||
<function>extractQuery</> method must set the <literal>pmatch</>
|
||||
parameter when a partial-match query is encountered. See
|
||||
<xref linkend="gin-partial-match"> for details.
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="gin-implementation">
|
||||
@ -146,6 +188,33 @@
|
||||
list of heap pointers (PL, posting list) if the list is small enough.
|
||||
</para>
|
||||
|
||||
<sect2 id="gin-partial-match">
|
||||
<title>Partial match algorithm</title>
|
||||
|
||||
<para>
|
||||
GIN can support <quote>partial match</> queries, in which the query
|
||||
does not determine an exact match for one or more keys, but the possible
|
||||
matches fall within a reasonably narrow range of key values (within the
|
||||
key sorting order determined by the <function>compare</> support method).
|
||||
The <function>extractQuery</> method, instead of returning a key value
|
||||
to be matched exactly, returns a key value that is the lower bound of
|
||||
the range to be searched, and sets the <literal>pmatch</> flag true.
|
||||
The key range is then searched using the <function>comparePartial</>
|
||||
method. <function>comparePartial</> must return zero for an actual
|
||||
match, less than zero for a non-match that is still within the range
|
||||
to be searched, or greater than zero if the index key is past the range
|
||||
that could match.
|
||||
</para>
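The search described above can be sketched over a sorted in-memory array standing in for the index keys. This is an illustration only, not the GIN implementation: the array, the binary search for the lower bound, the function names, and the bytewise key order are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Callback type standing in for the comparePartial support method. */
typedef int (*sketch_partial_cmp) (const char *partial_key, const char *key);

/* Hypothetical prefix comparison: 0 = match, <0 = continue, >0 = stop. */
int
sketch_prefix_cmp(const char *partial_key, const char *key)
{
    int cmp = strncmp(key, partial_key, strlen(partial_key));

    return (cmp == 0) ? 0 : ((cmp > 0) ? 1 : -1);
}

/*
 * Scan `sorted_keys` (ascending strcmp order) starting at the first key
 * that is >= lower_bound, and record the positions of matching keys in
 * `matches`.  Returns the number of matches recorded.
 */
size_t
sketch_partial_scan(const char *const *sorted_keys, size_t nkeys,
                    const char *lower_bound, sketch_partial_cmp cmp,
                    size_t *matches, size_t max_matches)
{
    size_t  lo = 0;
    size_t  hi = nkeys;
    size_t  nmatches = 0;

    /* Locate the lower bound of the range, as an exact lookup would. */
    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (strcmp(sorted_keys[mid], lower_bound) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Walk forward until the comparison says no more matches are possible. */
    for (size_t i = lo; i < nkeys; i++)
    {
        int c = cmp(lower_bound, sorted_keys[i]);

        if (c > 0)
            break;              /* past the range that could match */
        if (c == 0)
        {
            if (nmatches == max_matches)
                break;          /* output buffer is full */
            matches[nmatches++] = i;
        }
        /* c < 0: not a match, but still inside the range; keep going */
    }

    return nmatches;
}
```

For example, scanning the keys "star", "sun", "super", "superman", "supernova", "tiger" with the lower bound "super" starts at "super" and stops at "tiger", recording the three keys in between.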
|
||||
|
||||
<para>
|
||||
During a partial-match scan, all <literal>itemPointer</>s for matching keys
|
||||
are OR'ed into a <literal>TIDBitmap</>.
|
||||
The scan fails if the <literal>TIDBitmap</> becomes lossy.
|
||||
In this case an error message will be reported with advice
|
||||
to increase <literal>work_mem</>.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<sect1 id="gin-tips">
|
||||
@ -236,8 +305,14 @@
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<acronym>GIN</acronym> searches keys only by equality matching. This might
|
||||
be improved in future.
|
||||
It is possible for an operator class to circumvent the restriction against
|
||||
full index scans. To do that, <function>extractValue</> must return at least
|
||||
one (possibly dummy) key for every indexed value, and
|
||||
<function>extractQuery</function> must convert an unrestricted search into
|
||||
a partial-match query that will scan the whole index. This is inefficient
|
||||
but might be necessary to avoid corner-case failures with operators such
|
||||
as LIKE. Note however that failure could still occur if the intermediate
|
||||
<literal>TIDBitmap</> becomes lossy.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
@ -247,9 +322,11 @@
|
||||
<para>
|
||||
The <productname>PostgreSQL</productname> source distribution includes
|
||||
<acronym>GIN</acronym> operator classes for <type>tsvector</> and
|
||||
for one-dimensional arrays of all internal types. The following
|
||||
<filename>contrib</> modules also contain <acronym>GIN</acronym>
|
||||
operator classes:
|
||||
for one-dimensional arrays of all internal types. Prefix searching in
|
||||
<type>tsvector</> is implemented using the <acronym>GIN</> partial match
|
||||
feature.
|
||||
The following <filename>contrib</> modules also contain
|
||||
<acronym>GIN</acronym> operator classes:
|
||||
</para>
|
||||
|
||||
<variablelist>
|
||||
|
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->
|
||||
|
||||
<chapter id="textsearch">
|
||||
<title id="textsearch-title">Full Text Search</title>
|
||||
@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
|
||||
'fat' | 'rat':AB
|
||||
</programlisting>
|
||||
|
||||
Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
|
||||
|
||||
<programlisting>
|
||||
SELECT to_tsquery('supern:*A & star:A*B');
|
||||
to_tsquery
|
||||
--------------------------
|
||||
'supern':*A & 'star':*AB
|
||||
</programlisting>
|
||||
|
||||
Such a lexeme will match any word in a <type>tsvector</> that begins
|
||||
with the given string.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
<function>to_tsquery</function> can also accept single-quoted
|
||||
phrases. This is primarily useful when the configuration includes a
|
||||
thesaurus dictionary that may trigger on such phrases.
|
||||
@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' & !crab');
|
||||
</programlisting>
|
||||
|
||||
Note that <function>plainto_tsquery</> cannot
|
||||
recognize either Boolean operators or weight labels in its input:
|
||||
recognize Boolean operators, weight labels, or prefix-match labels
|
||||
in its input:
|
||||
|
||||
<programlisting>
|
||||
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
|
||||
|
@ -1,4 +1,4 @@
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
|
||||
<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->
|
||||
|
||||
<sect1 id="xindex">
|
||||
<title>Interfacing Extensions To Indexes</title>
|
||||
@ -444,6 +444,13 @@
|
||||
<entry>consistent - determine whether value matches query condition</entry>
|
||||
<entry>4</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>comparePartial - (optional method) compare partial key from
|
||||
query and key from index, and return an integer less than zero, zero,
|
||||
or greater than zero, indicating whether GIN should ignore this index
|
||||
entry, treat the entry as a match, or stop the index scan</entry>
|
||||
<entry>5</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|