Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility.

Teodor Sigaev and Oleg Bartunov
2025-07-28 23:42:10 +03:00 · 2008-05-16 16:31:02 +00:00
parent e1bdd07c3c
commit e6dbcb72fa
32 changed files with 1284 additions and 508 deletions
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->

 <chapter id="datatype">
  <title id="datatype-title">Data Types</title>
@@ -3298,18 +3298,17 @@ SELECT * FROM test;
 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
 ----------------------------------------------------
- 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
+ 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
 </programlisting>

-     (As the example shows, the sorting is first by length and then
-     alphabetically, but that detail is seldom important.)  To represent
+     To represent
     lexemes containing whitespace or punctuation, surround them with quotes:

 <programlisting>
 SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector                  
 -------------------------------------------
- 'the' '    ' 'lexeme' 'spaces' 'contains'
+ '    ' 'contains' 'lexeme' 'spaces' 'the'
 </programlisting>

     (We use dollar-quoted string literals in this example and the next one,
@@ -3320,7 +3319,7 @@ SELECT $$the lexeme '    ' contains spaces$$::tsvector;
 SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector                    
 ------------------------------------------------
- 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
+ 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
 </programlisting>

     Optionally, integer <firstterm>position(s)</>
@@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
 -------------------------------------------------------------------------------
- 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
+ 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
 </programlisting>

     A position normally indicates the source word's location in the
@@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
 select 'The Fat Rats'::tsvector;
      tsvector      
 --------------------
- 'Fat' 'The' 'Rats'
+ 'Fat' 'Rats' 'The'
 </programlisting>

     For most English-text-searching applications the above words would
@@ -3439,6 +3438,19 @@ SELECT 'fat:ab &amp; cat'::tsquery;
 </programlisting>
    </para>

+    <para>
+     Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
+     to specify prefix matching:
+<programlisting>
+SELECT 'super:*'::tsquery;
+  tsquery  
+-----------
+ 'super':*
+</programlisting>
+     This query will match any word in a <type>tsvector</> that begins
+     with <quote>super</>.
+    </para>
+
    <para>
     Quoting rules for lexemes are the same as described above for
     lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->

 <chapter id="GIN">
 <title>GIN Indexes</title>
@@ -52,15 +52,15 @@
 </para>

 <para>
-   All it takes to get a <acronym>GIN</acronym> access method working
-   is to implement four user-defined methods, which define the behavior of
+   All it takes to get a <acronym>GIN</acronym> access method working is to
+   implement four (or five) user-defined methods, which define the behavior of
   keys in the tree and the relationships between keys, indexed values,
   and indexable queries. In short, <acronym>GIN</acronym> combines
   extensibility with generality, code reuse, and a clean interface.
 </para>

 <para>
-   The four methods that an index operator class for
+   The four methods that an operator class for
   <acronym>GIN</acronym> must provide are:
 </para>

@@ -77,7 +77,7 @@
    </varlistentry>

    <varlistentry>
-     <term>Datum* extractValue(Datum inputValue, int32 *nkeys)</term>
+     <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
     <listitem>
      <para>
       Returns an array of keys given a value to be indexed.  The
@@ -87,8 +87,8 @@
    </varlistentry>

    <varlistentry>
-     <term>Datum* extractQuery(Datum query, int32 *nkeys,
-        StrategyNumber n)</term>
+     <term>Datum *extractQuery(Datum query, int32 *nkeys,
+        StrategyNumber n, bool **pmatch)</term>
     <listitem>
      <para>
       Returns an array of keys given a value to be queried; that is,
@@ -100,13 +100,22 @@
       to consult <literal>n</> to determine the data type of
       <literal>query</> and the key values that need to be extracted.
       The number of returned keys must be stored into <literal>*nkeys</>.
-       If number of keys is equal to zero then <function>extractQuery</> 
-       should store 0 or -1 into <literal>*nkeys</>. 0 means that any 
-       row matches the <literal>query</> and sequence scan should be 
-       produced. -1 means nothing can satisfy <literal>query</>. 
-       Choice of value should be based on semantics meaning of operation with 
-       given strategy number.
+       If the query contains no keys then <function>extractQuery</> 
+       should store 0 or -1 into <literal>*nkeys</>, depending on the
+       semantics of the operator.  0 means that every
+       value matches the <literal>query</> and a sequential scan should be 
+       produced.  -1 means nothing can match the <literal>query</>. 
+       <literal>pmatch</> is an output argument for use when partial match
+       is supported.  To use it, <function>extractQuery</> must allocate
+       an array of <literal>*nkeys</> booleans and store its address at
+       <literal>*pmatch</>.  Each element of the array should be set to TRUE
+       if the corresponding key requires partial match, FALSE if not.
+       If <literal>*pmatch</> is set to NULL then GIN assumes partial match
+       is not required.  The variable is initialized to NULL before call,
+       so this argument can simply be ignored by operator classes that do
+       not support partial match.
      </para>
+
     </listitem>
    </varlistentry>

@@ -133,6 +142,39 @@

  </variablelist>

+ <para>
+  Optionally, an operator class for
+  <acronym>GIN</acronym> can supply a fifth method:
+ </para>
+
+  <variablelist>
+
+    <varlistentry>
+     <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
+     <listitem>
+      <para>
+       Compare a partial-match query to an index key.  Returns an integer
+       whose sign indicates the result: less than zero means the index key
+       does not match the query, but the index scan should continue; zero
+       means that the index key does match the query; greater than zero
+       indicates that the index scan should stop because no more matches
+       are possible.  The strategy number <literal>n</> of the operator
+       that generated the partial match query is provided, in case its
+       semantics are needed to determine when to end the scan.
+      </para>
+     </listitem>
+    </varlistentry>
+
+  </variablelist>
+
+ <para>
+  To support <quote>partial match</> queries, an operator class must
+  provide the <function>comparePartial</> method, and its
+  <function>extractQuery</> method must set the <literal>pmatch</>
+  parameter when a partial-match query is encountered.  See
+  <xref linkend="gin-partial-match"> for details.
+ </para>
+
 </sect1>

 <sect1 id="gin-implementation">
@@ -146,6 +188,33 @@
  list of heap pointers (PL, posting list) if the list is small enough.
 </para>

+ <sect2 id="gin-partial-match">
+  <title>Partial match algorithm</title>
+  
+  <para>
+   GIN can support <quote>partial match</> queries, in which the query
+   does not determine an exact match for one or more keys, but the possible
+   matches fall within a reasonably narrow range of key values (within the
+   key sorting order determined by the <function>compare</> support method).
+   The <function>extractQuery</> method, instead of returning a key value
+   to be matched exactly, returns a key value that is the lower bound of
+   the range to be searched, and sets the <literal>pmatch</> flag true.
+   The key range is then searched using the <function>comparePartial</>
+   method.  <function>comparePartial</> must return zero for an actual
+   match, less than zero for a non-match that is still within the range
+   to be searched, or greater than zero if the index key is past the range
+   that could match.
+  </para>
+
+  <para>
+   During a partial-match scan, all <literal>itemPointer</>s for matching keys
+   are OR'ed into a <literal>TIDBitmap</>.
+   The scan fails if the <literal>TIDBitmap</> becomes lossy.
+   In this case an error message will be reported with advice
+   to increase <literal>work_mem</>.
+  </para>
+ </sect2>
+
 </sect1>

 <sect1 id="gin-tips">
@@ -236,8 +305,14 @@
 </para>

 <para>
-  <acronym>GIN</acronym> searches keys only by equality matching.  This might
-  be improved in future.
+  It is possible for an operator class to circumvent the restriction against
+  full index scan.  To do that, <function>extractValue</> must return at least
+  one (possibly dummy) key for every indexed value, and
+  <function>extractQuery</function> must convert an unrestricted search into
+  a partial-match query that will scan the whole index.  This is inefficient
+  but might be necessary to avoid corner-case failures with operators such
+  as LIKE.  Note however that failure could still occur if the intermediate
+  <literal>TIDBitmap</> becomes lossy.
 </para>
 </sect1>

@@ -247,9 +322,11 @@
 <para>
  The <productname>PostgreSQL</productname> source distribution includes
  <acronym>GIN</acronym> operator classes for <type>tsvector</> and
-  for one-dimensional arrays of all internal types.  The following
-  <filename>contrib</> modules also contain <acronym>GIN</acronym>
-  operator classes:
+  for one-dimensional arrays of all internal types.  Prefix searching in
+  <type>tsvector</> is implemented using the <acronym>GIN</> partial match
+  feature.
+  The following <filename>contrib</> modules also contain
+  <acronym>GIN</acronym> operator classes:
 </para>

 <variablelist>
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->

 <chapter id="textsearch">
 <title id="textsearch-title">Full Text Search</title>
@@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
 'fat' | 'rat':AB
 </programlisting>

+    Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
+
+<programlisting>
+SELECT to_tsquery('supern:*A &amp; star:A*B');
+        to_tsquery        
+--------------------------
+ 'supern':*A &amp; 'star':*AB
+</programlisting>
+
+    Such a lexeme will match any word in a <type>tsvector</> that begins
+    with the given string.
+   </para>
+
+   <para>
    <function>to_tsquery</function> can also accept single-quoted
    phrases.  This is primarily useful when the configuration includes a
    thesaurus dictionary that may trigger on such phrases.
@@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' &amp; !crab');
 </programlisting>

    Note that <function>plainto_tsquery</> cannot
-    recognize either Boolean operators or weight labels in its input:
+    recognize Boolean operators, weight labels, or prefix-match labels
+    in its input:

 <programlisting>
 SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->

 <sect1 id="xindex">
 <title>Interfacing Extensions To Indexes</title>
@@ -444,6 +444,13 @@
       <entry>consistent - determine whether value matches query condition</entry>
       <entry>4</entry>
      </row>
+      <row>
+       <entry>comparePartial - (optional method) compare partial key from
+        query and key from index, and return an integer less than zero, zero,
+        or greater than zero, indicating whether GIN should ignore this index
+        entry, treat the entry as a match, or stop the index scan</entry>
+       <entry>5</entry>
+      </row>
     </tbody>
    </tgroup>
   </table>
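
Reviewer note on the comparePartial contract documented in the gin.sgml hunk above. The three-way return value (negative: no match, keep scanning; zero: match; positive: stop) may be easier to follow with a small standalone model. The sketch below is not PostgreSQL code: it uses plain C strings instead of Datum values, a sorted array instead of the GIN entry tree, and byte-wise strcmp ordering as the assumed compare method. The names prefix_compare_partial and partial_match_scan are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a comparePartial method doing prefix matching.
 * Returns 0 if key begins with partial_key, <0 if key sorts before the
 * matching range (scan should continue), >0 if key sorts past the range
 * (scan should stop).  Assumes keys are ordered byte-wise, as by strcmp.
 */
static int
prefix_compare_partial(const char *partial_key, const char *key)
{
    size_t plen = strlen(partial_key);
    int    cmp = strncmp(partial_key, key, plen);

    if (cmp == 0)
        return 0;               /* key starts with the prefix: a match */
    if (cmp > 0)
        return -1;              /* key is below the range: keep scanning */
    return 1;                   /* key is above the range: stop the scan */
}

/*
 * Model of a partial-match scan over a sorted key array.  The prefix itself
 * serves as the lower bound of the range, mirroring what extractQuery is
 * described as returning.  Matching keys are stored in out[] (which must
 * have room for nkeys entries); the number of matches is returned.
 */
static int
partial_match_scan(const char *const *keys, int nkeys,
                   const char *prefix, const char **out)
{
    int nmatch = 0;
    int i = 0;

    assert(keys != NULL && prefix != NULL && out != NULL);

    /* Position at the first key >= prefix (the lower bound of the range). */
    while (i < nkeys && strcmp(keys[i], prefix) < 0)
        i++;

    for (; i < nkeys; i++)
    {
        int r = prefix_compare_partial(prefix, keys[i]);

        if (r > 0)
            break;              /* past the range: no more matches possible */
        if (r == 0)
            out[nmatch++] = keys[i];
        /* r < 0: not a match, but still inside the range to be searched */
    }
    return nmatch;
}
```

Called with the sorted keys "star", "sup", "super", "superb", "supernova", "supper", "tiger" and the prefix "super", the scan would start at "super", collect "super", "superb" and "supernova", and stop at "supper" without ever visiting "tiger". That early stop is the point of the positive return value.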