Implement "fastupdate" support for GIN indexes, in which we try to accumulate multiple index entries in a holding area before adding them to the main index structure. This helps because bulk insert is (usually) significantly faster than retail insert for GIN.

This patch also removes GIN support for amgettuple-style index scans. The API defined for amgettuple is difficult to support with fastupdate, and the previously committed partial-match feature didn't really work with it either. We might eventually figure a way to put back amgettuple support, but it won't happen for 8.4.

catversion bumped because of change in GIN's pg_am entry, and because the format of GIN indexes changed on-disk (there's a metapage now, and possibly a pending list).

Teodor Sigaev
2025-07-28 23:42:10 +03:00 · 2009-03-24 20:17:18 +00:00
parent 9987f66001
commit ff301d6e69
30 changed files with 2012 additions and 177 deletions
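
The holding-area idea in the commit message can be illustrated with a toy model. This is a minimal sketch, not GIN's actual on-disk layout or code: the class `ToyInvertedIndex`, its `pending_limit` parameter (standing in for the work_mem threshold), and the `cleanups` counter are all hypothetical names invented for illustration.

```python
from collections import defaultdict


class ToyInvertedIndex:
    """Toy inverted index with a pending list.

    Illustration only: every name here is hypothetical and the data
    structures are not those used by PostgreSQL's GIN implementation.
    """

    def __init__(self, pending_limit=4, fastupdate=True):
        self.main = defaultdict(set)        # key -> row ids (the "main" structure)
        self.pending = []                   # unsorted (key, row_id) pairs
        self.pending_limit = pending_limit  # stand-in for the work_mem threshold
        self.fastupdate = fastupdate
        self.cleanups = 0                   # number of bulk merges performed

    def insert(self, row_id, keys):
        if not self.fastupdate:
            # "Retail" insert: touch the main structure once per key.
            for key in keys:
                self.main[key].add(row_id)
            return
        # Fast path: append to the pending list, merge only when it grows too large.
        self.pending.extend((key, row_id) for key in keys)
        if len(self.pending) > self.pending_limit:
            self.cleanup()

    def cleanup(self):
        # Bulk merge: sort the pending entries so each key is handled in one run.
        for key, row_id in sorted(self.pending):
            self.main[key].add(row_id)
        self.pending.clear()
        self.cleanups += 1

    def search(self, key):
        # A search must consult the main structure and the pending list.
        hits = set(self.main.get(key, ()))
        hits.update(row_id for k, row_id in self.pending if k == key)
        return hits


idx = ToyInvertedIndex(pending_limit=4)
idx.insert(1, ["cat", "dog"])
idx.insert(2, ["cat", "emu"])
print(sorted(idx.search("cat")), len(idx.pending), idx.cleanups)
idx.insert(3, ["dog"])  # pending list now exceeds the limit, so this insert pays for the merge
print(sorted(idx.search("dog")), len(idx.pending), idx.cleanups)
```

The sketch shows both trade-offs the documentation changes below describe: `search` gets slower as the pending list grows, and the one insert that crosses the threshold absorbs the whole cleanup cost. Calling `cleanup()` explicitly plays the role of VACUUM, and `fastupdate=False` corresponds to turning the storage parameter off.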
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.16 2008/07/22 22:05:24 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.17 2009/03/24 20:17:07 tgl Exp $ -->

 <chapter id="GIN">
 <title>GIN Indexes</title>
@@ -100,11 +100,11 @@
       to consult <literal>n</> to determine the data type of
       <literal>query</> and the key values that need to be extracted.
       The number of returned keys must be stored into <literal>*nkeys</>.
-       If the query contains no keys then <function>extractQuery</> 
+       If the query contains no keys then <function>extractQuery</>
       should store 0 or -1 into <literal>*nkeys</>, depending on the
       semantics of the operator.  0 means that every
-       value matches the <literal>query</> and a sequential scan should be 
-       produced.  -1 means nothing can match the <literal>query</>. 
+       value matches the <literal>query</> and a sequential scan should be
+       produced.  -1 means nothing can match the <literal>query</>.
       <literal>pmatch</> is an output argument for use when partial match
       is supported.  To use it, <function>extractQuery</> must allocate
       an array of <literal>*nkeys</> booleans and store its address at
@@ -188,9 +188,47 @@
  list of heap pointers (PL, posting list) if the list is small enough.
 </para>

+ <sect2 id="gin-fast-update">
+  <title>GIN fast update technique</title>
+
+  <para>
+   Updating a <acronym>GIN</acronym> index tends to be slow because of the
+   intrinsic nature of inverted indexes: inserting or updating one heap row
+   can cause many inserts into the index (one for each key extracted
+   from the indexed value). As of <productname>PostgreSQL</productname> 8.4,
+   <acronym>GIN</> is capable of postponing much of this work by inserting
+   new tuples into a temporary, unsorted list of pending entries.
+   When the table is vacuumed, or if the pending list becomes too large
+   (larger than <xref linkend="guc-work-mem">), the entries are moved to the
+   main <acronym>GIN</acronym> data structure using the same bulk insert
+   techniques used during initial index creation.  This greatly improves
+   <acronym>GIN</acronym> index update speed, even counting the additional
+   vacuum overhead.  Moreover the overhead can be done by a background
+   process instead of in foreground query processing.
+  </para>
+
+  <para>
+   The main disadvantage of this approach is that searches must scan the list
+   of pending entries in addition to searching the regular index, and so
+   a large list of pending entries will slow searches significantly.
+   Another disadvantage is that, while most updates are fast, an update
+   that causes the pending list to become <quote>too large</> will incur an
+   immediate cleanup cycle and thus be much slower than other updates.
+   Proper use of autovacuum can minimize both of these problems.
+  </para>
+
+  <para>
+   If consistent response time is more important than update speed,
+   use of pending entries can be disabled by turning off the
+   <literal>FASTUPDATE</literal> storage parameter for a
+   <acronym>GIN</acronym> index.  See <xref linkend="sql-createindex"
+   endterm="sql-createindex-title"> for details.
+  </para>
+ </sect2>
+
 <sect2 id="gin-partial-match">
  <title>Partial match algorithm</title>
-  
+
  <para>
   GIN can support <quote>partial match</> queries, in which the query
   does not determine an exact match for one or more keys, but the possible
@@ -205,14 +243,6 @@
   to be searched, or greater than zero if the index key is past the range
   that could match.
  </para>
-
-  <para>
-   During a partial-match scan, all <literal>itemPointer</>s for matching keys
-   are OR'ed into a <literal>TIDBitmap</>.
-   The scan fails if the <literal>TIDBitmap</> becomes lossy.
-   In this case an error message will be reported with advice
-   to increase <literal>work_mem</>.
-  </para>
 </sect2>

 </sect1>
@@ -225,11 +255,18 @@
   <term>Create vs insert</term>
   <listitem>
    <para>
-     In most cases, insertion into a <acronym>GIN</acronym> index is slow
+     Insertion into a <acronym>GIN</acronym> index can be slow
     due to the likelihood of many keys being inserted for each value.
     So, for bulk insertions into a table it is advisable to drop the GIN
     index and recreate it after finishing bulk insertion.
    </para>
+
+    <para>
+     As of <productname>PostgreSQL</productname> 8.4, this advice is less
+     necessary since delayed indexing is used (see <xref
+     linkend="gin-fast-update"> for details).  But for very large updates
+     it may still be best to drop and recreate the index.
+    </para>
   </listitem>
  </varlistentry>

@@ -244,6 +281,23 @@
   </listitem>
  </varlistentry>

+  <varlistentry>
+   <term><xref linkend="guc-work-mem"></term>
+   <listitem>
+    <para>
+     During a series of insertions into an existing <acronym>GIN</acronym>
+     index that has <literal>FASTUPDATE</> enabled, the system will clean up
+     the pending-entry list whenever it grows larger than
+     <varname>work_mem</>.  To avoid fluctuations in observed response time,
+     it's desirable to have pending-list cleanup occur in the background
+     (i.e., via autovacuum).  Foreground cleanup operations can be avoided by
+     increasing <varname>work_mem</> or making autovacuum more aggressive.
+     However, enlarging <varname>work_mem</> means that if a foreground
+     cleanup does occur, it will take even longer.
+    </para>
+   </listitem>
+  </varlistentry>
+
  <varlistentry>
   <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
   <listitem>
@@ -311,8 +365,7 @@
  <function>extractQuery</function> must convert an unrestricted search into
  a partial-match query that will scan the whole index.  This is inefficient
  but might be necessary to avoid corner-case failures with operators such
-  as LIKE.  Note however that failure could still occur if the intermediate
-  <literal>TIDBitmap</> becomes lossy.
+  as <literal>LIKE</>.
 </para>
 </sect1>

--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.29 2009/03/05 23:06:45 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.30 2009/03/24 20:17:08 tgl Exp $ -->

 <chapter id="indexam">
 <title>Index Access Method Interface Definition</title>
@@ -79,7 +79,7 @@
  </para>

  <para>
-   An individual index is defined by a 
+   An individual index is defined by a
   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
   entry that describes it as a physical relation, plus a
   <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
@@ -239,6 +239,16 @@ amvacuumcleanup (IndexVacuumInfo *info,
   be returned.
  </para>

+  <para>
+   As of <productname>PostgreSQL</productname> 8.4,
+   <function>amvacuumcleanup</> will also be called at completion of an
+   <command>ANALYZE</> operation.  In this case <literal>stats</> is always
+   NULL and any return value will be ignored.  This case can be distinguished
+   by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
+   that the access method do nothing except post-insert cleanup in such a
+   call, and that only in an autovacuum worker process.
+  </para>
+
  <para>
 <programlisting>
 void
@@ -344,7 +354,8 @@ amgetbitmap (IndexScanDesc scan,
 </programlisting>
   Fetch all tuples in the given scan and add them to the caller-supplied
   TIDBitmap (that is, OR the set of tuple IDs into whatever set is already
-   in the bitmap).  The number of tuples fetched is returned. 
+   in the bitmap).  The number of tuples fetched is returned (this might be
+   just an approximate count, for instance some AMs do not detect duplicates).
   While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
   indicate that rechecking of the scan conditions is required for specific
   tuple IDs.  This is analogous to the <literal>xs_recheck</> output parameter
@@ -521,14 +532,14 @@ amrestrpos (IndexScanDesc scan);
  </para>

  <para>
-   Instead of using <function>amgettuple</>, an index scan can be done with 
+   Instead of using <function>amgettuple</>, an index scan can be done with
   <function>amgetbitmap</> to fetch all tuples in one call.  This can be
   noticeably more efficient than <function>amgettuple</> because it allows
   avoiding lock/unlock cycles within the access method.  In principle
   <function>amgetbitmap</> should have the same effects as repeated
   <function>amgettuple</> calls, but we impose several restrictions to
-   simplify matters.  First of all, <function>amgetbitmap</> returns all 
-   tuples at once and marking or restoring scan positions isn't 
+   simplify matters.  First of all, <function>amgetbitmap</> returns all
+   tuples at once and marking or restoring scan positions isn't
   supported. Secondly, the tuples are returned in a bitmap which doesn't
   have any specific ordering, which is why <function>amgetbitmap</> doesn't
   take a <literal>direction</> argument.  Finally, <function>amgetbitmap</>
@@ -572,7 +583,7 @@ amrestrpos (IndexScanDesc scan);
   Aside from the index's own internal consistency requirements, concurrent
   updates create issues about consistency between the parent table (the
   <firstterm>heap</>) and the index.  Because
-   <productname>PostgreSQL</productname> separates accesses 
+   <productname>PostgreSQL</productname> separates accesses
   and updates of the heap from those of the index, there are windows in
   which the index might be inconsistent with the heap.  We handle this problem
   with the following rules:
@@ -701,7 +712,7 @@ amrestrpos (IndexScanDesc scan);
   no error should be raised.  (This case cannot occur during the
   ordinary scenario of inserting a row that's just been created by
   the current transaction.  It can happen during
-   <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.) 
+   <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
  </para>

  <para>
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -1,5 +1,5 @@
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/ref/create_index.sgml,v 1.70 2009/02/02 19:31:38 alvherre Exp $
+$PostgreSQL: pgsql/doc/src/sgml/ref/create_index.sgml,v 1.71 2009/03/24 20:17:08 tgl Exp $
 PostgreSQL documentation
 -->

@@ -294,6 +294,37 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] <replaceable class="parameter">name</re

   </variablelist>

+   <para>
+    <literal>GIN</literal> indexes accept a different parameter:
+   </para>
+
+   <variablelist>
+
+   <varlistentry>
+    <term><literal>FASTUPDATE</></term>
+    <listitem>
+    <para>
+     This setting controls usage of the fast update technique described in
+     <xref linkend="gin-fast-update">.  It is a Boolean parameter:
+     <literal>ON</> enables fast update, <literal>OFF</> disables it.
+     (Alternative spellings of <literal>ON</> and <literal>OFF</> are
+     allowed as described in <xref linkend="config-setting">.)  The
+     default is <literal>ON</>.
+    </para>
+
+    <note>
+     <para>
+      Turning <literal>FASTUPDATE</> off via <command>ALTER INDEX</> prevents
+      future insertions from going into the list of pending index entries,
+      but does not in itself flush previous entries.  You might want to
+      <command>VACUUM</> the table afterward to ensure the pending list is
+      emptied.
+     </para>
+    </note>
+    </listitem>
+   </varlistentry>
+
+   </variablelist>
  </refsect2>

  <refsect2 id="SQL-CREATEINDEX-CONCURRENTLY">
@@ -501,6 +532,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
 </programlisting>
  </para>

+  <para>
+   To create a <acronym>GIN</> index with fast updates disabled:
+<programlisting>
+CREATE INDEX gin_idx ON documents_table (locations) WITH (fastupdate = off);
+</programlisting>
+  </para>
+
  <para>
   To create an index on the column <literal>code</> in the table
   <literal>films</> and have the index reside in the tablespace
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -1,5 +1,5 @@
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/ref/vacuum.sgml,v 1.54 2008/12/11 18:16:18 tgl Exp $
+$PostgreSQL: pgsql/doc/src/sgml/ref/vacuum.sgml,v 1.55 2009/03/24 20:17:08 tgl Exp $
 PostgreSQL documentation
 -->

@@ -160,6 +160,13 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ <replaceable class="PARAMETER">
    <command>VACUUM</> cannot be executed inside a transaction block.
   </para>

+   <para>
+    For tables with <acronym>GIN</> indexes, <command>VACUUM</command> (in
+    any form) also completes any pending index insertions, by moving pending
+    index entries to the appropriate places in the main <acronym>GIN</> index
+    structure.  See <xref linkend="gin-fast-update"> for details.
+   </para>
+
   <para>
    We recommend that active production databases be
    vacuumed frequently (at least nightly), in order to
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.47 2009/01/07 22:40:49 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.48 2009/03/24 20:17:08 tgl Exp $ -->

 <chapter id="textsearch">
 <title id="textsearch-title">Full Text Search</title>
@@ -3224,12 +3224,14 @@ SELECT plainto_tsquery('supernovae stars');
    </listitem>
    <listitem>
     <para>
-      GIN indexes are about ten times slower to update than GiST
+      GIN indexes are moderately slower to update than GiST indexes, but
+      about 10 times slower if fast-update support was disabled
+      (see <xref linkend="gin-fast-update"> for details)
     </para>
    </listitem>
    <listitem>
     <para>
-      GIN indexes are two-to-three times larger than GiST
+      GIN indexes are two-to-three times larger than GiST indexes
     </para>
    </listitem>
   </itemizedlist>