From 0ca9907ce443c58532f1c09e716b7d9cad9039a5 Mon Sep 17 00:00:00 2001 From: Teodor Sigaev Date: Thu, 14 Sep 2006 11:16:27 +0000 Subject: [PATCH] GIN documentation and slightly improving GiST docs. Thanks to Christopher Kings-Lynne for initial version and Jeff Davis for inspection --- doc/src/sgml/config.sgml | 17 +- doc/src/sgml/filelist.sgml | 3 +- doc/src/sgml/geqo.sgml | 6 +- doc/src/sgml/gin.sgml | 231 +++++++++++++++++++++++++++ doc/src/sgml/indices.sgml | 35 +++- doc/src/sgml/mvcc.sgml | 18 ++- doc/src/sgml/ref/create_opclass.sgml | 4 +- doc/src/sgml/xindex.sgml | 101 +++++++++++- 8 files changed, 395 insertions(+), 20 deletions(-) create mode 100644 doc/src/sgml/gin.sgml diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 2dcde4c14d1..12f01d6470e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1,4 +1,4 @@ - + Server Configuration @@ -2172,7 +2172,20 @@ SELECT * FROM parent WHERE key = 2400; - + + + gin_fuzzy_search_limit (integer) + + gin_fuzzy_search_limit configuration parameter + + + + Soft upper limit on the size of the set returned by a GIN index. For more + information see . + + + + diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 23af846a67a..7a8e37dd197 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -1,4 +1,4 @@ - + @@ -78,6 +78,7 @@ + diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml index e8de838a9c9..448b1be542c 100644 --- a/doc/src/sgml/geqo.sgml +++ b/doc/src/sgml/geqo.sgml @@ -1,4 +1,4 @@ - + @@ -49,8 +49,8 @@ methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., - B-tree, hash, GiST in PostgreSQL) as access - paths for relations. + B-tree, hash, GiST and GIN in PostgreSQL) as + access paths for relations.
diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml new file mode 100644 index 00000000000..e261b0de6eb --- /dev/null +++ b/doc/src/sgml/gin.sgml @@ -0,0 +1,231 @@ + + + +GIN Indexes + + + index + GIN + + + + Introduction + + + GIN stands for Generalized Inverted Index. It is + an index structure storing a set of (key, posting list) pairs, where + 'posting list' is a set of rows in which the key occurs. The + row may contain many keys. + + + + It is generalized in the sense that a GIN index + does not need to be aware of the operation that it accelerates. + Instead, it uses custom strategies defined for particular data types. + + + + One advantage of GIN is that it allows the development + of custom data types with the appropriate access methods, by + an expert in the domain of the data type, rather than a database expert. + This is much the same advantage as using GiST. + + + + The GIN + implementation in PostgreSQL is primarily + maintained by Teodor Sigaev and Oleg Bartunov, and there is more + information on their + website. + + + + + + Extensibility + + + The GIN interface has a high level of abstraction, + requiring the access method implementer to only implement the semantics of + the data type being accessed. The GIN layer itself + takes care of concurrency, logging and searching the tree structure. + + + + All it takes to get a GIN access method working + is to implement four user-defined methods, which define the behavior of + keys in the tree. In short, GIN combines extensibility + along with generality, code reuse, and a clean interface. + + + + + + Implementation + + + Internally, GIN consists of a B-tree index constructed + over keys, where each key is an element of the indexed value + (element of array, for example) and where each tuple in a leaf page is + either a pointer to a B-tree over heap pointers (PT, posting tree), or a + list of heap pointers (PL, posting list) if the tuple is small enough. 
+ + + + There are four methods that an index operator class for + GIN must provide (prototypes are in pseudocode): + + + + + int compare( Datum a, Datum b ) + + + Compares keys (not indexed values!) and returns an integer less than + zero, zero, or greater than zero, indicating whether the first key is + less than, equal to, or greater than the second. + + + + + + Datum* extractValue(Datum inputValue, uint32 *nkeys) + + + Returns an array of keys of the value to be indexed. nkeys should + contain the number of returned keys. + + + + + + Datum* extractQuery(Datum query, uint32 nkeys, + StrategyNumber n) + + + Returns an array of keys of the query to be executed. n contains + the strategy number of the operation (see ). + Depending on n, query may be of a different type. + + + + + + bool consistent( bool check[], StrategyNumber n, Datum query) + + + Returns TRUE if the indexed value satisfies the query qualifier with strategy n + (or may satisfy it, if the operator is marked RECHECK in the operator class). + Each element of the check array is TRUE if the indexed value has a + corresponding key in the query: if (check[i] == TRUE ) the i-th key of + the query is present in the indexed value. + + + + + + + + + +GIN tips and tricks + + + + Create vs insert + + + In most cases, insertion into a GIN index is slow because + many GIN keys may be inserted for each table row. So, when loading data + in bulk, it may be useful to drop the index and recreate it + after the data is loaded in the table. + + + + + + gin_fuzzy_search_limit + + + The primary goal of developing GIN indexes was + support for highly scalable full-text search in + PostgreSQL, and there are often situations when + a full-text search returns a very large set of results. Since reading + tuples from the disk and sorting them could take a lot of time, this is + unacceptable for production. (Note that the index search itself is very + fast.) + + + Such queries usually contain very frequent words, so the results are not + very helpful.
+ To facilitate execution of such queries, + GIN has a configurable soft upper limit on the size + of the returned set, determined by the + gin_fuzzy_search_limit GUC variable. It is set to 0 by + default (no limit). + + + If a non-zero search limit is set, then the returned set is a subset of + the whole result set, chosen at random. + + + "Soft" means that the actual number of returned results could differ + slightly from the specified limit, depending on the query and the quality + of the system's random number generator. + + + + + + + + + Limitations + + + GIN doesn't support full index scans because they would be + extremely inefficient: since there are often many keys per value, + each heap pointer would be returned several times. + + + + When extractQuery returns zero keys, GIN will + emit an error: for different operator classes and strategies the semantic meaning of a void + query may be different (for example, any array contains the void array, + but no array overlaps the void array), and GIN can't + suggest a reasonable answer. + + + + GIN searches keys only by equality matching. This may + be improved in the future. + + + + Examples + + + The PostgreSQL source distribution includes + GIN classes for one-dimensional arrays of all internal + types. The following + contrib modules also contain GIN + operator classes: + + + + + intarray + + Enhanced support for int4[] + + + + + tsearch2 + + Support for inverted text indexing. This is much faster for very + large, mostly-static sets of documents. + + + + + diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index d060d044de0..ff011da53ca 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -1,4 +1,4 @@ - + Indexes @@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id); PostgreSQL provides several index types: - B-tree, Hash, and GiST. Each index type uses a different + B-tree, Hash, GIN and GiST. Each index type uses a different algorithm that is best suited to different types of queries.
By default, the CREATE INDEX command will create a B-tree index, which fits the most common situations. @@ -238,6 +238,37 @@ CREATE INDEX name ON table classes are available in the contrib collection or as separate projects. For more information see . + + + index + GIN + + + GIN + index + + GIN is an inverted index that is useful for values that have more + than one key, arrays for example. Like GiST, GIN can support + many different user-defined indexing strategies, and the particular + operators with which a GIN index can be used vary depending on the + indexing strategy. + As an example, the standard distribution of + PostgreSQL includes GIN operator classes + for one-dimensional arrays, which support indexed + queries using these operators: + + + <@ + @> + = + && + + + (See for the meaning of + these operators.) + Other GIN operator classes are available in the contrib + tsearch2 and intarray modules. For more information see . + diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index eba07a80dad..baee0d85a2e 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -1,4 +1,4 @@ - + Concurrency Control @@ -987,6 +987,20 @@ UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222; + + + + GIN indexes + + + + Short-term share/exclusive page-level locks are used for + read/write access. Locks are released immediately after each + index row is fetched or inserted. However, note that a GIN index + usually requires several inserts per table row. + + + @@ -995,7 +1009,7 @@ UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222; applications; since they also have more features than hash indexes, they are the recommended index type for concurrent applications that need to index scalar data. When dealing with - non-scalar data, B-trees are not useful, and GiST indexes should + non-scalar data, B-trees are not useful, and GiST or GIN indexes should be used instead.
diff --git a/doc/src/sgml/ref/create_opclass.sgml b/doc/src/sgml/ref/create_opclass.sgml index ed7b77b6565..50742980bcd 100644 --- a/doc/src/sgml/ref/create_opclass.sgml +++ b/doc/src/sgml/ref/create_opclass.sgml @@ -1,5 +1,5 @@ @@ -192,7 +192,7 @@ CREATE OPERATOR CLASS name [ DEFAUL The data type actually stored in the index. Normally this is the same as the column data type, but some index methods - (only GiST at this writing) allow it to be different. The + (GIN and GiST for now) allow it to be different. The STORAGE clause must be omitted unless the index method allows a different type to be used. diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index c5c34087bed..3d4ef9e2bdb 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -1,4 +1,4 @@ - + Interfacing Extensions To Indexes @@ -242,6 +242,44 @@ + + GIN indexes are similar to GiST in flexibility: they don't have a fixed set + of strategies. Instead, the consistency support routine + interprets the strategy numbers according to the operator class + definition. As an example, the strategies of the operator class for arrays + are shown in . + + + + GIN Array Strategies + + + + Operation + Strategy Number + + + + + overlap + 1 + + + contains + 2 + + + is contained by + 3 + + + equal + 4 + + + +
+ Note that all strategy operators return Boolean values. In practice, all operators defined as index method strategies must @@ -349,37 +387,84 @@ - consistent + consistent - determine whether key satisfies the + query qualifier 1 - union + union - compute union of a set of given keys 2 - compress + compress - compute a compressed representation of a key or value + to be indexed 3 - decompress + decompress - compute a decompressed representation of a + compressed key 4 - penalty + penalty - compute penalty for inserting new key into subtree + with given subtree's key 5 - picksplit + picksplit - determine which entries of a page are to be moved + to the new page and compute the union keys for resulting pages 6 - equal + equal - compare two keys and return true if they are equal + 7 + + GIN indexes require four support functions, + shown in . + + + + GIN Support Functions + + + + Function + Support Number + + + + + + compare - compare two keys and return an integer less than zero, zero, or + greater than zero, indicating whether the first key is less than, equal to, + or greater than the second. + + 1 + + + extractValue - extract keys from value to be indexed + 2 + + + extractQuery - extract keys from query + 3 + + + consistent - determine whether value matches the + query + 4 + + + + +
+ Unlike strategy operators, support functions return whichever data type the particular index method expects; for example in the case