txtidx datatype for full text indexing with GiST.

From Oleg Bartunov and Teodor Sigaev.
2025-07-31 22:04:40 +03:00 · 2001-10-12 23:19:09 +00:00
parent c24216bea8
commit b57705673d
27 changed files with 24874 additions and 5 deletions
--- a/contrib/tsearch/README.tsearch
+++ b/contrib/tsearch/README.tsearch
@ -0,0 +1,365 @@
+Tsearch contrib module contains implementation of new data type txtidx - 
+a searchable data type (textual) with indexed access.
+
+All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
+(oleg@sai.msu.su).
+
+IMPORTANT NOTICE:
+
+This is a first step of our work on integration of OpenFTS
+full text search engine (http://openfts.sourceforge.net/) into 
+PostgreSQL. It's based on our recent development of GiST 
+(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
+at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
+and will works only for PostgreSQL version 7.2 and later.
+
+We didn't try to implement a full-featured search engine with 
+stable interfaces but rather experiment with various approaches.
+There are many issues remains (most of them just not documented or
+implemented) but we'd like to present a working prototype
+of full text search engine fully integrated into PostgreSQL to 
+collect user's feedback and recommendations. 
+
+INSTALLATION:
+
+   cd contrib/tsearch
+   gmake
+   gmake install
+
+REGRESSION TEST:
+
+   gmake installcheck
+
+USAGE:
+
+ psql DATABASE < tsearch.sql (from contrib/tsearch)
+  
+
+INTRODUCTION:
+
+This module provides an implementation of a new data type 'txtidx' which is
+a string of a space separated "words". "Words" with spaces 
+should be enclosed in apostrophes and apostrophes inside a "word" should be 
+escaped by backslash.
+
+This is quite different from OpenFTS approach which uses array
+of integers (ID of lexems) and requires storing of lexem-id pairs in database. 
+One of the prominent benefit of this new approach is that  it's possible now
+to perform full text search in a 'natural' way. 
+
+Some examples:
+
+  create table foo (
+	titleidx  txtidx
+  );
+
+ 2 regular words:
+   insert into foo values ( 'the are' );
+ Word with space:
+   insert into foo values ( 'the\\ are' );
+ Words with apostrophe:
+   insert into foo values ( 'value\'s this' );
+ Complex word with apostrophe:
+   insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );
+
+   select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
+   select * from foo where titleidx @@ 'value\'s | this';
+   select * from foo where titleidx @@ '(the|this)&!we';
+
+test=# select 'two words'::txtidx;
+    txtidx     
+---------------
+ 'two' 'words'
+(1 row)
+
+test=# select 'single\\ word'::txtidx;
+    txtidx     
+---------------
+ 'single word'
+(1 row)
+
+
+FULL TEXT SEARCH:
+
+The basic idea of this data type is to use it for full text search inside
+database. If you have a 'text' column title and corresponding column 
+titleidx of type 'txtidx', which contains the same information from 
+text column, then search on title could be replaced by
+searching on titleidx which would be fast because of indexed access.
+
+As a real life example consider database with table 'titles' containing
+titles of mailing list postings in column 'title':
+
+  create table titles (
+        title  text
+ 
+ );
+
+Suppose, you already have a lot of titles and want to do full text search
+on them.
+
+First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
+Add column 'titleidx' of type txtidx, containing space separated words from 
+title. It's possible to use function txt2txtidx(title) to fill 'titleidx' 
+column (see notice 1):
+  
+  -- add titleidx column of type txtidx 
+  alter table titles add titleidx txtidx;
+  update titles set titleidx=txt2txtidx(title);
+
+Create index on titleidx:
+  create index t_idx on titles using gist(titleidx);
+
+and now you can search all titles with words 'patch' and 'gist':
+  select title from titles where titleidx ## 'patch&gist';
+
+Here, ## is a new operation defined for type 'txtidx' which could use index 
+(if exists) built on titleidx. This operator uses morphology to
+expand query, i.e. 
+  ## 'patches&gist' will find titles with 'patch' and 'gist' also.
+If you want to provide query as is, use operator @@ instead:
+  select title from titles where titleidx @@ 'patch&gist';
+but remember, that function txt2txtidx does uses morphology, so you need
+to fill column 'titleidx' using some another way. We hope in future releases
+provide more consistent and convenient interfaces.
+
+Query could contains boolean operators &,|,!,() with their usual meaning,
+for example: 'patch&gist&!cvs', 'patch|cvs'.
+Each operation ( ##, @@ ) requires appropriate query type - 
+ txtidx ## mquery_txt
+ txtidx @@ query_txt
+
+To see what query actually will be used :
+
+test=# select 'patches&gist'::mquery_txt;
+    mquery_txt    
+------------------
+ 'patch' & 'gist'
+(1 row)
+
+test=# select 'patches&gist'::query_txt;
+     query_txt      
+--------------------
+ 'patches' & 'gist'
+(1 row)
+
+Notice the difference !
+
+You could use trigger to be sure column 'titleidx' is consistent
+with any changes in column 'title':
+
+  create trigger txtidxupdate before update or insert on titles
+  for each row execute procedure tsearch(titleidx, title);
+
+This trigger uses the same parser, dictionaries as function
+txt2txtidx (see notice 1).
+Current syntax allows creating trigger for several columns
+you want to be searchable:
+
+  create trigger txtidxupdate before update or insert on titles
+  for each row execute procedure tsearch(titleidx, title1, title2,... );
+
+Use function txtidxsize(titleidx) to get the number of "words" in column
+titleidx. To get total number of words in table titles:
+
+test=# select sum(txtidxsize(titleidx)) from titles;
+   sum   
+---------
+ 1917182
+(1 row)
+
+NOTICES:
+
+1.
+function txt2txtidx and trigger use parser, dictionaries coming with 
+this contrib module on default. Parser is mostly the same as in OpenFTS and 
+dictionaries are simple stemmers (sort of Lovin's stemmer which uses a 
+longest match algorithm.) for english and russian languages. There is a perl
+script makedict/makedict.pl, which could be used to create specific
+dictionaries from files with endings and stop-words. 
+Example files for english and russian languages are available 
+from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
+Run script without parameters to see information about arguments and options. 
+
+Example:
+cd makedict
+./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
+              -o ../dict/YOURDICT.dct
+
+Another options of makedict.pl:
+-f do not execute tolower for any char
+-a function of checking stopword will be work after lemmatize, 
+   default is before
+
+You need to edit dict.h to use your dictionary and, probably, 
+morph.c to change mapdict array.
+
+Don't forget to do 
+  make clean; make; make install
+
+2.
+As it was mentioned above we don't use explicitly ID of lexems
+as in OpenFTS but use hash function (crc32) instead to map lexem to
+integer. Our experiments show that probability of collision is quite small:
+for english text it's about 10**(-6) and 10**(-5) for russian collection.
+Default installation doesn't check for collisions but if your application
+does need to guarantee an exact (no collisions) search, you need 
+to update system table to mark index islossy:
+
+  update pg_amop set amopreqcheck = true where amopclaid = 
+      (select oid from pg_opclass where opcname = 'gist_txtidx_ops');
+
+If you don't bother about collisions :
+
+  update pg_amop set amopreqcheck = false where amopclaid = 
+      (select oid from pg_opclass where opcname = 'gist_txtidx_ops');
+
+3.
+txtidx doesn't preserve words ordering (this is not critical for searching)
+for performance reason, for example:
+
+test=# select 'page two'::txtidx;
+    txtidx    
+--------------
+ 'two' 'page'
+(1 row)
+
+4. 
+Indexed access provided by txtidx data type isn't always good
+because of internal data structure we use (RD-Tree). Particularly,
+queries like '!gist' will be  slower than just a sequential scan,
+because for such queries RD-Tree doesn't provides selectivity on internal 
+nodes and all checks should be processed at leaf nodes, i.t. scan of
+full index. You may play with function query_tree to see how effective
+will be index usage:
+
+test=# select querytree( 'patch&gist'::query_txt );
+    querytree     
+------------------
+ 'patch' & 'gist'
+(1 row)
+
+This is an example of "good" query - index will effective for both words.
+
+test=# select querytree( 'patch&!gist'::query_txt );
+ querytree 
+-----------
+ 'patch'
+(1 row)
+
+This means that index is effective only to search word 'patch' and resulted
+rows will be checked against '!gist'.
+
+test=# select querytree( 'patch|!gist'::query_txt );
+ querytree 
+-----------
+ T
+(1 row)
+
+test=# select querytree( '!gist'::query_txt );
+ querytree 
+-----------
+ T
+(1 row)
+
+These two queries will be processed by scanning of full index !
+Very slow !
+
+5.
+Following selects produce the same result
+
+  select title from titles where titleidx @@ 'patch&gist';
+  select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';
+
+but the former will be more effective, because of internal optimization
+of query executor.
+
+
+TODO:
+
+Better configurability (as in OpenFTS)
+User's interfaces to parser, dictionaries ...
+Write documentation
+
+
+BENCHMARKS:
+
+We use test collection in our experiments which contains 377905
+titles from various mailing lists stored in our mailware
+project. 
+
+All runs were performed on IBM ThinkPad T21 notebook with 
+PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
+We didn't do extensive benchmarking and all
+numbers provide for illustration. Actual performance
+is strongly depends on many factors (query, collection, dictionaries 
+and hardware).
+
+Collection is available for download from
+http://www.sai.msu.su/~megera/postgres/gist/tsearch/ 
+as mw_titles.gz (about 3Mb).
+
+0. install contrib/tsearch module
+1. createdb test
+2. psql test < tsearch.sql (from contrib/tsearch)
+3. zcat mw_titles.gz | psql test 
+   (it will creates table, copy test data and creates index)
+
+Database contains one table:
+
+test=# \d titles
+                Table "titles"
+  Column  |          Type          | Modifiers 
+----------+------------------------+-----------
+ title    | character varying(256) | 
+ titleidx | txtidx                 | 
+Indexes: t_idx
+
+Index was created as:
+   create index t_idx on titles using gist(titleidx);
+   (notice: this operation takes about 14 minutes on my notebook)
+
+Typical select looks like:
+   select title from titles where titleidx @@ 'patch&gist';
+
+Total number of lexems in collection : 1917182
+
+1. We trust index  - we consider index is exact and no
+   checking against tuples is necessary.
+
+   update pg_amop set amopreqcheck = false where amopclaid = 
+   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');
+
+using gist indices
+1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
+2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
+3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
+using gist indices (morph)
+1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
+2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
+3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
+disable gist index
+1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
+2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
+3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
+traditional like
+1: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.206s 0.10%
+2: title ~* 'gist' and title ~* 'patch'  0.000u 0.010s 0m9.205s 0.10%
+3: title ~* 'gist' and title ~* 'patch'  0.010u 0.000s 0m9.208s 0.10%
+
+2. Need to check results against tuples to avoid possible hash collision.
+
+   update pg_amop set amopreqcheck = true where amopclaid = 
+   (select oid from pg_opclass where opcname = 'gist_txtidx_ops');
+
+using gist indices
+1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
+2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
+3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
+using gist indices (morph)
+1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
+2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
+3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%
+
+There are no visible difference between these 2 cases but your
+mileage may vary.