mirror of
https://github.com/postgres/postgres.git
synced 2025-07-31 22:04:40 +03:00
txtidx datatype for full text indexing with GiST.
From Oleg Bartunov and Teodor Sigaev.
This commit is contained in:
365
contrib/tsearch/README.tsearch
Normal file
365
contrib/tsearch/README.tsearch
Normal file
@ -0,0 +1,365 @@
|
||||
Tsearch contrib module contains implementation of new data type txtidx -
|
||||
a searchable data type (textual) with indexed access.
|
||||
|
||||
All work was done by Teodor Sigaev (teodor@stack.net) and Oleg Bartunov
|
||||
(oleg@sai.msu.su).
|
||||
|
||||
IMPORTANT NOTICE:
|
||||
|
||||
This is a first step of our work on integration of OpenFTS
|
||||
full text search engine (http://openfts.sourceforge.net/) into
|
||||
PostgreSQL. It's based on our recent development of GiST
|
||||
(Generalized Search Tree) for PostgreSQL 7.2 (see our GiST page
|
||||
at http://www.sai.msu.su/~megera/postgres/gist/ for info about GiST)
|
||||
and will works only for PostgreSQL version 7.2 and later.
|
||||
|
||||
We didn't try to implement a full-featured search engine with
|
||||
stable interfaces but rather experiment with various approaches.
|
||||
There are many issues remains (most of them just not documented or
|
||||
implemented) but we'd like to present a working prototype
|
||||
of full text search engine fully integrated into PostgreSQL to
|
||||
collect user's feedback and recommendations.
|
||||
|
||||
INSTALLATION:
|
||||
|
||||
cd contrib/tsearch
|
||||
gmake
|
||||
gmake install
|
||||
|
||||
REGRESSION TEST:
|
||||
|
||||
gmake installcheck
|
||||
|
||||
USAGE:
|
||||
|
||||
psql DATABASE < tsearch.sql (from contrib/tsearch)
|
||||
|
||||
|
||||
INTRODUCTION:
|
||||
|
||||
This module provides an implementation of a new data type 'txtidx' which is
|
||||
a string of a space separated "words". "Words" with spaces
|
||||
should be enclosed in apostrophes and apostrophes inside a "word" should be
|
||||
escaped by backslash.
|
||||
|
||||
This is quite different from OpenFTS approach which uses array
|
||||
of integers (ID of lexems) and requires storing of lexem-id pairs in database.
|
||||
One of the prominent benefit of this new approach is that it's possible now
|
||||
to perform full text search in a 'natural' way.
|
||||
|
||||
Some examples:
|
||||
|
||||
create table foo (
|
||||
titleidx txtidx
|
||||
);
|
||||
|
||||
2 regular words:
|
||||
insert into foo values ( 'the are' );
|
||||
Word with space:
|
||||
insert into foo values ( 'the\\ are' );
|
||||
Words with apostrophe:
|
||||
insert into foo values ( 'value\'s this' );
|
||||
Complex word with apostrophe:
|
||||
insert into foo values ( 'value\'s this we \'PostgreSQL site\'' );
|
||||
|
||||
select * from foo where titleidx @@ '\'PostgreSQL site\' | this';
|
||||
select * from foo where titleidx @@ 'value\'s | this';
|
||||
select * from foo where titleidx @@ '(the|this)&!we';
|
||||
|
||||
test=# select 'two words'::txtidx;
|
||||
txtidx
|
||||
---------------
|
||||
'two' 'words'
|
||||
(1 row)
|
||||
|
||||
test=# select 'single\\ word'::txtidx;
|
||||
txtidx
|
||||
---------------
|
||||
'single word'
|
||||
(1 row)
|
||||
|
||||
|
||||
FULL TEXT SEARCH:
|
||||
|
||||
The basic idea of this data type is to use it for full text search inside
|
||||
database. If you have a 'text' column title and corresponding column
|
||||
titleidx of type 'txtidx', which contains the same information from
|
||||
text column, then search on title could be replaced by
|
||||
searching on titleidx which would be fast because of indexed access.
|
||||
|
||||
As a real life example consider database with table 'titles' containing
|
||||
titles of mailing list postings in column 'title':
|
||||
|
||||
create table titles (
|
||||
title text
|
||||
|
||||
);
|
||||
|
||||
Suppose, you already have a lot of titles and want to do full text search
|
||||
on them.
|
||||
|
||||
First, you need to install contrib/tsearch module (see INSTALLATION and USAGE).
|
||||
Add column 'titleidx' of type txtidx, containing space separated words from
|
||||
title. It's possible to use function txt2txtidx(title) to fill 'titleidx'
|
||||
column (see notice 1):
|
||||
|
||||
-- add titleidx column of type txtidx
|
||||
alter table titles add titleidx txtidx;
|
||||
update titles set titleidx=txt2txtidx(title);
|
||||
|
||||
Create index on titleidx:
|
||||
create index t_idx on titles using gist(titleidx);
|
||||
|
||||
and now you can search all titles with words 'patch' and 'gist':
|
||||
select title from titles where titleidx ## 'patch&gist';
|
||||
|
||||
Here, ## is a new operation defined for type 'txtidx' which could use index
|
||||
(if exists) built on titleidx. This operator uses morphology to
|
||||
expand query, i.e.
|
||||
## 'patches&gist' will find titles with 'patch' and 'gist' also.
|
||||
If you want to provide query as is, use operator @@ instead:
|
||||
select title from titles where titleidx @@ 'patch&gist';
|
||||
but remember, that function txt2txtidx does uses morphology, so you need
|
||||
to fill column 'titleidx' using some another way. We hope in future releases
|
||||
provide more consistent and convenient interfaces.
|
||||
|
||||
Query could contains boolean operators &,|,!,() with their usual meaning,
|
||||
for example: 'patch&gist&!cvs', 'patch|cvs'.
|
||||
Each operation ( ##, @@ ) requires appropriate query type -
|
||||
txtidx ## mquery_txt
|
||||
txtidx @@ query_txt
|
||||
|
||||
To see what query actually will be used :
|
||||
|
||||
test=# select 'patches&gist'::mquery_txt;
|
||||
mquery_txt
|
||||
------------------
|
||||
'patch' & 'gist'
|
||||
(1 row)
|
||||
|
||||
test=# select 'patches&gist'::query_txt;
|
||||
query_txt
|
||||
--------------------
|
||||
'patches' & 'gist'
|
||||
(1 row)
|
||||
|
||||
Notice the difference !
|
||||
|
||||
You could use trigger to be sure column 'titleidx' is consistent
|
||||
with any changes in column 'title':
|
||||
|
||||
create trigger txtidxupdate before update or insert on titles
|
||||
for each row execute procedure tsearch(titleidx, title);
|
||||
|
||||
This trigger uses the same parser, dictionaries as function
|
||||
txt2txtidx (see notice 1).
|
||||
Current syntax allows creating trigger for several columns
|
||||
you want to be searchable:
|
||||
|
||||
create trigger txtidxupdate before update or insert on titles
|
||||
for each row execute procedure tsearch(titleidx, title1, title2,... );
|
||||
|
||||
Use function txtidxsize(titleidx) to get the number of "words" in column
|
||||
titleidx. To get total number of words in table titles:
|
||||
|
||||
test=# select sum(txtidxsize(titleidx)) from titles;
|
||||
sum
|
||||
---------
|
||||
1917182
|
||||
(1 row)
|
||||
|
||||
NOTICES:
|
||||
|
||||
1.
|
||||
function txt2txtidx and trigger use parser, dictionaries coming with
|
||||
this contrib module on default. Parser is mostly the same as in OpenFTS and
|
||||
dictionaries are simple stemmers (sort of Lovin's stemmer which uses a
|
||||
longest match algorithm.) for english and russian languages. There is a perl
|
||||
script makedict/makedict.pl, which could be used to create specific
|
||||
dictionaries from files with endings and stop-words.
|
||||
Example files for english and russian languages are available
|
||||
from http://www.sai.msu.su/~megera/postgres/gist/tsearch/.
|
||||
Run script without parameters to see information about arguments and options.
|
||||
|
||||
Example:
|
||||
cd makedict
|
||||
./makedict.pl -l LOCALNAME -e FILEENDINGS -s FILESTOPWORD \
|
||||
-o ../dict/YOURDICT.dct
|
||||
|
||||
Another options of makedict.pl:
|
||||
-f do not execute tolower for any char
|
||||
-a function of checking stopword will be work after lemmatize,
|
||||
default is before
|
||||
|
||||
You need to edit dict.h to use your dictionary and, probably,
|
||||
morph.c to change mapdict array.
|
||||
|
||||
Don't forget to do
|
||||
make clean; make; make install
|
||||
|
||||
2.
|
||||
As it was mentioned above we don't use explicitly ID of lexems
|
||||
as in OpenFTS but use hash function (crc32) instead to map lexem to
|
||||
integer. Our experiments show that probability of collision is quite small:
|
||||
for english text it's about 10**(-6) and 10**(-5) for russian collection.
|
||||
Default installation doesn't check for collisions but if your application
|
||||
does need to guarantee an exact (no collisions) search, you need
|
||||
to update system table to mark index islossy:
|
||||
|
||||
update pg_amop set amopreqcheck = true where amopclaid =
|
||||
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
||||
|
||||
If you don't bother about collisions :
|
||||
|
||||
update pg_amop set amopreqcheck = false where amopclaid =
|
||||
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
||||
|
||||
3.
|
||||
txtidx doesn't preserve words ordering (this is not critical for searching)
|
||||
for performance reason, for example:
|
||||
|
||||
test=# select 'page two'::txtidx;
|
||||
txtidx
|
||||
--------------
|
||||
'two' 'page'
|
||||
(1 row)
|
||||
|
||||
4.
|
||||
Indexed access provided by txtidx data type isn't always good
|
||||
because of internal data structure we use (RD-Tree). Particularly,
|
||||
queries like '!gist' will be slower than just a sequential scan,
|
||||
because for such queries RD-Tree doesn't provides selectivity on internal
|
||||
nodes and all checks should be processed at leaf nodes, i.t. scan of
|
||||
full index. You may play with function query_tree to see how effective
|
||||
will be index usage:
|
||||
|
||||
test=# select querytree( 'patch&gist'::query_txt );
|
||||
querytree
|
||||
------------------
|
||||
'patch' & 'gist'
|
||||
(1 row)
|
||||
|
||||
This is an example of "good" query - index will effective for both words.
|
||||
|
||||
test=# select querytree( 'patch&!gist'::query_txt );
|
||||
querytree
|
||||
-----------
|
||||
'patch'
|
||||
(1 row)
|
||||
|
||||
This means that index is effective only to search word 'patch' and resulted
|
||||
rows will be checked against '!gist'.
|
||||
|
||||
test=# select querytree( 'patch|!gist'::query_txt );
|
||||
querytree
|
||||
-----------
|
||||
T
|
||||
(1 row)
|
||||
|
||||
test=# select querytree( '!gist'::query_txt );
|
||||
querytree
|
||||
-----------
|
||||
T
|
||||
(1 row)
|
||||
|
||||
These two queries will be processed by scanning of full index !
|
||||
Very slow !
|
||||
|
||||
5.
|
||||
Following selects produce the same result
|
||||
|
||||
select title from titles where titleidx @@ 'patch&gist';
|
||||
select title from titles where titleidx @@ 'patch' and titleidx @@ 'gist';
|
||||
|
||||
but the former will be more effective, because of internal optimization
|
||||
of query executor.
|
||||
|
||||
|
||||
TODO:
|
||||
|
||||
Better configurability (as in OpenFTS)
|
||||
User's interfaces to parser, dictionaries ...
|
||||
Write documentation
|
||||
|
||||
|
||||
BENCHMARKS:
|
||||
|
||||
We use test collection in our experiments which contains 377905
|
||||
titles from various mailing lists stored in our mailware
|
||||
project.
|
||||
|
||||
All runs were performed on IBM ThinkPad T21 notebook with
|
||||
PIII 733 Mhz, 256 RAM, 20 Gb HDD, Linux 2.2.19, postgresql 7.2.dev
|
||||
We didn't do extensive benchmarking and all
|
||||
numbers provide for illustration. Actual performance
|
||||
is strongly depends on many factors (query, collection, dictionaries
|
||||
and hardware).
|
||||
|
||||
Collection is available for download from
|
||||
http://www.sai.msu.su/~megera/postgres/gist/tsearch/
|
||||
as mw_titles.gz (about 3Mb).
|
||||
|
||||
0. install contrib/tsearch module
|
||||
1. createdb test
|
||||
2. psql test < tsearch.sql (from contrib/tsearch)
|
||||
3. zcat mw_titles.gz | psql test
|
||||
(it will creates table, copy test data and creates index)
|
||||
|
||||
Database contains one table:
|
||||
|
||||
test=# \d titles
|
||||
Table "titles"
|
||||
Column | Type | Modifiers
|
||||
----------+------------------------+-----------
|
||||
title | character varying(256) |
|
||||
titleidx | txtidx |
|
||||
Indexes: t_idx
|
||||
|
||||
Index was created as:
|
||||
create index t_idx on titles using gist(titleidx);
|
||||
(notice: this operation takes about 14 minutes on my notebook)
|
||||
|
||||
Typical select looks like:
|
||||
select title from titles where titleidx @@ 'patch&gist';
|
||||
|
||||
Total number of lexems in collection : 1917182
|
||||
|
||||
1. We trust index - we consider index is exact and no
|
||||
checking against tuples is necessary.
|
||||
|
||||
update pg_amop set amopreqcheck = false where amopclaid =
|
||||
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
||||
|
||||
using gist indices
|
||||
1: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.054s 0.00%
|
||||
2: titleidx @@ 'patch&gist' 0.020u 0.000s 0m0.045s 44.82%
|
||||
3: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.044s 0.00%
|
||||
using gist indices (morph)
|
||||
1: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.62%
|
||||
2: titleidx ## 'patch&gist' 0.010u 0.010s 0m0.046s 43.47%
|
||||
3: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
||||
disable gist index
|
||||
1: titleidx @@ 'patch&gist' 0.000u 0.010s 0m1.601s 0.62%
|
||||
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m1.607s 0.00%
|
||||
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m1.607s 0.62%
|
||||
traditional like
|
||||
1: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.206s 0.10%
|
||||
2: title ~* 'gist' and title ~* 'patch' 0.000u 0.010s 0m9.205s 0.10%
|
||||
3: title ~* 'gist' and title ~* 'patch' 0.010u 0.000s 0m9.208s 0.10%
|
||||
|
||||
2. Need to check results against tuples to avoid possible hash collision.
|
||||
|
||||
update pg_amop set amopreqcheck = true where amopclaid =
|
||||
(select oid from pg_opclass where opcname = 'gist_txtidx_ops');
|
||||
|
||||
using gist indices
|
||||
1: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.052s 19.26%
|
||||
2: titleidx @@ 'patch&gist' 0.000u 0.000s 0m0.045s 0.00%
|
||||
3: titleidx @@ 'patch&gist' 0.010u 0.000s 0m0.045s 22.39%
|
||||
using gist indices (morph)
|
||||
1: titleidx ## 'patch&gist' 0.000u 0.000s 0m0.046s 0.00%
|
||||
2: titleidx ## 'patch&gist' 0.000u 0.010s 0m0.046s 21.75%
|
||||
3: titleidx ## 'patch&gist' 0.020u 0.000s 0m0.047s 42.13%
|
||||
|
||||
There are no visible difference between these 2 cases but your
|
||||
mileage may vary.
|
Reference in New Issue
Block a user