<!-- contrib/tsearch2/docs/tsearch-V2-intro.html -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>tsearch-v2-intro</title>
|
||||
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<div class="content">
|
||||
<h2>Tsearch2 - Introduction</h2>
|
||||
|
||||
<p><a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
|
||||
[Online version]</a> of this document is available.</p>
|
||||
|
||||
<p>The tsearch2 module is available to add as an extension to
|
||||
the PostgreSQL database to allow for Full Text Indexing. This
|
||||
document is an introduction to installing, configuring, using
|
||||
and maintaining the database with the tsearch2 module
|
||||
activated.</p>
|
||||
|
||||
<p>Please note that the tsearch2 module is fully incompatible with
the old tsearch module, which is deprecated in 7.4 and will be
obsoleted in 7.5.</p>
|
||||
|
||||
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
|
||||
ENGINE</h3>
|
||||
|
||||
<p>This documentation is provided as a short guide on how to
|
||||
quickly get up and running with tsearch2 and PostgreSQL, for
|
||||
those who want to implement a full text indexed based search
|
||||
engine. It is not meant to be a complete in-depth guide into
|
||||
the full ins and outs of the contrib/tsearch2 module, and is
|
||||
primarily aimed at beginners who want to speed up searching of
|
||||
large text fields, or those migrating from other database
|
||||
systems such as MS-SQL.</p>
|
||||
|
||||
<p>The README.tsearch2 file included in the contrib/tsearch2
|
||||
directory contains a brief overview and history behind tsearch.
|
||||
This can also be found online <a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
|
||||
here]</a>.</p>
|
||||
|
||||
<p>Further in depth documentation such as a full function
|
||||
reference, and user guide can be found online at the <a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
|
||||
documentation home]</a>.</p>
|
||||
|
||||
<h3>ACKNOWLEDGEMENTS</h3>
|
||||
|
||||
<p>Robert John Shepherd originally wrote this documentation for
|
||||
the previous version of tsearch module (v1) included with the
|
||||
postgres release. I took his documentation and updated it to
|
||||
comply with the tsearch2 modifications.</p>
|
||||
|
||||
<p>Robert's original acknowledgements:</p>
|
||||
|
||||
<p>"Thanks to Oleg Bartunov for taking the time to answer many
|
||||
of my questions regarding this module, and also to Teodor
|
||||
Sigaev for clearing up the process of making your own
|
||||
dictionaries. Plus of course a big thanks to the pair of them
|
||||
for writing this module in the first place!"</p>
|
||||
|
||||
<p>I would also like to extend my thanks to the developers, and
|
||||
Oleg Bartunov for all of his direction and help with the new
|
||||
features of tsearch2.</p>
|
||||
|
||||
<h3>OVERVIEW</h3>
|
||||
|
||||
<p>MS-SQL provides a full text indexing (FTI) system which
|
||||
enables the fast searching of text based fields, very useful
|
||||
for websites (and other applications) that require a results
|
||||
set based on key words. PostgreSQL ships with a contributed
|
||||
module called tsearch2, which implements a special type of
|
||||
index that can also be used for full text indexing. Furthermore,
unlike MS' offering, which requires regular incremental
|
||||
rebuilds of the text indexes themselves, tsearch2 indexes are
|
||||
always up-to-date and keeping them so induces very little
|
||||
overhead.</p>
|
||||
|
||||
<p>Before we get into the details, it is recommended that you
|
||||
have installed and tested PostgreSQL, are reasonably familiar
|
||||
with databases, the SQL query language and also understand the
|
||||
basics of connecting to PostgreSQL from the local shell. This
|
||||
document isn't intended for the complete PostgreSQL newbie, but
|
||||
anyone with a reasonable grasp of the basics should be able to
|
||||
follow it.</p>
|
||||
|
||||
<h3>INSTALLATION</h3>
|
||||
|
||||
<p>Starting with PostgreSQL version 7.4 tsearch2 is now
|
||||
included in the contrib directory with the PostgreSQL sources.
|
||||
contrib/tsearch2 is where you will find everything needed to
|
||||
install and use tsearch2. Please note that tsearch2 will also
|
||||
work with PostgreSQL version 7.3.x, but it is not the module
|
||||
included with the source distribution. You will have to
|
||||
download the module separately and install it in the same
|
||||
fashion.</p>
|
||||
|
||||
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
from the contrib directory without squashing the original (old)
tsearch module. What I did was move the module's source
directory into the contrib tree under the name tsearchV2.</p>
|
||||
|
||||
<p>Step one is to download the tsearch V2 module:</p>
|
||||
|
||||
<p><a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
|
||||
(check Development History for latest stable version !)</p>
|
||||
<pre>
|
||||
tar -zxvf tsearch-v2.tar.gz
|
||||
mv tsearch2 PGSQL_SRC/contrib/
|
||||
cd PGSQL_SRC/contrib/tsearch2
|
||||
</pre>
|
||||
|
||||
<p>If you are installing from PostgreSQL version 7.4 or higher,
|
||||
you can skip those steps and just change to the
|
||||
contrib/tsearch2 directory in the source tree and continue from
|
||||
there.</p>
|
||||
|
||||
<p>Then continue with the regular building and installation
|
||||
process</p>
|
||||
<pre>
|
||||
gmake
|
||||
gmake install
|
||||
gmake installcheck
|
||||
</pre>
|
||||
|
||||
<p>That is pretty much all you have to do, unless of course you
|
||||
get errors. However if you get those, you better go check with
|
||||
the mailing lists over at <a href=
|
||||
"http://www.postgresql.org">http://www.postgresql.org</a> or
|
||||
<a href=
|
||||
"http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
|
||||
since it's never failed for me.</p>
|
||||
|
||||
<p>The directory in contrib/ and the directory from the
archive are both called tsearch2. Tsearch2 is completely incompatible
with the previous version of tsearch. This means that both
versions can be installed into a single database, and migration
to the new version may be much easier.</p>
|
||||
|
||||
<p>NOTE: the previous version of tsearch found in the
contrib/tsearch directory is deprecated. Although it is still
available and included within PostgreSQL version 7.4, it will
be removed in version 7.5.</p>
|
||||
|
||||
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
|
||||
|
||||
<p>We should create a database to use as an example for the
|
||||
remainder of this file. We can call the database "ftstest". You
|
||||
can create it from the command line like this:</p>
|
||||
<pre>
|
||||
#createdb ftstest
|
||||
</pre>
|
||||
|
||||
<p>If you thought installation was easy, this next bit is even
|
||||
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
|
||||
type:</p>
|
||||
<pre>
|
||||
psql ftstest < tsearch2.sql
|
||||
</pre>
|
||||
|
||||
<p>The file "tsearch2.sql" holds all the wonderful little
|
||||
goodies you need to do full text indexing. It defines numerous
|
||||
functions and operators, and creates the needed tables in the
|
||||
database. There will be 4 new tables created after running the
tsearch2.sql file: pg_ts_dict, pg_ts_parser, pg_ts_cfg, and
pg_ts_cfgmap.</p>
|
||||
|
||||
<p>You can check out the tables if you like:</p>
|
||||
<pre>
|
||||
#psql ftstest
|
||||
ftstest=# \d
|
||||
List of relations
|
||||
Schema | Name | Type | Owner
|
||||
--------+--------------+-------+----------
|
||||
public | pg_ts_cfg | table | kopciuch
|
||||
public | pg_ts_cfgmap | table | kopciuch
|
||||
public | pg_ts_dict | table | kopciuch
|
||||
public | pg_ts_parser | table | kopciuch
|
||||
(4 rows)
|
||||
</pre>
|
||||
|
||||
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
|
||||
|
||||
<p>The first thing we can do is try out some of the types that
|
||||
are provided for us. Let's look at the tsvector type provided
|
||||
for us:</p>
|
||||
<pre>
|
||||
SELECT 'Our first string used today'::tsvector;
|
||||
tsvector
|
||||
---------------------------------------
|
||||
'Our' 'used' 'first' 'today' 'string'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The results are the words used within our string. Notice
|
||||
they are not in any particular order. The tsvector type returns
|
||||
a string of space separated words.</p>
|
||||
<pre>
|
||||
SELECT 'Our first string used today first string'::tsvector;
|
||||
tsvector
|
||||
-----------------------------------------------
|
||||
'Our' 'used' 'first' 'today' 'string'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Notice the results string has each unique word ('first' and
|
||||
'string' only appear once in the tsvector value). Which of
|
||||
course makes sense if you are searching the full text ... you
|
||||
only need to know each unique word in the text.</p>
|
||||
|
||||
<p>Those examples were just casting a text field to that of
|
||||
type tsvector. Let's check out one of the new functions created
|
||||
by the tsearch2 module.</p>
|
||||
|
||||
<p>The function to_tsvector has 3 possible signatures:</p>
|
||||
<pre>
|
||||
to_tsvector(oid, text);
|
||||
to_tsvector(text, text);
|
||||
to_tsvector(text);
|
||||
</pre>
|
||||
|
||||
<p>We will use the second method using two text fields. The
|
||||
overloaded methods provide us with a way to specify the way
|
||||
the searchable text is broken up into words (Stemming process).
|
||||
Right now we will specify the 'default' configuration. See the
|
||||
section on TSEARCH2 CONFIGURATION to learn more about this.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default',
|
||||
'Our first string used today first string');
|
||||
to_tsvector
|
||||
--------------------------------------------
|
||||
'use':4 'first':2,6 'today':5 'string':3,7
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The result returned from this function is of type tsvector.
|
||||
The results came about by this reasoning: All of the words in
|
||||
the text passed in are stemmed, or not used because they are
|
||||
stop words defined in our configuration. Each lower-cased,
morphed word is returned with all of its positions in the
text.</p>
|
||||
|
||||
<p>In this case the word "Our" is a stop word in the default
configuration. That means it will not be included in the
result. The word "first" is found at positions 2 and 6
(although "Our" is a stop word, its position is maintained).
Word positions are maintained exactly as in the
original string. The word "used" is morphed to the word "use"
based on the default configuration for word stemming, and is
found at position 4. The rest of the results follow the same
logic. Just a reminder again ... the words in the output are
not listed in any particular order (i.e. 'use':4
appears first).</p>
|
||||
|
||||
<p>If you want to view the output of the tsvector fields
|
||||
without their positions, you can do so with the function
|
||||
"strip(tsvector)".</p>
|
||||
<pre>
|
||||
SELECT strip(to_tsvector('default',
|
||||
'Our first string used today first string'));
|
||||
strip
|
||||
--------------------------------
|
||||
'use' 'first' 'today' 'string'
|
||||
</pre>
|
||||
|
||||
<p>If you wish to know the number of unique words returned in
|
||||
the tsvector you can do so by using the function
|
||||
"length(tsvector)"</p>
|
||||
<pre>
|
||||
SELECT length(to_tsvector('default',
|
||||
'Our first string used today first string'));
|
||||
length
|
||||
--------
|
||||
4
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Let's take a look at the function to_tsquery. It also has 3
signatures which follow the same rationale as the to_tsvector
function:</p>
|
||||
<pre>
|
||||
to_tsquery(oid, text);
|
||||
to_tsquery(text, text);
|
||||
to_tsquery(text);
|
||||
</pre>
|
||||
|
||||
<p>Let's try using the function with a single word:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'word');
|
||||
to_tsquery
|
||||
-----------
|
||||
'word'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>I call the function the same way I would a to_tsvector
|
||||
function, specifying the 'default' configuration for morphing,
|
||||
and the result is the stemmed output 'word'.</p>
|
||||
|
||||
<p>Let's attempt to use the function with a string of multiple
|
||||
words:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'this is many words');
|
||||
ERROR: Syntax error
|
||||
</pre>
|
||||
|
||||
<p>The function can not accept a space separated string. The
|
||||
intention of the to_tsquery function is to return a type of
|
||||
"tsquery" used for searching a tsvector field. What we need to
|
||||
do is search for one to many words with some kind of logic (for
|
||||
now simple boolean).</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'searching|sentence');
|
||||
to_tsquery
|
||||
----------------------
|
||||
'search' | 'sentenc'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Notice that the words are separated by the boolean
"OR" operator. The text can contain the boolean operators &, |, ! and ()
with their usual meanings.</p>
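<p>For example, the operators can be combined and grouped with
parentheses; a small illustrative query (not one of the original
examples) might look like this:</p>
<pre>
SELECT to_tsquery('default', 'cat & (dog | !mouse)');
</pre>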
|
||||
|
||||
<p>You cannot use words defined as stop words in your
configuration. The function will not fail ... you will just get
no result, and a NOTICE like this:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'a|is&not|!the');
|
||||
NOTICE: Query contains only stopword(s)
|
||||
or doesn't contain lexem(s), ignored
|
||||
to_tsquery
|
||||
-----------
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>That is a beginning to using the types and functions
defined in the tsearch2 module. There are many more
functions that I have not touched on. You can read through the
|
||||
tsearch2.sql file built when compiling to get more familiar
|
||||
with what is included.</p>
|
||||
|
||||
<h3>INDEXING FIELDS IN A TABLE</h3>
|
||||
|
||||
<p>The next stage is to add a full text index to an existing
|
||||
table. In this example we already have a table defined as
|
||||
follows:</p>
|
||||
<pre>
|
||||
CREATE TABLE tblMessages
|
||||
(
|
||||
intIndex int4,
|
||||
strTopic varchar(100),
|
||||
strMessage text
|
||||
);
|
||||
</pre>
|
||||
|
||||
<p>We are assuming there are several rows with some kind of
data in them. Any data will do; just do several inserts with
test strings for a topic and a message. Here is some test data
I inserted (yes, I know it's completely useless stuff ;-) but
it will serve our purpose right now).</p>
|
||||
<pre>
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('1', 'Testing Topic', 'Testing message data input');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('3', 'Famous Author', 'Stephen King');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('4', 'Political Topic',
|
||||
'Nelson Mandella is released from prison');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('5', 'Nursery rhyme phrase',
|
||||
'Little jack horner sat in a corner');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('6', 'Gettysburg address quotation',
|
||||
'Four score and seven years ago'
|
||||
' our fathers brought forth on this'
|
||||
' continent a new nation, conceived in'
|
||||
' liberty and dedicated to the proposition'
|
||||
' that all men are created equal');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('7', 'Classic Rock Bands',
|
||||
'Led Zeppelin Grateful Dead and The Sex Pistols');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('8', 'My birth address',
|
||||
'18 Sommervile road, Regina, Saskatchewan');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('9', 'Joke', 'knock knock : who\'s there?'
|
||||
' I will not finish this joke');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('10', 'Computer information',
|
||||
'My computer is a pentium III 400 mHz'
|
||||
' with 192 megabytes of RAM');
|
||||
</pre>
|
||||
|
||||
<p>The next stage is to create a special text index which we
|
||||
will use for FTI, so we can search our table of messages for
|
||||
words or a phrase. We do this using the SQL command:</p>
|
||||
<pre>
|
||||
ALTER TABLE tblMessages ADD idxFTI tsvector;
|
||||
</pre>
|
||||
|
||||
<p>Note that unlike traditional indexes, this is actually a new
|
||||
field in the same table, which is then used (through the magic
|
||||
of the tsearch2 operators and functions) by a special index we
|
||||
will create in a moment.</p>
|
||||
|
||||
<p>The general rule for the initial insertion of data will
|
||||
follow four steps:</p>
|
||||
<pre>
|
||||
1. update table
|
||||
2. vacuum full analyze
|
||||
3. create index
|
||||
4. vacuum full analyze
|
||||
</pre>
|
||||
|
||||
<p>The data can be updated into the table; the vacuum full
analyze will reclaim unused space. The index can be created on
the table after the data has been inserted. Having the index
created prior to the update will slow down the process. It can
be done in that manner; this way is just more efficient. After
the index has been created on the table, vacuum full analyze is
run again to update postgres's statistics (i.e. to have the index
take effect).</p>
|
||||
<pre>
|
||||
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p>Note that this only inserts the field strMessage as a
|
||||
tsvector, so if you want to also add strTopic to the
|
||||
information stored, you should instead do the following, which
|
||||
effectively concatenates the two fields into one before being
|
||||
inserted into the table:</p>
|
||||
<pre>
|
||||
UPDATE tblMessages
|
||||
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p><strong>Using the coalesce function makes sure this
|
||||
concatenation also works with NULL fields.</strong></p>
|
||||
|
||||
<p>We need to create the index on the column idxFTI. Keep in
mind that the database will update the index when some action
is taken. In this case we _need_ the index (the whole point of
Full Text INDEXING ;-)), so don't worry about any indexing
overhead. We will create an index using gist.
GiST stands for Generalized Search Tree.</p>
|
||||
<pre>
|
||||
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p>After you have converted all of your data and indexed the
|
||||
column, you can select some rows to see what actually happened.
|
||||
I will not display output here but you can play around
|
||||
yourselves and see what happened.</p>
|
||||
|
||||
<p>The last thing to do is set up a trigger so every time a row
|
||||
in this table is changed, the text index is automatically
|
||||
updated. This is easily done using:</p>
|
||||
<pre>
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
|
||||
</pre>
|
||||
|
||||
<p>Or if you are indexing both strMessage and strTopic you
|
||||
should instead do:</p>
|
||||
<pre>
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE
|
||||
tsearch2(idxFTI, strTopic, strMessage);
|
||||
</pre>
|
||||
|
||||
<p>Before you ask, the tsearch2 function accepts multiple
|
||||
fields as arguments so there is no need to concatenate the two
|
||||
into one like we did before.</p>
|
||||
|
||||
<p>If you want to do something specific with columns, you may
|
||||
write your very own trigger function using plpgsql or other
|
||||
procedural languages (but not SQL, unfortunately) and use it
instead of the <em>tsearch2</em> trigger.</p>
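<p>As a rough sketch (this helper is not part of tsearch2; the
function name is made up for this example, and it assumes the
plpgsql language has been installed in the database, e.g. with
createlang), such a hand-written trigger function might look
like this:</p>
<pre>
CREATE FUNCTION tblmessages_fti_trigger() RETURNS trigger AS '
BEGIN
    -- rebuild the tsvector column from topic and message,
    -- guarding against NULLs with coalesce
    NEW.idxFTI := to_tsvector(''default'',
        coalesce(NEW.strTopic, '''') || '' '' || coalesce(NEW.strMessage, ''''));
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

-- (drop the existing tsvectorupdate trigger first if it is already defined)
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
    FOR EACH ROW EXECUTE PROCEDURE tblmessages_fti_trigger();
</pre>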
|
||||
|
||||
<p>You could, however, call other stored procedures from within
the tsearch2 function. Let's say we want to create a function to
remove certain characters (like the @ symbol) from all
text.</p>
|
||||
<pre>
|
||||
CREATE FUNCTION dropatsymbol(text)
|
||||
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
|
||||
</pre>
|
||||
|
||||
<p>Now we can use this function within the tsearch2 function on
|
||||
the trigger.</p>
|
||||
<pre>
|
||||
DROP TRIGGER tsvectorupdate ON tblmessages;
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
|
||||
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
|
||||
</pre>
|
||||
|
||||
<p>If at this point you receive an error stating: ERROR: Can't
|
||||
find tsearch config by locale</p>
|
||||
|
||||
<p>Do not worry. You have done nothing wrong. And tsearch2 is
|
||||
not broken. All that has happened here is that tsearch2
is set up to use a configuration based on the
locale of the server. All you have to do is change your default
|
||||
configuration, or add a new one for your specific locale. See
|
||||
the section on TSEARCH2 CONFIGURATION.</p>
|
||||
<pre class="real">
|
||||
SELECT * FROM tblmessages WHERE intindex = 69;
|
||||
|
||||
intindex | strtopic | strmessage | idxfti
|
||||
----------+--------------------------+---------------+-----------------------
|
||||
69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
|
||||
(1 row)
|
||||
</pre>

<p>Notice that the string content was passed through the stored
procedure dropatsymbol. The '@' character was replaced with a
single space ... and the output from the procedure was then stored
in the tsvector column.</p>
|
||||
|
||||
<p>This could be useful for removing other characters from
|
||||
indexed text, or any kind of preprocessing needed to be done on
|
||||
the text prior to insertion into the index.</p>
|
||||
|
||||
<h3>QUERYING A TABLE</h3>
|
||||
|
||||
<p>There are some examples in the README.tsearch2 file for
|
||||
querying a table. One major difference between tsearch and
|
||||
tsearch2 is the operator ## is no longer available. Only the
|
||||
operator @@ is defined, using the types tsvector on one side
|
||||
and tsquery on the other side.</p>
|
||||
|
||||
<p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of strTopic and
strMessage:</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ 'test'::tsquery;
|
||||
intindex | strtopic
|
||||
----------+---------------
|
||||
1 | Testing Topic
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The only result that matched was the row with a topic
|
||||
"Testing Topic". Notice that the word I search for was all
|
||||
lowercase. Let's see what happens when I query for uppercase
|
||||
"Test".</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ 'Test'::tsquery;
|
||||
intindex | strtopic
|
||||
----------+----------
|
||||
(0 rows)
|
||||
</pre>
|
||||
|
||||
<p>We get zero rows returned. The reason is that when the
text was inserted, it was morphed to my default configuration
(because of the call to to_tsvector in the UPDATE statement).
If there had been no morphing done, and the tsvector field(s)
contained the word 'Test', a match would have been found.</p>
|
||||
|
||||
<p>Most likely the best way to query the field is to use the
|
||||
to_tsquery function on the right hand side of the @@ operator
|
||||
like this:</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
|
||||
intindex | strtopic
|
||||
----------+--------------------
|
||||
1 | Testing Topic
|
||||
7 | Classic Rock Bands
|
||||
(2 rows)
|
||||
</pre>
|
||||
|
||||
<p>That query searched for all instances of "Test" OR
|
||||
"Zeppelin". It returned two rows: the "Testing Topic" row, and
|
||||
the "Classic Rock Bands" row. The to_tsquery function performed
|
||||
the correct morphology upon the parameters, and searched the
|
||||
tsvector field appropriately.</p>
|
||||
|
||||
<p>The last example here relates to searching for a phrase, for
|
||||
example "minority report". This poses a problem with regard to
|
||||
tsearch2, as it doesn't index phrases, only words. But there is
a way around this which doesn't appear to have a significant impact
on query time, and that is to use a query such as the
|
||||
following:</p>
|
||||
<pre>
|
||||
SELECT intindex, strTopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
||||
AND strMessage ~* '.*men are created equal.*';
|
||||
intindex | strtopic
|
||||
----------+------------------------------
|
||||
6 | Gettysburg address quotation
|
||||
(1 row)
|
||||
SELECT intindex, strTopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
||||
AND strMessage ~* '.*something that does not exist.*';
|
||||
intindex | strtopic
|
||||
----------+----------
|
||||
(0 rows)
|
||||
</pre>
|
||||
|
||||
<p>Of course if you're indexing both strTopic and strMessage, and
want to search for this phrase on both, then you will have to
get out the brackets and extend this query a little more.</p>
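<p>A sketch of such an extended query (using the same sample data
and phrase as above, and simply checking the regular expression
against both columns) might look like this:</p>
<pre>
SELECT intindex, strTopic FROM tblmessages
    WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
    AND (strTopic ~* '.*men are created equal.*'
         OR strMessage ~* '.*men are created equal.*');
</pre>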
|
||||
|
||||
<h3>TSEARCH2 CONFIGURATION</h3>
|
||||
|
||||
<p>Some words such as "and", "the", and "who" are automatically
|
||||
not indexed, since they belong to a pre-existing dictionary of
|
||||
"Stop Words" which tsearch2 does not perform indexing on. If
|
||||
someone needs to search for "The Who" in your database, they
|
||||
are going to have a tough time coming up with any results,
|
||||
since both are ignored in the indexes. But there is a
|
||||
solution.</p>
|
||||
|
||||
<p>Let's say we want to add a word to the stop word list for
English stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the
list. I edited mine to exclude my name from indexing:</p>
|
||||
<pre>
|
||||
- Edit /usr/local/pgsql/share/english.stop
|
||||
- Add 'andy' to the list
|
||||
- Save the file.
|
||||
</pre>
|
||||
|
||||
<p>When you connect to the database, the dict_init procedure is
|
||||
run during initialization. And in my configuration it will read
|
||||
the stop words from the file I just edited. If you were
|
||||
connected to the DB while editing the stop words, you will need
|
||||
to end the current session and re-connect. When you re-connect
|
||||
to the database, 'andy' is no longer indexed:</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Andy');
|
||||
to_tsvector
|
||||
------------
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Originally I would get the result :</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Andy');
|
||||
to_tsvector
|
||||
------------
|
||||
'andi':1
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>But since I added it as a stop word, it is ignored during
indexing. The stop word was added for the dictionary
"en_stem". If I were to use a different configuration such as
'simple', the results would be different. There are no stop
words for the simple dictionary. It will just convert to lower
case, and index every unique word.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('simple', 'Andy andy The the in out');
|
||||
to_tsvector
|
||||
-------------------------------------
|
||||
'in':5 'out':6 'the':3,4 'andy':1,2
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>All this talk about which configuration to use is leading us
|
||||
into the actual configuration of tsearch2. In the examples in
|
||||
this document the configuration has always been specified when
|
||||
using the tsearch2 functions:</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Testing the default config');
|
||||
SELECT to_tsvector('simple', 'Example of simple Config');
|
||||
</pre>
|
||||
|
||||
<p>The pg_ts_cfg table holds each configuration you can use
with the tsearch2 functions. As you can see, the ts_name column
contains the locale-based 'default' configurations as well as
the 'simple' configuration, which is not based on
any locale.</p>
|
||||
<pre>
|
||||
SELECT * from pg_ts_cfg;
|
||||
ts_name | prs_name | locale
|
||||
-----------------+----------+--------------
|
||||
default | default | C
|
||||
default_russian | default | ru_RU.KOI8-R
|
||||
simple | default |
|
||||
(3 rows)
|
||||
</pre>
|
||||
|
||||
<p>Each row in the pg_ts_cfg table contains the name of the
|
||||
tsearch2 configuration, the name of the parser to use, and the
|
||||
locale mapped to the configuration. There is only one parser to
|
||||
choose from the table pg_ts_parser called 'default'. More
|
||||
parsers could be written, but for our needs we will use the
|
||||
default.</p>
|
||||
|
||||
<p>There are 3 configurations installed by tsearch2 initially.
|
||||
If your locale is set to 'en_US' for example (like my laptop),
then as you can see there is currently no configuration mapped
to that locale. You can either set up a new
|
||||
configuration or just use one that already exists. If I do not
|
||||
specify which configuration to use in the to_tsvector function,
|
||||
I receive the following error.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('learning tsearch is like going to school');
|
||||
ERROR: Can't find tsearch config by locale
|
||||
</pre>
|
||||
|
||||
<p>We will create a new configuration for use with the server
locale 'en_US'. The first step is to add a new configuration
|
||||
into the pg_ts_cfg table. We will call the configuration
|
||||
'default_english', with the default parser and use the locale
|
||||
'en_US'.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
|
||||
VALUES ('default_english', 'default', 'en_US');
|
||||
</pre>
|
||||
|
||||
<p>We have only declared that there is a configuration called
|
||||
'default_english'. We need to set the configuration of how
|
||||
'default_english' will work. The next step is creating a new
|
||||
dictionary to use. The configuration of the dictionary is
completely different in tsearch2. In prior versions, to
make changes you would have to re-compile your changes into
the tsearch.so. All of the configuration has now been moved
into the system tables created by executing the SQL code from
tsearch2.sql.</p>
|
||||
|
||||
<p>Let's take a first look at the pg_ts_dict table:</p>
|
||||
<pre>
|
||||
ftstest=# \d pg_ts_dict
|
||||
Table "public.pg_ts_dict"
|
||||
Column | Type | Modifiers
|
||||
-----------------+---------+-----------
|
||||
dict_name | text | not null
|
||||
dict_init | oid |
|
||||
dict_initoption | text |
|
||||
dict_lemmatize | oid | not null
|
||||
dict_comment | text |
|
||||
Indexes: pg_ts_dict_idx unique btree (dict_name)
|
||||
</pre>
|
||||
|
||||
<p>The dict_name column is the name of the dictionary, for
|
||||
example 'simple', 'en_stem' or 'ru_stem'. The dict_init column
|
||||
is an OID of a stored procedure to run for initialization of
|
||||
that dictionary, for example 'snb_en_init' or 'snb_ru_init'.
|
||||
The dict_initoption column holds options passed to the init
function for the stored procedure. In the cases of 'en_stem' or
'ru_stem' it is a path to a stopword file for that dictionary,
for example '/usr/local/pgsql/share/english.stop'. This is
however dictated by the dictionary; ISpell dictionaries may
require different options. The dict_lemmatize column is the
OID of the stored procedure used to lemmatize,
for example 'snb_lemmatize'. The dict_comment column is just a
comment.</p>
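<p>You can inspect these entries yourself, for example (the exact
paths and OIDs will of course differ on your installation):</p>
<pre>
SELECT dict_name, dict_initoption
    FROM pg_ts_dict
    WHERE dict_name IN ('simple', 'en_stem', 'ru_stem');
</pre>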
|
||||
|
||||
<p>Next we will configure the use of a new dictionary based on
ISpell. We will assume you have ISpell installed on your
machine (in /usr/local/lib).</p>
|
||||
|
||||
<p>First let's register the dictionary(ies) to use from ISpell.
We will use the English dictionary from ISpell. We insert the
paths to the relevant ISpell dictionary (*.hash) and affixes
(*.aff) files. There seems to be some question as to which
ISpell files are to be used. I installed ISpell from the latest
sources on my computer. The installation installed the
dictionary files with an extension of *.hash. Some
installations install with an extension of *.dict. As far as I
know the two extensions are equivalent, so *.hash ==
*.dict.</p>
|
||||
|
||||
<p>We will also continue to use the English stop word file that
was installed for the en_stem dictionary. You could use a
|
||||
different one if you like. The ISpell configuration is based on
|
||||
the "ispell_template" dictionary installed by default with
|
||||
tsearch2. We will use the OIDs to the stored procedures from
|
||||
the row where the dict_name = 'ispell_template'.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_dict
|
||||
(SELECT 'en_ispell',
|
||||
dict_init,
|
||||
'DictFile="/usr/local/lib/english.hash",'
|
||||
'AffFile="/usr/local/lib/english.aff",'
|
||||
'StopFile="/usr/local/pgsql/share/english.stop"',
|
||||
dict_lexize
|
||||
FROM pg_ts_dict
|
||||
WHERE dict_name = 'ispell_template');
|
||||
</pre>
|
||||
|
||||
<p>Next we need to set up the configuration for mapping
dictionary use to the lexeme types produced by the parser. This will
be done by altering the pg_ts_cfgmap table. We will insert several rows,
specifying the use of the new dictionary we installed and
configured for use within tsearch2. There are several types of
lexemes for which we want to force the use of the ISpell
dictionary.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
|
||||
</pre>
|
||||
|
||||
<p>We have just inserted 3 records into the configuration
mapping, specifying that the lexeme types "lhword,
lpart_hword and lword" are to be stemmed using the 'en_ispell'
dictionary we added into pg_ts_dict, when using the
configuration 'default_english' which we added to
pg_ts_cfg.</p>
|
||||
|
||||
<p>There are several other lexeme types used that we do not need
to map to the ISpell dictionary. We can simply insert
values using the 'simple' dictionary.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'url', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'host', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'sfloat', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'uri', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'int', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'float', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'email', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'word', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlpart_hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'part_hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlhword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'file', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'uint', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'version', '{simple}');
|
||||
</pre>
|
||||
|
||||
<p>Our addition of a configuration for 'default_english' is now
|
||||
complete. We have successfully created a new tsearch2
|
||||
configuration. At the same time we have also set the new
|
||||
configuration to be our default for the en_US locale.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default_english',
|
||||
'learning tsearch is like going to school');
|
||||
to_tsvector
|
||||
--------------------------------------------------
|
||||
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
||||
SELECT to_tsvector('learning tsearch is like going to school');
|
||||
to_tsvector
|
||||
--------------------------------------------------
|
||||
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>If you already have a configuration set for
the locale, and you are changing it to your new dictionary
configuration, you will have to set the old locale to NULL. If
we are using the 'C' locale then we would do this:</p>
|
||||
<pre>
|
||||
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
|
||||
</pre>
|
||||
|
||||
<p>That about wraps up the configuration of tsearch2. There is
|
||||
much more you can do with the tables provided. This was just an
|
||||
introduction to get things working rather quickly.</p>
|
||||
|
||||
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
|
||||
|
||||
<p>To aid in the addition of new dictionaries to the tsearch2
|
||||
module you can use another additional module in combination
|
||||
with tsearch2. The gendict module is included in the tsearch2
distribution and is available in the gendict/ subdirectory.</p>
|
||||
|
||||
<p>I will not go into detail about installation and
|
||||
instructions on how to use gendict to its fullest extent right
|
||||
now. You can read the README.gendict ... it has all of the
|
||||
instructions and information you will need.</p>
|
||||
|
||||
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE
|
||||
TSEARCH2</h3>
|
||||
|
||||
<p>Believe it or not, this isn't as straightforward as it
should be, and you will have problems trying to back up and
restore any database which uses tsearch2 unless you take the
steps shown below. And before you ask, using pg_dumpall will
result in failure every time. These took a lot of trial and
|
||||
error to get working, but the process as laid down below has
|
||||
been used a dozen times now in live production environments so
|
||||
it should work fine.</p>
|
||||
|
||||
<p>HOWEVER never rely on anyone else's instructions to back up
and restore a database system; always develop and understand
|
||||
your own methodology, and test it numerous times before you
|
||||
need to do it for real.</p>
|
||||
|
||||
<p>To Backup a PostgreSQL database that uses the tsearch2
|
||||
module:</p>
|
||||
|
||||
<p>1) Backup any global database objects such as users and
|
||||
groups (this step is usually only necessary when you will be
|
||||
restoring to a virgin system)</p>
|
||||
<pre>
|
||||
pg_dumpall -g > GLOBALobjects.sql
|
||||
</pre>
|
||||
|
||||
<p>2) Backup the full database schema using pg_dump</p>
|
||||
<pre>
|
||||
pg_dump -s DATABASE > DATABASEschema.sql
|
||||
</pre>
|
||||
|
||||
<p>3) Backup the full database using pg_dump</p>
|
||||
<pre>
|
||||
pg_dump -Fc DATABASE > DATABASEdata.tar
|
||||
</pre>
|
||||
|
||||
<p>To Restore a PostgreSQL database that uses the tsearch2
|
||||
module:</p>
|
||||
|
||||
<p>1) Create the blank database</p>
|
||||
<pre>
|
||||
createdb DATABASE
|
||||
</pre>
|
||||
|
||||
<p>2) Restore any global database objects such as users and
|
||||
groups (this step is usually only necessary when you will be
|
||||
restoring to a virgin system)</p>
|
||||
<pre>
|
||||
psql DATABASE < GLOBALobjects.sql
|
||||
</pre>
|
||||
|
||||
<p>3) Create the tsearch2 objects, functions and operators</p>
|
||||
<pre>
|
||||
psql DATABASE < tsearch2.sql
|
||||
</pre>
|
||||
|
||||
<p>4) Edit the backed up database schema and delete all SQL
|
||||
commands which create tsearch2 related functions, operators and
|
||||
data types, BUT NOT fields in table definitions that specify
|
||||
tsvector types. If you're not sure what these are, they are the
|
||||
ones listed in tsearch2.sql. Then restore the edited schema to
|
||||
the database</p>
|
||||
<pre>
|
||||
psql DATABASE < DATABASEschema.sql
|
||||
</pre>
|
||||
|
||||
<p>5) Restore the data for the database</p>
|
||||
<pre>
|
||||
pg_restore -N -a -d DATABASE DATABASEdata.tar
|
||||
</pre>
|
||||
|
||||
<p>If you get any errors in step 4, it will most likely be
|
||||
because you forgot to remove an object that was created in
|
||||
tsearch2.sql. Any errors in step 5 will mean the database
|
||||
schema was probably restored wrongly.</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
<!-- contrib/tsearch2/docs/tsearch2-guide.html: diff suppressed (file too large) -->
<!-- contrib/tsearch2/docs/tsearch2-ref.html -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
<html>
|
||||
<head>
|
||||
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
||||
<title>tsearch2 reference</title>
|
||||
</head>
|
||||
<body>
|
||||
<h1 align=center>The tsearch2 Reference</h1>
|
||||
|
||||
<p align=center>
|
||||
Brandon Craig Rhodes<br>30 June 2003
|
||||
<p>
|
||||
This Reference documents the user types and functions
|
||||
of the tsearch2 module for PostgreSQL.
|
||||
An introduction to the module is provided
|
||||
by the <a href="tsearch2-guide.html">tsearch2 Guide</a>,
|
||||
a companion document to this one.
|
||||
You can retrieve a beta copy of the tsearch2 module from the
|
||||
<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
|
||||
page — look under the section entitled <i>Development History</i>
|
||||
for the current version.
|
||||
|
||||
<h2><a name="vq">Vectors and Queries</h2>
|
||||
|
||||
Vectors and queries both store lexemes,
|
||||
but for different purposes.
|
||||
A <tt>tsvector</tt> stores the lexemes
|
||||
of the words that are parsed out of a document,
|
||||
and can also remember the position of each word.
|
||||
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
|
||||
<p>
|
||||
Any of the following functions with a <tt><i>configuration</i></tt> argument
|
||||
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
|
||||
to select a configuration;
|
||||
if the option is omitted, then the current configuration is used.
|
||||
For more information on the current configuration,
|
||||
read the next section on Configurations.
|
||||
|
||||
<h3>Vector Operations</h3>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
|
||||
<i>document</i> TEXT) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Parses a document into tokens,
|
||||
reduces the tokens to lexemes,
|
||||
and returns a <tt>tsvector</tt> which lists the lexemes
|
||||
together with their positions in the document.
|
||||
For the best description of this process,
|
||||
see the section on <a href="tsearch2-guide.html#ps">Parsing and Stemming</a>
|
||||
in the accompanying tsearch2 Guide.
|
||||
<dt>
|
||||
<tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Return a vector which lists the same lexemes
|
||||
as the given <tt><i>vector</i></tt>,
|
||||
but which lacks any information
|
||||
about where in the document each lexeme appeared.
|
||||
While the returned vector is thus useless for relevance ranking,
|
||||
it will usually be much smaller.
|
||||
<dt>
|
||||
<tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
This function returns a copy of the input vector
|
||||
in which every location has been labelled
|
||||
with either the <tt><i>letter</i></tt>
|
||||
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
|
||||
or the default label <tt>'D'</tt>
|
||||
(which is the default with which new vectors are created,
|
||||
and as such is usually not displayed).
|
||||
These labels are retained when vectors are concatenated,
|
||||
allowing words from different parts of a document
|
||||
to be weighted differently by ranking functions.
|
||||
<dt>
|
||||
<tt><i>vector1</i> || <i>vector2</i></tt>
|
||||
<dt class=br>
|
||||
<tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
|
||||
RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Returns a vector which combines the lexemes and position information
|
||||
in the two vectors given as arguments.
|
||||
Position weight labels (described in the previous paragraph)
|
||||
are retained intact during the concatenation.
|
||||
This has at least two uses.
|
||||
First,
|
||||
if some sections of your document
|
||||
need be parsed with different configurations than others,
|
||||
you can parse them separately
|
||||
and concatenate the resulting vectors into one.
|
||||
Second,
|
||||
you can weight words from some sections of your document
|
||||
more heavily than those from others by:
|
||||
parsing the sections into separate vectors;
|
||||
assigning the vectors different position labels
|
||||
with the <tt>setweight()</tt> function;
|
||||
concatenating them into a single vector;
|
||||
and then providing a <tt><i>weights</i></tt> argument
|
||||
to the <tt>rank()</tt> function
|
||||
that assigns different weights to positions with different labels.
|
||||
<dt>
|
||||
<tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
|
||||
<dd>
|
||||
Returns the number of lexemes stored in the vector.
|
||||
<dt>
|
||||
<tt><i>text</i>::tsvector RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Directly casting text to a <tt>tsvector</tt>
|
||||
allows you to directly inject lexemes into a vector,
|
||||
with whatever positions and position weights you choose to specify.
|
||||
The <tt><i>text</i></tt> should be formatted
|
||||
like the vector would be printed by the output of a <tt>SELECT</tt>.
|
||||
See the <a href="tsearch2-guide.html#casting">Casting</a>
|
||||
section in the Guide for details.
|
||||
</dl>
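<p>
A small illustration of these vector operations (an example of our
own, not taken from the Guide; it assumes the 'default' configuration
is installed):
<pre>
SELECT setweight(to_tsvector('default', 'fat cats ate rats'), 'A') ||
       strip(to_tsvector('default', 'thin rats hid'));
</pre>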
|
||||
|
||||
<h3>Query Operations</h3>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
|
||||
<i>querytext</i> text) RETURNS tsquery</tt>
|
||||
<dd>
|
||||
Parses a query,
|
||||
which should be single words separated by the boolean operators
|
||||
“<tt>&</tt>” and,
|
||||
“<tt>|</tt>” or,
|
||||
and “<tt>!</tt>” not,
|
||||
which can be grouped using parentheses.
|
||||
Each word is reduced to a lexeme using the current
|
||||
or specified configuration.
|
||||
|
||||
<dt>
|
||||
<tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
|
||||
<dd>
|
||||
This might return a textual representation of the given query.
|
||||
<dt>
|
||||
<tt><i>text</i>::tsquery RETURNS tsquery</tt>
|
||||
<dd>
|
||||
Directly casting text to a <tt>tsquery</tt>
|
||||
allows you to directly inject lexemes into a query,
|
||||
with whatever positions and position weight flags you choose to specify.
|
||||
The <tt><i>text</i></tt> should be formatted
|
||||
like the query would be printed by the output of a <tt>SELECT</tt>.
|
||||
See the <a href="tsearch2-guide.html#casting">Casting</a>
|
||||
section in the Guide for details.
|
||||
</dl>
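<p>
For example, assuming the 'default' configuration exists, a query
value can be built and examined like this (an illustration of our own):
<pre>
SELECT to_tsquery('default', 'cats & (rats | !mice)');
SELECT querytree(to_tsquery('default', 'cats & rats'));
</pre>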
|
||||
|
||||
<h2><a name="configurations">Configurations</a></h2>
|
||||
|
||||
A configuration specifies all of the equipment necessary
|
||||
to transform a document into a <tt>tsvector</tt>:
|
||||
the parser that breaks its text into tokens,
|
||||
and the dictionaries which then transform each token into a lexeme.
|
||||
Every call to <tt>to_tsvector()</tt> (described above)
|
||||
uses a configuration to perform its processing.
|
||||
Three configurations come with tsearch2:
|
||||
|
||||
<ul>
|
||||
<li><b>default</b> — Indexes words and numbers,
|
||||
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
||||
and the <i>simple</i> dictionary for all others.
|
||||
<li><b>default_russian</b> — Indexes words and numbers,
|
||||
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
||||
and the <i>ru_stem</i> Russian Snowball dictionary for all others.
|
||||
<li><b>simple</b> — Processes both words and numbers
|
||||
with the <i>simple</i> dictionary,
|
||||
which neither discards any stop words nor alters them.
|
||||
</ul>
|
||||
|
||||
The tsearch2 module initially chooses your current configuration
|
||||
by looking for your current locale in the <tt>locale</tt> field
|
||||
of the <tt>pg_ts_cfg</tt> table described below.
|
||||
You can manipulate the current configuration yourself with these functions:
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
|
||||
) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Set the current configuration used by <tt>to_tsvector</tt>
|
||||
and <tt>to_tsquery</tt>.
|
||||
<dt>
|
||||
<tt>show_curcfg() RETURNS INT4</tt>
|
||||
<dd>
|
||||
Returns the integer <tt>id</tt> of the current configuration.
|
||||
</dl>
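<p>
For example (assuming the configurations listed above are installed):
<pre>
SELECT set_curcfg('default');
SELECT show_curcfg();
</pre>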
|
||||
|
||||
<p>
|
||||
Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:
|
||||
|
||||
<pre>create table pg_ts_cfg (
|
||||
id int not null primary key,
|
||||
ts_name text not null,
|
||||
prs_name text not null,
|
||||
locale text
|
||||
);</pre>
|
||||
|
||||
The <tt>id</tt> and <tt>ts_name</tt> are unique values
|
||||
which identify the configuration;
|
||||
the <tt>prs_name</tt> specifies which parser the configuration uses.
|
||||
Once this parser has split document text into tokens,
|
||||
the type of each resulting token —
|
||||
or, more specifically, the type's <tt>lex_alias</tt>
|
||||
as specified in the parser's <tt>lexem_type()</tt> table —
|
||||
is searched for together with the configuration's <tt>ts_name</tt>
|
||||
in the <tt>pg_ts_cfgmap</tt> table:
|
||||
|
||||
<pre>create table pg_ts_cfgmap (
|
||||
ts_name text not null,
|
||||
lex_alias text not null,
|
||||
dict_name text[],
|
||||
primary key (ts_name,lex_alias)
|
||||
);</pre>
|
||||
|
||||
Those tokens whose types are not listed are discarded.
|
||||
The remaining tokens are assigned integer positions,
|
||||
starting with 1 for the first token in the document,
|
||||
and turned into lexemes with the help of the dictionaries
|
||||
whose names are given in the <tt>dict_name</tt> array for their type.
|
||||
These dictionaries are tried in order,
|
||||
stopping either with the first one to return a lexeme for the token,
|
||||
or discarding the token if no dictionary returns a lexeme for it.
|
||||
|
||||
<h2><a name="dictionaries">Parsers</a></h2>
|
||||
|
||||
Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:
|
||||
|
||||
<pre>create table pg_ts_parser (
|
||||
prs_id int not null primary key,
|
||||
prs_name text not null,
|
||||
prs_start oid not null,
|
||||
prs_getlexem oid not null,
|
||||
prs_end oid not null,
|
||||
prs_headline oid not null,
|
||||
prs_lextype oid not null,
|
||||
prs_comment text
|
||||
);</pre>
|
||||
|
||||
The <tt>prs_id</tt> and <tt>prs_name</tt> uniquely identify the parser,
|
||||
while <tt>prs_comment</tt> usually describes its name and version
|
||||
for the reference of users.
|
||||
The other items identify the low-level functions
|
||||
which make the parser operate,
|
||||
and are only of interest to someone writing a parser of their own.
|
||||
<p>
|
||||
The tsearch2 module comes with one parser named <tt>default</tt>
|
||||
which is suitable for parsing most plain text and HTML documents.
|
||||
<p>
|
||||
Each <tt><i>parser</i></tt> argument below
|
||||
must designate a parser with either an integer <tt><i>prs_id</i></tt>
|
||||
or a textual <tt><i>prs_name</i></tt>;
|
||||
the current parser is used when this argument is omitted.
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Selects a current parser
|
||||
which will be used when any of the following functions
|
||||
are called without a parser as an argument.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION lexem_type(
|
||||
<em>[</em> <i>parser</i> <em>]</em>
|
||||
) RETURNS SETOF lexemtype</tt>
|
||||
<dd>
|
||||
Returns a table which defines and describes
|
||||
each kind of token the parser may produce as output.
|
||||
For each token type the table gives the <tt>lexid</tt>
|
||||
which the parser will label each token of that type,
|
||||
the <tt>alias</tt> which names the token type,
|
||||
and a short description <tt>descr</tt> for the user to read.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION parse(
|
||||
<em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
|
||||
) RETURNS SETOF lexemtype</tt>
|
||||
<dd>
|
||||
Parses the given document and returns a series of records,
|
||||
one for each token produced by parsing.
|
||||
Each token includes a <tt>lexid</tt> giving its type
|
||||
and a <tt>lexem</tt> which gives its content.
|
||||
</dl>
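<p>
A brief illustration of the parser functions (an example of our own;
output is omitted, since the token numbering depends on the parser
version):
<pre>
SELECT * FROM lexem_type('default');
SELECT * FROM parse('default', 'A simple sentence to parse.');
</pre>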
|
||||
|
||||
<h2><a name="dictionaries">Dictionaries</a></h2>
|
||||
|
||||
Dictionaries take textual tokens as input,
|
||||
usually those produced by a parser,
|
||||
and return lexemes which are usually some reduced form of the token.
|
||||
Among the dictionaries which come installed with tsearch2 are:
|
||||
|
||||
<ul>
|
||||
<li><b>simple</b> simply folds uppercase letters to lowercase
|
||||
before returning the word.
|
||||
<li><b>en_stem</b> runs an English Snowball stemmer on each word
|
||||
that attempts to reduce the various forms of a verb or noun
|
||||
to a single recognizable form.
|
||||
<li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
|
||||
</ul>
|
||||
|
||||
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
|
||||
|
||||
<pre>CREATE TABLE pg_ts_dict (
|
||||
dict_id int not null primary key,
|
||||
dict_name text not null,
|
||||
dict_init oid,
|
||||
dict_initoption text,
|
||||
dict_lemmatize oid not null,
|
||||
dict_comment text
|
||||
);</pre>
|
||||
|
||||
The <tt>dict_id</tt> and <tt>dict_name</tt>
|
||||
serve as unique identifiers for the dictionary.
|
||||
The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
|
||||
but for the built-in Snowball dictionaries
|
||||
it specifies a file from which stop words should be read.
|
||||
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
|
||||
The other fields are internal function identifiers
|
||||
useful only to developers trying to implement their own dictionaries.
|
||||
<p>
|
||||
The argument named <tt><i>dictionary</i></tt>
|
||||
in each of the following functions
|
||||
should be either an integer <tt>dict_id</tt> or a textual <tt>dict_name</tt>
|
||||
identifying which dictionary should be used for the operation;
|
||||
if omitted then the current dictionary is used.
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Selects a current dictionary for use by functions
|
||||
that do not select a dictionary explicitly.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION lexize(
|
||||
<em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
|
||||
RETURNS TEXT[]</tt>
|
||||
<dd>
|
||||
Reduces a single word to a lexeme.
|
||||
Note that lexemes are arrays of zero or more strings,
|
||||
since in some languages there might be several base words
|
||||
from which an inflected form could arise.
|
||||
</dl>
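<p>
For example, reducing a single word with two of the dictionaries
listed above (an illustration of our own):
<pre>
SELECT lexize('en_stem', 'stars');
SELECT lexize('simple', 'Stars');
</pre>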
|
||||
|
||||
<h2><a name="ranking">Ranking</a></h2>
|
||||
|
||||
Ranking attempts to measure how relevant documents are to particular queries
|
||||
by inspecting the number of times each search word appears in the document,
|
||||
and whether different search terms occur near each other.
|
||||
Note that this information is only available in unstripped vectors —
|
||||
ranking functions will only return a useful result
|
||||
for a <tt>tsvector</tt> which still has position information!
|
||||
<p>
|
||||
Both of these ranking functions
|
||||
take an integer <i>normalization</i> option
|
||||
that specifies whether a document's length should impact its rank.
|
||||
This is often desirable,
|
||||
since a hundred-word document with five instances of a search word
|
||||
is probably more relevant than a thousand-word document with five instances.
|
||||
The option can have the values:
|
||||
|
||||
<ul>
|
||||
<li><tt>0</tt> (the default) ignores document length.
|
||||
<li><tt>1</tt> divides the rank by the logarithm of the length.
|
||||
<li><tt>2</tt> divides the rank by the length itself.
|
||||
</ul>
|
||||
|
||||
The two ranking functions currently available are:
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION rank(<br>
|
||||
<em>[</em> <i>weights</i> float4[], <em>]</em>
|
||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||
) RETURNS float4</tt>
|
||||
<dd>
|
||||
This is the ranking function from the old version of OpenFTS,
|
||||
and offers the ability to weight word instances more heavily
|
||||
depending on how you have classified them.
|
||||
The <i>weights</i> specify how heavily to weight each category of word:
|
||||
<pre
|
||||
>{<i>D-weight</i>, <i>A-weight</i>, <i>B-weight</i>, <i>C-weight</i>}</pre>
|
||||
If no weights are provided, then these defaults are used:
|
||||
<pre>{0.1, 0.2, 0.4, 1.0}</pre>
|
||||
Often weights are used to mark words from special areas of the document,
|
||||
like the title or an initial abstract,
|
||||
and make them more or less important than words in the document body.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION rank_cd(<br>
|
||||
<em>[</em> <i>K</i> int4, <em>]</em>
|
||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||
) RETURNS float4</tt>
|
||||
<dd>
|
||||
This function computes the cover density ranking
|
||||
for the given document <i>vector</i> and <i>query</i>,
|
||||
as described in Clarke, Cormack, and Tudhope's
|
||||
“<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
|
||||
>Relevance Ranking for One to Three Term Queries</a>”
|
||||
in the 1999 <i>Information Processing and Management</i>.
|
||||
The value <i>K</i> is one of the values from their formula,
|
||||
and defaults to <i>K</i>=4.
|
||||
The examples in their paper use <i>K</i>=16;
|
||||
we can roughly describe the term
|
||||
as stating how far apart two search terms can fall
|
||||
before the formula begins penalizing them for lack of proximity.
|
||||
</dl>
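<p>
A small illustration of ranking (an example of our own; the vector is
built on the fly, so it still carries the position information the
ranking functions need):
<pre>
SELECT rank(to_tsvector('default', 'the fat cat sat on a mat'),
            to_tsquery('default', 'cat & mat'));
SELECT rank_cd(to_tsvector('default', 'the fat cat sat on a mat'),
               to_tsquery('default', 'cat & mat'));
</pre>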
|
||||
|
||||
<h2><a name="headlines">Headlines</a></h2>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION headline(<br>
|
||||
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
|
||||
<i>document</i> text, <i>query</i> tsquery,
|
||||
<em>[</em> <i>options</i> text <em>]</em><br>
|
||||
) RETURNS text</tt>
|
||||
<dd>
|
||||
Every form of the <tt>headline()</tt> function
|
||||
accepts a <tt>document</tt> along with a <tt>query</tt>,
|
||||
and returns one or more ellipsis-separated excerpts from the document
|
||||
in which terms from the query are highlighted.
|
||||
The configuration with which to parse the document
|
||||
can be specified by either its <i>id</i> or <i>ts_name</i>;
|
||||
if none is specified then the current configuration is used instead.
|
||||
<p>
|
||||
An <i>options</i> string if provided should be a comma-separated list
|
||||
of one or more ‘<i>option</i><tt>=</tt><i>value</i>’ pairs.
|
||||
The available options are:
|
||||
<ul>
|
||||
<li><tt>StartSel</tt>, <tt>StopSel</tt> —
|
||||
the strings with which query words appearing in the document
|
||||
should be delimited to distinguish them from other excerpted words.
|
||||
<li><tt>MaxWords</tt>, <tt>MinWords</tt> —
|
||||
limits on the shortest and longest headlines you will accept.
|
||||
<li><tt>ShortWord</tt> —
|
||||
this prevents your headline from beginning or ending
|
||||
with a word which has this many characters or less.
|
||||
The default value of <tt>3</tt> should eliminate most English
|
||||
conjunctions and articles.
|
||||
</ul>
|
||||
Any unspecified options receive these defaults:
|
||||
<pre>
|
||||
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
|
||||
</pre>
|
||||
</dl>
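<p>
For example, to excerpt a short document with mostly default options
(an illustration of our own, using the 'default' configuration):
<pre>
SELECT headline('default',
                'The quick brown fox jumped over the lazy dog',
                to_tsquery('default', 'fox & dog'),
                'MaxWords=10, MinWords=5');
</pre>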
|
||||
|
||||
</body>
|
||||
</html>