mirror of
https://github.com/postgres/postgres.git
synced 2025-05-12 16:21:30 +03:00
976 lines
39 KiB
HTML
976 lines
39 KiB
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
|
|
<html>
|
|
<head>
|
|
<title>tsearch-v2-intro</title>
|
|
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
|
</head>
|
|
|
|
<body>
|
|
<div class="content">
|
|
<h2>Tsearch2 - Introduction</h2>
|
|
|
|
<p><a href=
|
|
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
|
|
[Online version]</a> of this document is available.</p>
|
|
|
|
<p>The tsearch2 module is available to add as an extension to
|
|
the PostgreSQL database to allow for Full Text Indexing. This
|
|
document is an introduction to installing, configuring, using
|
|
and maintaining the database with the tsearch2 module
|
|
activated.</p>
|
|
|
|
<p>Please, note, tsearch2 module is fully incompatible with old
|
|
tsearch, which is deprecated in 7.4 and will be obsoleted in
|
|
7.5.</p>
|
|
|
|
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
|
|
ENGINE</h3>
|
|
|
|
<p>This documentation is provided as a short guide on how to
|
|
quickly get up and running with tsearch2 and PostgreSQL, for
|
|
those who want to implement a full text indexed based search
|
|
engine. It is not meant to be a complete in-depth guide into
|
|
the full ins and outs of the contrib/tsearch2 module, and is
|
|
primarily aimed at beginners who want to speed up searching of
|
|
large text fields, or those migrating from other database
|
|
systems such as MS-SQL.</p>
|
|
|
|
<p>The README.tsearch2 file included in the contrib/tsearch2
|
|
directory contains a brief overview and history behind tsearch.
|
|
This can also be found online <a href=
|
|
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
|
|
here]</a>.</p>
|
|
|
|
<p>Further in depth documentation such as a full function
|
|
reference, and user guide can be found online at the <a href=
|
|
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
|
|
documentation home]</a>.</p>
|
|
|
|
<h3>ACKNOWLEDGEMENTS</h3>
|
|
|
|
<p>Robert John Shepherd originally wrote this documentation for
|
|
the previous version of tsearch module (v1) included with the
|
|
postgres release. I took his documentation and updated it to
|
|
comply with the tsearch2 modifications.</p>
|
|
|
|
<p>Robert's original acknowledgements:</p>
|
|
|
|
<p>"Thanks to Oleg Bartunov for taking the time to answer many
|
|
of my questions regarding this module, and also to Teodor
|
|
Sigaev for clearing up the process of making your own
|
|
dictionaries. Plus of course a big thanks to the pair of them
|
|
for writing this module in the first place!"</p>
|
|
|
|
<p>I would also like to extend my thanks to the developers, and
|
|
Oleg Bartunov for all of his direction and help with the new
|
|
features of tsearch2.</p>
|
|
|
|
<h3>OVERVIEW</h3>
|
|
|
|
<p>MS-SQL provides a full text indexing (FTI) system which
|
|
enables the fast searching of text based fields, very useful
|
|
for websites (and other applications) that require a results
|
|
set based on key words. PostgreSQL ships with a contributed
|
|
module called tsearch2, which implements a special type of
|
|
index that can also be used for full text indexing. Further
|
|
more, unlike MS' offering which requires regular incremental
|
|
rebuilds of the text indexes themselves, tsearch2 indexes are
|
|
always up-to-date and keeping them so induces very little
|
|
overhead.</p>
|
|
|
|
<p>Before we get into the details, it is recommended that you
|
|
have installed and tested PostgreSQL, are reasonably familiar
|
|
with databases, the SQL query language and also understand the
|
|
basics of connecting to PostgreSQL from the local shell. This
|
|
document isn't intended for the complete PostgreSQL newbie, but
|
|
anyone with a reasonable grasp of the basics should be able to
|
|
follow it.</p>
|
|
|
|
<h3>INSTALLATION</h3>
|
|
|
|
<p>Starting with PostgreSQL version 7.4 tsearch2 is now
|
|
included in the contrib directory with the PostgreSQL sources.
|
|
contrib/tsearch2 is where you will find everything needed to
|
|
install and use tsearch2. Please note that tsearch2 will also
|
|
work with PostgreSQL version 7.3.x, but it is not the module
|
|
included with the source distribution. You will have to
|
|
download the module separately and install it in the same
|
|
fashion.</p>
|
|
|
|
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
|
|
from the contrib directory without squashing the original (old)
|
|
tsearch module. What I did was move the modules tsearch src
|
|
driectory into the contrib tree under the name tsearchV2.</p>
|
|
|
|
<p>Step one is to download the tsearch V2 module :</p>
|
|
|
|
<p><a href=
|
|
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
|
|
(check Development History for latest stable version !)</p>
|
|
<pre>
|
|
tar -zxvf tsearch-v2.tar.gz
|
|
mv tsearch2 PGSQL_SRC/contrib/
|
|
cd PGSQL_SRC/contrib/tsearch2
|
|
</pre>
|
|
|
|
<p>If you are installing from PostgreSQL version 7.4 or higher,
|
|
you can skip those steps and just change to the
|
|
contrib/tsearch2 directory in the source tree and continue from
|
|
there.</p>
|
|
|
|
<p>Then continue with the regular building and installation
|
|
process</p>
|
|
<pre>
|
|
gmake
|
|
gmake install
|
|
gmake installcheck
|
|
</pre>
|
|
|
|
<p>That is pretty much all you have to do, unless of course you
|
|
get errors. However if you get those, you better go check with
|
|
the mailing lists over at <a href=
|
|
"http://www.postgresql.org">http://www.postgresql.org</a> or
|
|
<a href=
|
|
"http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
|
|
since its never failed for me.</p>
|
|
|
|
<p>The directory in the contib/ and the directory from the
|
|
archive is called tsearch2. Tsearch2 is completely incompatible
|
|
with the previous version of tsearch. This means that both
|
|
versions can be installed into a single database, and migration
|
|
the new version may be much easier.</p>
|
|
|
|
<p>NOTE: the previous version of tsearch found in the
|
|
contrib/tsearch directory is depricated. ALthough it is still
|
|
available and included within PostgreSQL version 7.4. It will
|
|
be removed in version 7.5.</p>
|
|
|
|
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
|
|
|
|
<p>We should create a database to use as an example for the
|
|
remainder of this file. We can call the database "ftstest". You
|
|
can create it from the command line like this:</p>
|
|
<pre>
|
|
#createdb ftstest
|
|
</pre>
|
|
|
|
<p>If you thought installation was easy, this next bit is even
|
|
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
|
|
type:</p>
|
|
<pre>
|
|
psql ftstest < tsearch2.sql
|
|
</pre>
|
|
|
|
<p>The file "tsearch2.sql" holds all the wonderful little
|
|
goodies you need to do full text indexing. It defines numerous
|
|
functions and operators, and creates the needed tables in the
|
|
database. There will be 4 new tables created after running the
|
|
tsearch2.sql file : pg_ts_dict, pg_ts_parser, pg_ts_cfg,
|
|
pg_ts_cfgmap are added.</p>
|
|
|
|
<p>You can check out the tables if you like:</p>
|
|
<pre>
|
|
#psql ftstest
|
|
ftstest=# \d
|
|
List of relations
|
|
Schema | Name | Type | Owner
|
|
--------+--------------+-------+----------
|
|
public | pg_ts_cfg | table | kopciuch
|
|
public | pg_ts_cfgmap | table | kopciuch
|
|
public | pg_ts_dict | table | kopciuch
|
|
public | pg_ts_parser | table | kopciuch
|
|
(4 rows)
|
|
</pre>
|
|
|
|
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
|
|
|
|
<p>The first thing we can do is try out some of the types that
|
|
are provided for us. Lets look at the tsvector type provided
|
|
for us:</p>
|
|
<pre>
|
|
SELECT 'Our first string used today'::tsvector;
|
|
tsvector
|
|
---------------------------------------
|
|
'Our' 'used' 'first' 'today' 'string'
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>The results are the words used within our string. Notice
|
|
they are not in any particular order. The tsvector type returns
|
|
a string of space separated words.</p>
|
|
<pre>
|
|
SELECT 'Our first string used today first string'::tsvector;
|
|
tsvector
|
|
-----------------------------------------------
|
|
'Our' 'used' 'again' 'first' 'today' 'string'
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>Notice the results string has each unique word ('first' and
|
|
'string' only appear once in the tsvector value). Which of
|
|
course makes sense if you are searching the full text ... you
|
|
only need to know each unique word in the text.</p>
|
|
|
|
<p>Those examples were just casting a text field to that of
|
|
type tsvector. Lets check out one of the new functions created
|
|
by the tsearch2 module.</p>
|
|
|
|
<p>The function to_tsvector has 3 possible signatures:</p>
|
|
<pre>
|
|
to_tsvector(oid, text);
|
|
to_tsvector(text, text);
|
|
to_tsvector(text);
|
|
</pre>
|
|
|
|
<p>We will use the second method using two text fields. The
|
|
overloaded methods provide us with a way to specifiy the way
|
|
the searchable text is broken up into words (Stemming process).
|
|
Right now we will specify the 'default' configuration. See the
|
|
section on TSEARCH2 CONFIGURATION to learn more about this.</p>
|
|
<pre>
|
|
SELECT to_tsvector('default',
|
|
'Our first string used today first string');
|
|
to_tsvector
|
|
--------------------------------------------
|
|
'use':4 'first':2,6 'today':5 'string':3,7
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>The result returned from this function is of type tsvector.
|
|
The results came about by this reasoning: All of the words in
|
|
the text passed in are stemmed, or not used because they are
|
|
stop words defined in our configuration. Each lower case
|
|
morphed word is returned with all of the positons in the
|
|
text.</p>
|
|
|
|
<p>In this case the word "Our" is a stop word in the default
|
|
configuration. That means it will not be included in the
|
|
result. The word "first" is found at positions 2 and 6
|
|
(although "Our" is a stop word, it's position is maintained).
|
|
The word(s) positioning is maintained exactly as in the
|
|
original string. The word "used" is morphed to the word "use"
|
|
based on the default configuration for word stemming, and is
|
|
found at position 4. The rest of the results follow the same
|
|
logic. Just a reminder again ... the order of the 'word'
|
|
position in the output is not in any kind of order. (ie 'use':4
|
|
appears first)</p>
|
|
|
|
<p>If you want to view the output of the tsvector fields
|
|
without their positions, you can do so with the function
|
|
"strip(tsvector)".</p>
|
|
<pre>
|
|
SELECT strip(to_tsvector('default',
|
|
'Our first string used today first string'));
|
|
strip
|
|
--------------------------------
|
|
'use' 'first' 'today' 'string'
|
|
</pre>
|
|
|
|
<p>If you wish to know the number of unique words returned in
|
|
the tsvector you can do so by using the function
|
|
"length(tsvector)"</p>
|
|
<pre>
|
|
SELECT length(to_tsvector('default',
|
|
'Our first string used today first string'));
|
|
length
|
|
--------
|
|
4
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>Lets take a look at the function to_tsquery. It also has 3
|
|
signatures which follow the same rational as the to_tsvector
|
|
function:</p>
|
|
<pre>
|
|
to_tsquery(oid, text);
|
|
to_tsquery(text, text);
|
|
to_tsquery(text);
|
|
</pre>
|
|
|
|
<p>Lets try using the function with a single word :</p>
|
|
<pre>
|
|
SELECT to_tsquery('default', 'word');
|
|
to_tsquery
|
|
-----------
|
|
'word'
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>I call the function the same way I would a to_tsvector
|
|
function, specifying the 'default' configuration for morphing,
|
|
and the result is the stemmed output 'word'.</p>
|
|
|
|
<p>Lets attempt to use the function with a string of multiple
|
|
words:</p>
|
|
<pre>
|
|
SELECT to_tsquery('default', 'this is many words');
|
|
ERROR: Syntax error
|
|
</pre>
|
|
|
|
<p>The function can not accept a space separated string. The
|
|
intention of the to_tsquery function is to return a type of
|
|
"tsquery" used for searching a tsvector field. What we need to
|
|
do is search for one to many words with some kind of logic (for
|
|
now simple boolean).</p>
|
|
<pre>
|
|
SELECT to_tsquery('default', 'searching|sentence');
|
|
to_tsquery
|
|
----------------------
|
|
'search' | 'sentenc'
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>Notice that the words are separated by the boolean logic
|
|
"OR", the text could contain boolean operators &,|,!,()
|
|
with their usual meaning.</p>
|
|
|
|
<p>You can not use words defined as being a stop word in your
|
|
configuration. The function will not fail ... you will just get
|
|
no result, and a NOTICE like this:</p>
|
|
<pre>
|
|
SELECT to_tsquery('default', 'a|is&not|!the');
|
|
NOTICE: Query contains only stopword(s)
|
|
or doesn't contain lexem(s), ignored
|
|
to_tsquery
|
|
-----------
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>That is a beginning to using the types, and functions
|
|
defined in the tsearch2 module. There are numerous more
|
|
functions that I have not touched on. You can read through the
|
|
tsearch2.sql file built when compiling to get more familiar
|
|
with what is included.</p>
|
|
|
|
<h3>INDEXING FIELDS IN A TABLE</h3>
|
|
|
|
<p>The next stage is to add a full text index to an existing
|
|
table. In this example we already have a table defined as
|
|
follows:</p>
|
|
<pre>
|
|
CREATE TABLE tblMessages
|
|
(
|
|
intIndex int4,
|
|
strTopic varchar(100),
|
|
strMessage text
|
|
);
|
|
</pre>
|
|
|
|
<p>We are assuming there are several rows with some kind of
|
|
data in them. Any data will do, just do several inserts with
|
|
test strings for a topic, and a message. here is some test data
|
|
I inserted. (yes I know it's completely useless stuff ;-) but
|
|
it will serve our purpose right now).</p>
|
|
<pre>
|
|
INSERT INTO tblMessages
|
|
VALUES ('1', 'Testing Topic', 'Testing message data input');
|
|
INSERT INTO tblMessages
|
|
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
|
|
INSERT INTO tblMessages
|
|
VALUES ('3', 'Famous Author', 'Stephen King');
|
|
INSERT INTO tblMessages
|
|
VALUES ('4', 'Political Topic',
|
|
'Nelson Mandella is released from prison');
|
|
INSERT INTO tblMessages
|
|
VALUES ('5', 'Nursery rhyme phrase',
|
|
'Little jack horner sat in a corner');
|
|
INSERT INTO tblMessages
|
|
VALUES ('6', 'Gettysburg address quotation',
|
|
'Four score and seven years ago'
|
|
' our fathers brought forth on this'
|
|
' continent a new nation, conceived in'
|
|
' liberty and dedicated to the proposition'
|
|
' that all men are created equal');
|
|
INSERT INTO tblMessages
|
|
VALUES ('7', 'Classic Rock Bands',
|
|
'Led Zeppelin Grateful Dead and The Sex Pistols');
|
|
INSERT INTO tblMessages
|
|
VALUES ('8', 'My birth address',
|
|
'18 Sommervile road, Regina, Saskatchewan');
|
|
INSERT INTO tblMessages
|
|
VALUES ('9', 'Joke', 'knock knock : who\'s there?'
|
|
' I will not finish this joke');
|
|
INSERT INTO tblMessages
|
|
VALUES ('10', 'Computer information',
|
|
'My computer is a pentium III 400 mHz'
|
|
' with 192 megabytes of RAM');
|
|
</pre>
|
|
|
|
<p>The next stage is to create a special text index which we
|
|
will use for FTI, so we can search our table of messages for
|
|
words or a phrase. We do this using the SQL command:</p>
|
|
<pre>
|
|
ALTER TABLE tblMessages ADD idxFTI tsvector;
|
|
</pre>
|
|
|
|
<p>Note that unlike traditional indexes, this is actually a new
|
|
field in the same table, which is then used (through the magic
|
|
of the tsearch2 operators and functions) by a special index we
|
|
will create in a moment.</p>
|
|
|
|
<p>The general rule for the initial insertion of data will
|
|
follow four steps:</p>
|
|
<pre>
|
|
1. update table
|
|
2. vacuum full analyze
|
|
3. create index
|
|
4. vacuum full analyze
|
|
</pre>
|
|
|
|
<p>The data can be updated into the table, the vacuum full
|
|
analyze will reclaim unused space. The index can be created on
|
|
the table after the data has been inserted. Having the index
|
|
created prior to the update will slow down the process. It can
|
|
be done in that manner, this way is just more efficient. After
|
|
the index has been created on the table, vacuum full analyze is
|
|
run again to update postgres's statistics (ie having the index
|
|
take effect).</p>
|
|
<pre>
|
|
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
|
|
VACUUM FULL ANALYZE;
|
|
</pre>
|
|
|
|
<p>Note that this only inserts the field strMessage as a
|
|
tsvector, so if you want to also add strTopic to the
|
|
information stored, you should instead do the following, which
|
|
effectively concatenates the two fields into one before being
|
|
inserted into the table:</p>
|
|
<pre>
|
|
UPDATE tblMessages
|
|
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
|
|
VACUUM FULL ANALYZE;
|
|
</pre>
|
|
|
|
<p><strong>Using the coalesce function makes sure this
|
|
concatenation also works with NULL fields.</strong></p>
|
|
|
|
<p>We need to create the index on the column idxFTI. Keep in
|
|
mind that the database will update the index when some action
|
|
is taken. In this case we _need_ the index (The whole point of
|
|
Full Text INDEXINGi ;-)), so don't worry about any indexing
|
|
overhead. We will create an index based on the gist function.
|
|
GiST is an index structure for Generalized Search Tree.</p>
|
|
<pre>
|
|
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
|
|
VACUUM FULL ANALYZE;
|
|
</pre>
|
|
|
|
<p>After you have converted all of your data and indexed the
|
|
column, you can select some rows to see what actually happened.
|
|
I will not display output here but you can play around
|
|
yourselves and see what happened.</p>
|
|
|
|
<p>The last thing to do is set up a trigger so every time a row
|
|
in this table is changed, the text index is automatically
|
|
updated. This is easily done using:</p>
|
|
<pre>
|
|
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
|
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
|
|
</pre>
|
|
|
|
<p>Or if you are indexing both strMessage and strTopic you
|
|
should instead do:</p>
|
|
<pre>
|
|
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
|
FOR EACH ROW EXECUTE PROCEDURE
|
|
tsearch2(idxFTI, strTopic, strMessage);
|
|
</pre>
|
|
|
|
<p>Before you ask, the tsearch2 function accepts multiple
|
|
fields as arguments so there is no need to concatenate the two
|
|
into one like we did before.</p>
|
|
|
|
<p>If you want to do something specific with columns, you may
|
|
write your very own trigger function using plpgsql or other
|
|
procedural languages (but not SQL, unfortunately) and use it
|
|
instead of <em>tsearch2</em> trigger.</p>
|
|
|
|
<p>You could however call other stored procedures from within
|
|
the tsearch2 function. Lets say we want to create a function to
|
|
remove certain characters (like the @ symbol from all
|
|
text).</p>
|
|
<pre>
|
|
CREATE FUNCTION dropatsymbol(text)
|
|
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
|
|
</pre>
|
|
|
|
<p>Now we can use this function within the tsearch2 function on
|
|
the trigger.</p>
|
|
<pre>
|
|
DROP TRIGGER tsvectorupdate ON tblmessages;
|
|
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
|
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
|
|
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
|
|
</pre>
|
|
|
|
<p>If at this point you receive an error stating: ERROR: Can't
|
|
find tsearch config by locale</p>
|
|
|
|
<p>Do not worry. You have done nothing wrong. And tsearch2 is
|
|
not broken. All that has happened here is that the
|
|
configuration is setup to use a configuration based on the
|
|
locale of the server. All you have to do is change your default
|
|
configuration, or add a new one for your specific locale. See
|
|
the section on TSEARCH2 CONFIGURATION.</p>
|
|
<pre class="real">
|
|
SELECT * FROM tblmessages WHERE intindex = 69;
|
|
|
|
intindex | strtopic | strmessage | idxfti
|
|
----------+--------------------------+---------------+-----------------------
|
|
69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
|
|
(1 row)
|
|
</pre>Notice that the string content was passed throught the stored
|
|
procedure dropatsymbol. The '@' character was replaced with a
|
|
single space ... and the output from the procedure was then stored
|
|
in the tsvector column.
|
|
|
|
<p>This could be useful for removing other characters from
|
|
indexed text, or any kind of preprocessing needed to be done on
|
|
the text prior to insertion into the index.</p>
|
|
|
|
<h3>QUERYING A TABLE</h3>
|
|
|
|
<p>There are some examples in the README.tsearch2 file for
|
|
querying a table. One major difference between tsearch and
|
|
tsearch2 is the operator ## is no longer available. Only the
|
|
operator @@ is defined, using the types tsvector on one side
|
|
and tsquery on the other side.</p>
|
|
|
|
<p>Lets search the indexed data for the word "Test". I indexed
|
|
based on the the concatenation of the strTopic, and the
|
|
strMessage:</p>
|
|
<pre>
|
|
SELECT intindex, strtopic FROM tblmessages
|
|
WHERE idxfti @@ 'test'::tsquery;
|
|
intindex | strtopic
|
|
----------+---------------
|
|
1 | Testing Topic
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>The only result that matched was the row with a topic
|
|
"Testing Topic". Notice that the word I search for was all
|
|
lowercase. Let's see what happens when I query for uppercase
|
|
"Test".</p>
|
|
<pre>
|
|
SELECT intindex, strtopic FROM tblmessages
|
|
WHERE idxfti @@ 'Test'::tsquery;
|
|
intindex | strtopic
|
|
----------+----------
|
|
(0 rows)
|
|
</pre>
|
|
|
|
<p>We get zero rows returned. The reason is because when the
|
|
text was inserted, it was morphed to my default configuration
|
|
(because of the call to to_tsvector in the UPDATE statement).
|
|
If there was no morphing done, and the tsvector field(s)
|
|
contained the word 'Text', a match would have been found.</p>
|
|
|
|
<p>Most likely the best way to query the field is to use the
|
|
to_tsquery function on the right hand side of the @@ operator
|
|
like this:</p>
|
|
<pre>
|
|
SELECT intindex, strtopic FROM tblmessages
|
|
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
|
|
intindex | strtopic
|
|
----------+--------------------
|
|
1 | Testing Topic
|
|
7 | Classic Rock Bands
|
|
(2 rows)
|
|
</pre>
|
|
|
|
<p>That query searched for all instances of "Test" OR
|
|
"Zeppelin". It returned two rows: the "Testing Topic" row, and
|
|
the "Classic Rock Bands" row. The to_tsquery function performed
|
|
the correct morphology upon the parameters, and searched the
|
|
tsvector field appropriately.</p>
|
|
|
|
<p>The last example here relates to searching for a phrase, for
|
|
example "minority report". This poses a problem with regard to
|
|
tsearch2, as it doesn't index phrases, only words. But there is
|
|
a way around which doesn't appear to have a significant impact
|
|
on query time, and that is to use a query such as the
|
|
following:</p>
|
|
<pre>
|
|
SELECT intindex, strTopic FROM tblmessages
|
|
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
|
AND strMessage ~* '.*men are created equal.*';
|
|
intindex | strtopic
|
|
----------+------------------------------
|
|
6 | Gettysburg address quotation
|
|
(1 row)
|
|
SELECT intindex, strTopic FROM tblmessages
|
|
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
|
AND strMessage ~* '.*something that does not exist.*';
|
|
intindex | strtopic
|
|
----------+----------
|
|
(0 rows)
|
|
</pre>
|
|
|
|
<p>Of course if your indexing both strTopic and strMessage, and
|
|
want to search for this phrase on both, then you will have to
|
|
get out the brackets and extend this query a little more.</p>
|
|
|
|
<h3>TSEARCH2 CONFIGURATION</h3>
|
|
|
|
<p>Some words such as "and", "the", and "who" are automatically
|
|
not indexed, since they belong to a pre-existing dictionary of
|
|
"Stop Words" which tsearch2 does not perform indexing on. If
|
|
someone needs to search for "The Who" in your database, they
|
|
are going to have a tough time coming up with any results,
|
|
since both are ignored in the indexes. But there is a
|
|
solution.</p>
|
|
|
|
<p>Lets say we want to add a word into the stop word list for
|
|
english stemming. We could edit the file
|
|
:'/usr/local/pgsql/share/english.stop' and add a word to the
|
|
list. I edited mine to exclude my name from indexing:</p>
|
|
<pre>
|
|
- Edit /usr/local/pgsql/share/english.stop
|
|
- Add 'andy' to the list
|
|
- Save the file.
|
|
</pre>
|
|
|
|
<p>When you connect to the database, the dict_init procedure is
|
|
run during initialization. And in my configuration it will read
|
|
the stop words from the file I just edited. If you were
|
|
connected to the DB while editing the stop words, you will need
|
|
to end the current session and re-connect. When you re-connect
|
|
to the database, 'andy' is no longer indexed:</p>
|
|
<pre>
|
|
SELECT to_tsvector('default', 'Andy');
|
|
to_tsvector
|
|
------------
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>Originally I would get the result :</p>
|
|
<pre>
|
|
SELECT to_tsvector('default', 'Andy');
|
|
to_tsvector
|
|
------------
|
|
'andi':1
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>But since I added it as a stop word, it would be ingnored on
|
|
the indexing. The stop word added was used in the dictionary
|
|
"en_stem". If I were to use a different configuration such as
|
|
'simple', the results would be different. There are no stop
|
|
words for the simple dictionary. It will just convert to lower
|
|
case, and index every unique word.</p>
|
|
<pre>
|
|
SELECT to_tsvector('simple', 'Andy andy The the in out');
|
|
to_tsvector
|
|
-------------------------------------
|
|
'in':5 'out':6 'the':3,4 'andy':1,2
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>All this talk about which configuration to use is leading us
|
|
into the actual configuration of tsearch2. In the examples in
|
|
this document the configuration has always been specified when
|
|
using the tsearch2 functions:</p>
|
|
<pre>
|
|
SELECT to_tsvector('default', 'Testing the default config');
|
|
SELECT to_tsvector('simple', 'Example of simple Config');
|
|
</pre>
|
|
|
|
<p>The pg_ts_cfg table holds each configuration you can use
|
|
with the tsearch2 functions. As you can see the ts_name column
|
|
contains both the 'default' configurations based on the 'C'
|
|
locale. And the 'simple' configuration which is not based on
|
|
any locale.</p>
|
|
<pre>
|
|
SELECT * from pg_ts_cfg;
|
|
ts_name | prs_name | locale
|
|
-----------------+----------+--------------
|
|
default | default | C
|
|
default_russian | default | ru_RU.KOI8-R
|
|
simple | default |
|
|
(3 rows)
|
|
</pre>
|
|
|
|
<p>Each row in the pg_ts_cfg table contains the name of the
|
|
tsearch2 configuration, the name of the parser to use, and the
|
|
locale mapped to the configuration. There is only one parser to
|
|
choose from the table pg_ts_parser called 'default'. More
|
|
parsers could be written, but for our needs we will use the
|
|
default.</p>
|
|
|
|
<p>There are 3 configurations installed by tsearch2 initially.
|
|
If your locale is set to 'en_US' for example (like my laptop),
|
|
then as you can see there is currently no dictionary configured
|
|
to use with that locale. You can either set up a new
|
|
configuration or just use one that already exists. If I do not
|
|
specify which configuration to use in the to_tsvector function,
|
|
I receive the following error.</p>
|
|
<pre>
|
|
SELECT to_tsvector('learning tsearch is like going to school');
|
|
ERROR: Can't find tsearch config by locale
|
|
</pre>
|
|
|
|
<p>We will create a new configuration for use with the server
|
|
encoding 'en_US'. The first step is to add a new configuration
|
|
into the pg_ts_cfg table. We will call the configuration
|
|
'default_english', with the default parser and use the locale
|
|
'en_US'.</p>
|
|
<pre>
|
|
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
|
|
VALUES ('default_english', 'default', 'en_US');
|
|
</pre>
|
|
|
|
<p>We have only declared that there is a configuration called
|
|
'default_english'. We need to set the configuration of how
|
|
'default_english' will work. The next step is creating a new
|
|
dictionary to use. The configuration of the dictionary is
|
|
completlely different in tsearch2. In the prior versions to
|
|
make changes, you would have to re-compile your changes into
|
|
the tsearch.so. All of the configuration has now been moved
|
|
into the system tables created by executing the SQL code from
|
|
tsearch2.sql</p>
|
|
|
|
<p>Lets take a first look at the pg_ts_dict table</p>
|
|
<pre>
|
|
ftstest=# \d pg_ts_dict
|
|
Table "public.pg_ts_dict"
|
|
Column | Type | Modifiers
|
|
-----------------+---------+-----------
|
|
dict_name | text | not null
|
|
dict_init | oid |
|
|
dict_initoption | text |
|
|
dict_lemmatize | oid | not null
|
|
dict_comment | text |
|
|
Indexes: pg_ts_dict_idx unique btree (dict_name)
|
|
</pre>
|
|
|
|
<p>The dict_name column is the name of the dictionary, for
|
|
example 'simple', 'en_stem' or 'ru_stem'. The dict_init column
|
|
is an OID of a stored procedure to run for initialization of
|
|
that dictionary, for example 'snb_en_init' or 'snb_ru_init'.
|
|
The dict_init option is used for options passed to the init
|
|
function for the stored procedure. In the cases of 'en_stem' or
|
|
'ru_stem' it is a path to a stopword file for that dictionary,
|
|
for example '/usr/local/pgsql/share/english.stop'. This is
|
|
however dictated by the dictionary. ISpell dictionaries may
|
|
require different options. The dict_lemmatize column is another
|
|
OID of a stored procedure to the function used to lemmitize,
|
|
for example 'snb_lemmatize'. The dict_comment column is just a
|
|
comment.</p>
|
|
|
|
<p>Next we will configure the use of a new dictionary based on
|
|
ISpell. We will assume you have ISpell installed on you
|
|
machine. (in /usr/local/lib)</p>
|
|
|
|
<p>First lets register the dictionary(ies) to use from ISpell.
|
|
We will use the english dictionary from ISpell. We insert the
|
|
paths to the relevant ISpell dictionary (*.hash) and affixes
|
|
(*.aff) files. There seems to be some question as to which
|
|
ISpell files are to be used. I installed ISpell from the latest
|
|
sources on my computer. The installation installed the
|
|
dictionary files with an extension of *.hash. Some
|
|
installations install with an extension of *.dict As far as I
|
|
know the two extensions are equivilant. So *.hash ==
|
|
*.dict.</p>
|
|
|
|
<p>We will also continue to use the english word stop file that
|
|
was installed for the en_stem dictionary. You could use a
|
|
different one if you like. The ISpell configuration is based on
|
|
the "ispell_template" dictionary installed by default with
|
|
tsearch2. We will use the OIDs to the stored procedures from
|
|
the row where the dict_name = 'ispell_template'.</p>
|
|
<pre>
|
|
INSERT INTO pg_ts_dict
|
|
(SELECT 'en_ispell',
|
|
dict_init,
|
|
'DictFile="/usr/local/lib/english.hash",'
|
|
'AffFile="/usr/local/lib/english.aff",'
|
|
'StopFile="/usr/local/pgsql/share/english.stop"',
|
|
dict_lexize
|
|
FROM pg_ts_dict
|
|
WHERE dict_name = 'ispell_template');
|
|
</pre>
|
|
|
|
<p>Next we need to set up the configuration for mapping the
|
|
dictionay use to the lexxem parsings. This will be done by
|
|
altering the pg_ts_cfgmap table. We will insert several rows,
|
|
specifying to using the new dictionary we installed and
|
|
configured for use within tsearch2. There are several type of
|
|
lexims we would be concerned with forcing the use of the ISpell
|
|
dictionary.</p>
|
|
<pre>
|
|
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
|
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
|
|
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
|
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
|
|
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
|
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
|
|
</pre>
|
|
|
|
<p>We have just inserted 3 records to the configuration
|
|
mapping, specifying that the lexem types for "lhword,
|
|
lpart_hword and lword" are to be stemmed using the 'en_ispell'
|
|
dictionary we added into pg_ts_dict, when using the
|
|
configuration ' default_english' which we added to
|
|
pg_ts_cfg.</p>
|
|
|
|
<p>There are several other lexem types used that we do not need
|
|
to specify as using the ISpell dictionary. We can simply insert
|
|
values using the 'simple' stemming process dictionary.</p>
|
|
<pre>
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'url', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'host', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'sfloat', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'uri', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'int', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'float', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'email', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'word', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'hword', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'nlword', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'nlpart_hword', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'part_hword', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'nlhword', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'file', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'uint', '{simple}');
|
|
INSERT INTO pg_ts_cfgmap
|
|
VALUES ('default_english', 'version', '{simple}');
|
|
</pre>
|
|
|
|
<p>Our addition of a configuration for 'default_english' is now
|
|
complete. We have successfully created a new tsearch2
|
|
configuration. At the same time we have also set the new
|
|
configuration to be our default for en_US locale.</p>
|
|
<pre>
|
|
SELECT to_tsvector('default_english',
|
|
'learning tsearch is like going to school');
|
|
to_tsvector
|
|
--------------------------------------------------
|
|
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
|
SELECT to_tsvector('learning tsearch is like going to school');
|
|
to_tsvector
|
|
--------------------------------------------------
|
|
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
|
(1 row)
|
|
</pre>
|
|
|
|
<p>In the case that you already have a configuration set for
|
|
the locale, and you are changing it to your new dictionary
|
|
configuration. You will have to set the old locale to NULL. If
|
|
we are using the 'C' locale then we would do this:</p>
|
|
<pre>
|
|
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
|
|
</pre>
|
|
|
|
<p>That about wraps up the configuration of tsearch2. There is
|
|
much more you can do with the tables provided. This was just an
|
|
introduction to get things working rather quickly.</p>
|
|
|
|
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
|
|
|
|
<p>To aid in the addition of new dictionaries to the tsearch2
|
|
module you can use another additional module in combination
|
|
with tsearch2. The gendict module is included into tsearch2
|
|
distribution and is available from gendict/ subdirectory.</p>
|
|
|
|
<p>I will not go into detail about installation and
|
|
instructions on how to use gendict to it's fullest extent right
|
|
now. You can read the README.gendict ... it has all of the
|
|
instructions and information you will need.</p>
|
|
|
|
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE
|
|
TSEARCH2</h3>
|
|
|
|
<p>Believe it or not, this isn't as straight forward as it
|
|
should be, and you will have problems trying to backup and
|
|
restore any database which uses tsearch2 unless you take the
|
|
steps shown below. And before you ask using pg_dumpall will
|
|
result in failure every time. These took a lot of trial and
|
|
error to get working, but the process as laid down below has
|
|
been used a dozen times now in live production environments so
|
|
it should work fine.</p>
|
|
|
|
<p>HOWEVER never rely on anyone elses instructions to backup
|
|
and restore a database system, always develop and understand
|
|
your own methodology, and test it numerous times before you
|
|
need to do it for real.</p>
|
|
|
|
<p>To Backup a PostgreSQL database that uses the tsearch2
|
|
module:</p>
|
|
|
|
<p>1) Backup any global database objects such as users and
|
|
groups (this step is usually only necessary when you will be
|
|
restoring to a virgin system)</p>
|
|
<pre>
|
|
pg_dumpall -g > GLOBALobjects.sql
|
|
</pre>
|
|
|
|
<p>2) Backup the full database schema using pg_dump</p>
|
|
<pre>
|
|
pg_dump -s DATABASE > DATABASEschema.sql
|
|
</pre>
|
|
|
|
<p>3) Backup the full database using pg_dump</p>
|
|
<pre>
|
|
pg_dump -Fc DATABASE > DATABASEdata.tar
|
|
</pre>
|
|
|
|
<p>To Restore a PostgreSQL database that uses the tsearch2
|
|
module:</p>
|
|
|
|
<p>1) Create the blank database</p>
|
|
<pre>
|
|
createdb DATABASE
|
|
</pre>
|
|
|
|
<p>2) Restore any global database objects such as users and
|
|
groups (this step is usually only necessary when you will be
|
|
restoring to a virgin system)</p>
|
|
<pre>
|
|
psql DATABASE < GLOBALobjects.sql
|
|
</pre>
|
|
|
|
<p>3) Create the tsearch2 objects, functions and operators</p>
|
|
<pre>
|
|
psql DATABASE < tsearch2.sql
|
|
</pre>
|
|
|
|
<p>4) Edit the backed up database schema and delete all SQL
|
|
commands which create tsearch2 related functions, operators and
|
|
data types, BUT NOT fields in table definitions that specify
|
|
tsvector types. If your not sure what these are, they are the
|
|
ones listed in tsearch2.sql. Then restore the edited schema to
|
|
the database</p>
|
|
<pre>
|
|
psql DATABASE < DATABASEschema.sql
|
|
</pre>
|
|
|
|
<p>5) Restore the data for the database</p>
|
|
<pre>
|
|
pg_restore -N -a -d DATABASE DATABASEdata.tar
|
|
</pre>
|
|
|
|
<p>If you get any errors in step 4, it will most likely be
|
|
because you forgot to remove an object that was created in
|
|
tsearch2.sql. Any errors in step 5 will mean the database
|
|
schema was probably restored wrongly.</p>
|
|
</div>
|
|
</body>
|
|
</html>
|