<!-- contrib/tsearch2/docs/tsearch-V2-intro.html -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>tsearch-v2-intro</title>
|
||||
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<div class="content">
|
||||
<h2>Tsearch2 - Introduction</h2>
|
||||
|
||||
<p><a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
|
||||
[Online version]</a> of this document is available.</p>
|
||||
|
||||
<p>The tsearch2 module is available to add as an extension to
|
||||
the PostgreSQL database to allow for Full Text Indexing. This
|
||||
document is an introduction to installing, configuring, using
|
||||
and maintaining the database with the tsearch2 module
|
||||
activated.</p>
|
||||
|
||||
<p>Please note that the tsearch2 module is fully incompatible with
the old tsearch module, which is deprecated in 7.4 and will be
obsoleted in 7.5.</p>
|
||||
|
||||
<h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
|
||||
ENGINE</h3>
|
||||
|
||||
<p>This documentation is provided as a short guide on how to
|
||||
quickly get up and running with tsearch2 and PostgreSQL, for
|
||||
those who want to implement a full text indexed based search
|
||||
engine. It is not meant to be a complete in-depth guide into
|
||||
the full ins and outs of the contrib/tsearch2 module, and is
|
||||
primarily aimed at beginners who want to speed up searching of
|
||||
large text fields, or those migrating from other database
|
||||
systems such as MS-SQL.</p>
|
||||
|
||||
<p>The README.tsearch2 file included in the contrib/tsearch2
|
||||
directory contains a brief overview and history behind tsearch.
|
||||
This can also be found online <a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
|
||||
here]</a>.</p>
|
||||
|
||||
<p>Further in depth documentation such as a full function
|
||||
reference, and user guide can be found online at the <a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
|
||||
documentation home]</a>.</p>
|
||||
|
||||
<h3>ACKNOWLEDGEMENTS</h3>
|
||||
|
||||
<p>Robert John Shepherd originally wrote this documentation for
|
||||
the previous version of tsearch module (v1) included with the
|
||||
postgres release. I took his documentation and updated it to
|
||||
comply with the tsearch2 modifications.</p>
|
||||
|
||||
<p>Robert's original acknowledgements:</p>
|
||||
|
||||
<p>"Thanks to Oleg Bartunov for taking the time to answer many
|
||||
of my questions regarding this module, and also to Teodor
|
||||
Sigaev for clearing up the process of making your own
|
||||
dictionaries. Plus of course a big thanks to the pair of them
|
||||
for writing this module in the first place!"</p>
|
||||
|
||||
<p>I would also like to extend my thanks to the developers, and
|
||||
Oleg Bartunov for all of his direction and help with the new
|
||||
features of tsearch2.</p>
|
||||
|
||||
<h3>OVERVIEW</h3>
|
||||
|
||||
<p>MS-SQL provides a full text indexing (FTI) system which
|
||||
enables the fast searching of text based fields, very useful
|
||||
for websites (and other applications) that require a results
|
||||
set based on key words. PostgreSQL ships with a contributed
|
||||
module called tsearch2, which implements a special type of
|
||||
index that can also be used for full text indexing. Furthermore,
unlike MS' offering, which requires regular incremental
|
||||
rebuilds of the text indexes themselves, tsearch2 indexes are
|
||||
always up-to-date and keeping them so induces very little
|
||||
overhead.</p>
|
||||
|
||||
<p>Before we get into the details, it is recommended that you
|
||||
have installed and tested PostgreSQL, are reasonably familiar
|
||||
with databases, the SQL query language and also understand the
|
||||
basics of connecting to PostgreSQL from the local shell. This
|
||||
document isn't intended for the complete PostgreSQL newbie, but
|
||||
anyone with a reasonable grasp of the basics should be able to
|
||||
follow it.</p>
|
||||
|
||||
<h3>INSTALLATION</h3>
|
||||
|
||||
<p>Starting with PostgreSQL version 7.4 tsearch2 is now
|
||||
included in the contrib directory with the PostgreSQL sources.
|
||||
contrib/tsearch2 is where you will find everything needed to
|
||||
install and use tsearch2. Please note that tsearch2 will also
|
||||
work with PostgreSQL version 7.3.x, but it is not the module
|
||||
included with the source distribution. You will have to
|
||||
download the module separately and install it in the same
|
||||
fashion.</p>
|
||||
|
||||
<p>I installed the tsearch2 module to a PostgreSQL 7.3 database
from the contrib directory without squashing the original (old)
tsearch module. What I did was move the module's source
directory into the contrib tree under the name tsearchV2.</p>
|
||||
|
||||
<p>Step one is to download the tsearch V2 module:</p>
|
||||
|
||||
<p><a href=
|
||||
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
|
||||
(check Development History for latest stable version !)</p>
|
||||
<pre>
|
||||
tar -zxvf tsearch-v2.tar.gz
|
||||
mv tsearch2 PGSQL_SRC/contrib/
|
||||
cd PGSQL_SRC/contrib/tsearch2
|
||||
</pre>
|
||||
|
||||
<p>If you are installing from PostgreSQL version 7.4 or higher,
|
||||
you can skip those steps and just change to the
|
||||
contrib/tsearch2 directory in the source tree and continue from
|
||||
there.</p>
|
||||
|
||||
<p>Then continue with the regular building and installation
|
||||
process</p>
|
||||
<pre>
|
||||
gmake
|
||||
gmake install
|
||||
gmake installcheck
|
||||
</pre>
|
||||
|
||||
<p>That is pretty much all you have to do, unless of course you
|
||||
get errors. However if you get those, you better go check with
|
||||
the mailing lists over at <a href=
|
||||
"http://www.postgresql.org">http://www.postgresql.org</a> or
|
||||
<a href=
|
||||
"http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
|
||||
since it's never failed for me.</p>
|
||||
|
||||
<p>The directory in contrib/ and the directory from the
archive are both called tsearch2. Tsearch2 is completely incompatible
with the previous version of tsearch. This means that both
versions can be installed into a single database, and migration
to the new version may be much easier.</p>
|
||||
|
||||
<p>NOTE: the previous version of tsearch found in the
contrib/tsearch directory is deprecated. Although it is still
available and included within PostgreSQL version 7.4, it will
be removed in version 7.5.</p>
|
||||
|
||||
<h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
|
||||
|
||||
<p>We should create a database to use as an example for the
|
||||
remainder of this file. We can call the database "ftstest". You
|
||||
can create it from the command line like this:</p>
|
||||
<pre>
|
||||
#createdb ftstest
|
||||
</pre>
|
||||
|
||||
<p>If you thought installation was easy, this next bit is even
|
||||
easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
|
||||
type:</p>
|
||||
<pre>
|
||||
psql ftstest < tsearch2.sql
|
||||
</pre>
|
||||
|
||||
<p>The file "tsearch2.sql" holds all the wonderful little
|
||||
goodies you need to do full text indexing. It defines numerous
|
||||
functions and operators, and creates the needed tables in the
|
||||
database. There will be 4 new tables created after running the
tsearch2.sql file: pg_ts_dict, pg_ts_parser, pg_ts_cfg, and
pg_ts_cfgmap.</p>
|
||||
|
||||
<p>You can check out the tables if you like:</p>
|
||||
<pre>
|
||||
#psql ftstest
|
||||
ftstest=# \d
|
||||
List of relations
|
||||
Schema | Name | Type | Owner
|
||||
--------+--------------+-------+----------
|
||||
public | pg_ts_cfg | table | kopciuch
|
||||
public | pg_ts_cfgmap | table | kopciuch
|
||||
public | pg_ts_dict | table | kopciuch
|
||||
public | pg_ts_parser | table | kopciuch
|
||||
(4 rows)
|
||||
</pre>
|
||||
|
||||
<h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
|
||||
|
||||
<p>The first thing we can do is try out some of the types that
|
||||
are provided for us. Let's look at the tsvector type provided
|
||||
for us:</p>
|
||||
<pre>
|
||||
SELECT 'Our first string used today'::tsvector;
|
||||
tsvector
|
||||
---------------------------------------
|
||||
'Our' 'used' 'first' 'today' 'string'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The results are the words used within our string. Notice
|
||||
they are not in any particular order. The tsvector type returns
|
||||
a string of space separated words.</p>
|
||||
<pre>
|
||||
SELECT 'Our first string used today first string'::tsvector;
|
||||
tsvector
|
||||
-----------------------------------------------
|
||||
'Our' 'used' 'first' 'today' 'string'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Notice the results string has each unique word ('first' and
|
||||
'string' only appear once in the tsvector value). Which of
|
||||
course makes sense if you are searching the full text ... you
|
||||
only need to know each unique word in the text.</p>
|
||||
|
||||
<p>Those examples were just casting a text field to that of
|
||||
type tsvector. Let's check out one of the new functions created
|
||||
by the tsearch2 module.</p>
|
||||
|
||||
<p>The function to_tsvector has 3 possible signatures:</p>
|
||||
<pre>
|
||||
to_tsvector(oid, text);
|
||||
to_tsvector(text, text);
|
||||
to_tsvector(text);
|
||||
</pre>
|
||||
|
||||
<p>We will use the second method using two text fields. The
|
||||
overloaded methods provide us with a way to specify the way
|
||||
the searchable text is broken up into words (Stemming process).
|
||||
Right now we will specify the 'default' configuration. See the
|
||||
section on TSEARCH2 CONFIGURATION to learn more about this.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default',
|
||||
'Our first string used today first string');
|
||||
to_tsvector
|
||||
--------------------------------------------
|
||||
'use':4 'first':2,6 'today':5 'string':3,7
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The result returned from this function is of type tsvector.
|
||||
The results came about by this reasoning: All of the words in
|
||||
the text passed in are stemmed, or not used because they are
|
||||
stop words defined in our configuration. Each lower-cased,
morphed word is returned with all of its positions in the
text.</p>
|
||||
|
||||
<p>In this case the word "Our" is a stop word in the default
configuration. That means it will not be included in the
result. The word "first" is found at positions 2 and 6
(although "Our" is a stop word, its position is maintained).
Word positions are maintained exactly as in the
original string. The word "used" is morphed to the word "use"
based on the default configuration for word stemming, and is
found at position 4. The rest of the results follow the same
logic. Just a reminder again ... the words in the output are
not listed in any particular order (i.e. 'use':4
appears first).</p>
|
||||
|
||||
<p>If you want to view the output of the tsvector fields
|
||||
without their positions, you can do so with the function
|
||||
"strip(tsvector)".</p>
|
||||
<pre>
|
||||
SELECT strip(to_tsvector('default',
|
||||
'Our first string used today first string'));
|
||||
strip
|
||||
--------------------------------
|
||||
'use' 'first' 'today' 'string'
|
||||
</pre>
|
||||
|
||||
<p>If you wish to know the number of unique words returned in
|
||||
the tsvector you can do so by using the function
|
||||
"length(tsvector)"</p>
|
||||
<pre>
|
||||
SELECT length(to_tsvector('default',
|
||||
'Our first string used today first string'));
|
||||
length
|
||||
--------
|
||||
4
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Let's take a look at the function to_tsquery. It also has 3
signatures which follow the same rationale as the to_tsvector
function:</p>
|
||||
<pre>
|
||||
to_tsquery(oid, text);
|
||||
to_tsquery(text, text);
|
||||
to_tsquery(text);
|
||||
</pre>
|
||||
|
||||
<p>Let's try using the function with a single word:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'word');
|
||||
to_tsquery
|
||||
-----------
|
||||
'word'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>I call the function the same way I would a to_tsvector
|
||||
function, specifying the 'default' configuration for morphing,
|
||||
and the result is the stemmed output 'word'.</p>
|
||||
|
||||
<p>Let's attempt to use the function with a string of multiple
|
||||
words:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'this is many words');
|
||||
ERROR: Syntax error
|
||||
</pre>
|
||||
|
||||
<p>The function can not accept a space separated string. The
|
||||
intention of the to_tsquery function is to return a type of
|
||||
"tsquery" used for searching a tsvector field. What we need to
|
||||
do is search for one to many words with some kind of logic (for
|
||||
now simple boolean).</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'searching|sentence');
|
||||
to_tsquery
|
||||
----------------------
|
||||
'search' | 'sentenc'
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Notice that the words are separated by the boolean
"OR" operator. The text can contain the boolean operators &, |, ! and ()
with their usual meanings.</p>
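<p>For example, the operators can be combined and grouped with
parentheses; a small illustrative query (not one of the original
examples) might look like this:</p>
<pre>
SELECT to_tsquery('default', 'cat & (dog | !mouse)');
</pre>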
|
||||
|
||||
<p>You cannot use words defined as stop words in your
configuration. The function will not fail ... you will just get
no result, and a NOTICE like this:</p>
|
||||
<pre>
|
||||
SELECT to_tsquery('default', 'a|is&not|!the');
|
||||
NOTICE: Query contains only stopword(s)
|
||||
or doesn't contain lexem(s), ignored
|
||||
to_tsquery
|
||||
-----------
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>That is a beginning to using the types and functions
defined in the tsearch2 module. There are many more
functions that I have not touched on. You can read through the
|
||||
tsearch2.sql file built when compiling to get more familiar
|
||||
with what is included.</p>
|
||||
|
||||
<h3>INDEXING FIELDS IN A TABLE</h3>
|
||||
|
||||
<p>The next stage is to add a full text index to an existing
|
||||
table. In this example we already have a table defined as
|
||||
follows:</p>
|
||||
<pre>
|
||||
CREATE TABLE tblMessages
|
||||
(
|
||||
intIndex int4,
|
||||
strTopic varchar(100),
|
||||
strMessage text
|
||||
);
|
||||
</pre>
|
||||
|
||||
<p>We are assuming there are several rows with some kind of
data in them. Any data will do; just do several inserts with
test strings for a topic and a message. Here is some test data
I inserted (yes, I know it's completely useless stuff ;-) but
it will serve our purpose right now).</p>
|
||||
<pre>
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('1', 'Testing Topic', 'Testing message data input');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('3', 'Famous Author', 'Stephen King');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('4', 'Political Topic',
|
||||
'Nelson Mandella is released from prison');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('5', 'Nursery rhyme phrase',
|
||||
'Little jack horner sat in a corner');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('6', 'Gettysburg address quotation',
|
||||
'Four score and seven years ago'
|
||||
' our fathers brought forth on this'
|
||||
' continent a new nation, conceived in'
|
||||
' liberty and dedicated to the proposition'
|
||||
' that all men are created equal');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('7', 'Classic Rock Bands',
|
||||
'Led Zeppelin Grateful Dead and The Sex Pistols');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('8', 'My birth address',
|
||||
'18 Sommervile road, Regina, Saskatchewan');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('9', 'Joke', 'knock knock : who\'s there?'
|
||||
' I will not finish this joke');
|
||||
INSERT INTO tblMessages
|
||||
VALUES ('10', 'Computer information',
|
||||
'My computer is a pentium III 400 mHz'
|
||||
' with 192 megabytes of RAM');
|
||||
</pre>
|
||||
|
||||
<p>The next stage is to create a special text index which we
|
||||
will use for FTI, so we can search our table of messages for
|
||||
words or a phrase. We do this using the SQL command:</p>
|
||||
<pre>
|
||||
ALTER TABLE tblMessages ADD idxFTI tsvector;
|
||||
</pre>
|
||||
|
||||
<p>Note that unlike traditional indexes, this is actually a new
|
||||
field in the same table, which is then used (through the magic
|
||||
of the tsearch2 operators and functions) by a special index we
|
||||
will create in a moment.</p>
|
||||
|
||||
<p>The general rule for the initial insertion of data will
|
||||
follow four steps:</p>
|
||||
<pre>
|
||||
1. update table
|
||||
2. vacuum full analyze
|
||||
3. create index
|
||||
4. vacuum full analyze
|
||||
</pre>
|
||||
|
||||
<p>The data can be updated into the table; the vacuum full
analyze will reclaim unused space. The index can be created on
the table after the data has been inserted. Having the index
created prior to the update will slow down the process. It can
be done in that manner; this way is just more efficient. After
the index has been created on the table, vacuum full analyze is
run again to update postgres's statistics (i.e. to have the index
take effect).</p>
|
||||
<pre>
|
||||
UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p>Note that this only inserts the field strMessage as a
|
||||
tsvector, so if you want to also add strTopic to the
|
||||
information stored, you should instead do the following, which
|
||||
effectively concatenates the two fields into one before being
|
||||
inserted into the table:</p>
|
||||
<pre>
|
||||
UPDATE tblMessages
|
||||
SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p><strong>Using the coalesce function makes sure this
|
||||
concatenation also works with NULL fields.</strong></p>
|
||||
|
||||
<p>We need to create the index on the column idxFTI. Keep in
mind that the database will update the index when some action
is taken. In this case we _need_ the index (the whole point of
Full Text INDEXING ;-)), so don't worry about any indexing
overhead. We will create an index using gist.
GiST stands for Generalized Search Tree.</p>
|
||||
<pre>
|
||||
CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
|
||||
VACUUM FULL ANALYZE;
|
||||
</pre>
|
||||
|
||||
<p>After you have converted all of your data and indexed the
|
||||
column, you can select some rows to see what actually happened.
|
||||
I will not display output here but you can play around
|
||||
yourselves and see what happened.</p>
|
||||
|
||||
<p>The last thing to do is set up a trigger so every time a row
|
||||
in this table is changed, the text index is automatically
|
||||
updated. This is easily done using:</p>
|
||||
<pre>
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
|
||||
</pre>
|
||||
|
||||
<p>Or if you are indexing both strMessage and strTopic you
|
||||
should instead do:</p>
|
||||
<pre>
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE
|
||||
tsearch2(idxFTI, strTopic, strMessage);
|
||||
</pre>
|
||||
|
||||
<p>Before you ask, the tsearch2 function accepts multiple
|
||||
fields as arguments so there is no need to concatenate the two
|
||||
into one like we did before.</p>
|
||||
|
||||
<p>If you want to do something specific with columns, you may
|
||||
write your very own trigger function using plpgsql or other
|
||||
procedural languages (but not SQL, unfortunately) and use it
instead of the <em>tsearch2</em> trigger.</p>
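<p>As a rough sketch (this helper is not part of tsearch2; the
function name is made up for this example, and it assumes the
plpgsql language has been installed in the database, e.g. with
createlang), such a hand-written trigger function might look
like this:</p>
<pre>
CREATE FUNCTION tblmessages_fti_trigger() RETURNS trigger AS '
BEGIN
    -- rebuild the tsvector column from topic and message,
    -- guarding against NULLs with coalesce
    NEW.idxFTI := to_tsvector(''default'',
        coalesce(NEW.strTopic, '''') || '' '' || coalesce(NEW.strMessage, ''''));
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

-- (drop the existing tsvectorupdate trigger first if it is already defined)
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
    FOR EACH ROW EXECUTE PROCEDURE tblmessages_fti_trigger();
</pre>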
|
||||
|
||||
<p>You could, however, call other stored procedures from within
the tsearch2 function. Let's say we want to create a function to
remove certain characters (like the @ symbol) from all
text.</p>
|
||||
<pre>
|
||||
CREATE FUNCTION dropatsymbol(text)
|
||||
RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
|
||||
</pre>
|
||||
|
||||
<p>Now we can use this function within the tsearch2 function on
|
||||
the trigger.</p>
|
||||
<pre>
|
||||
DROP TRIGGER tsvectorupdate ON tblmessages;
|
||||
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
|
||||
FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
|
||||
INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
|
||||
</pre>
|
||||
|
||||
<p>If at this point you receive an error stating: ERROR: Can't
|
||||
find tsearch config by locale</p>
|
||||
|
||||
<p>Do not worry. You have done nothing wrong. And tsearch2 is
|
||||
not broken. All that has happened here is that tsearch2
is set up to use a configuration based on the
locale of the server. All you have to do is change your default
|
||||
configuration, or add a new one for your specific locale. See
|
||||
the section on TSEARCH2 CONFIGURATION.</p>
|
||||
<pre class="real">
|
||||
SELECT * FROM tblmessages WHERE intindex = 69;
|
||||
|
||||
intindex | strtopic | strmessage | idxfti
|
||||
----------+--------------------------+---------------+-----------------------
|
||||
69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
|
||||
(1 row)
|
||||
</pre>

<p>Notice that the string content was passed through the stored
procedure dropatsymbol. The '@' character was replaced with a
single space ... and the output from the procedure was then stored
in the tsvector column.</p>
|
||||
|
||||
<p>This could be useful for removing other characters from
|
||||
indexed text, or any kind of preprocessing needed to be done on
|
||||
the text prior to insertion into the index.</p>
|
||||
|
||||
<h3>QUERYING A TABLE</h3>
|
||||
|
||||
<p>There are some examples in the README.tsearch2 file for
|
||||
querying a table. One major difference between tsearch and
|
||||
tsearch2 is the operator ## is no longer available. Only the
|
||||
operator @@ is defined, using the types tsvector on one side
|
||||
and tsquery on the other side.</p>
|
||||
|
||||
<p>Let's search the indexed data for the word "Test". I indexed
based on the concatenation of strTopic and
strMessage:</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ 'test'::tsquery;
|
||||
intindex | strtopic
|
||||
----------+---------------
|
||||
1 | Testing Topic
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>The only result that matched was the row with a topic
|
||||
"Testing Topic". Notice that the word I search for was all
|
||||
lowercase. Let's see what happens when I query for uppercase
|
||||
"Test".</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ 'Test'::tsquery;
|
||||
intindex | strtopic
|
||||
----------+----------
|
||||
(0 rows)
|
||||
</pre>
|
||||
|
||||
<p>We get zero rows returned. The reason is that when the
text was inserted, it was morphed to my default configuration
(because of the call to to_tsvector in the UPDATE statement).
If there had been no morphing done, and the tsvector field(s)
contained the word 'Test', a match would have been found.</p>
|
||||
|
||||
<p>Most likely the best way to query the field is to use the
|
||||
to_tsquery function on the right hand side of the @@ operator
|
||||
like this:</p>
|
||||
<pre>
|
||||
SELECT intindex, strtopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
|
||||
intindex | strtopic
|
||||
----------+--------------------
|
||||
1 | Testing Topic
|
||||
7 | Classic Rock Bands
|
||||
(2 rows)
|
||||
</pre>
|
||||
|
||||
<p>That query searched for all instances of "Test" OR
|
||||
"Zeppelin". It returned two rows: the "Testing Topic" row, and
|
||||
the "Classic Rock Bands" row. The to_tsquery function performed
|
||||
the correct morphology upon the parameters, and searched the
|
||||
tsvector field appropriately.</p>
|
||||
|
||||
<p>The last example here relates to searching for a phrase, for
|
||||
example "minority report". This poses a problem with regard to
|
||||
tsearch2, as it doesn't index phrases, only words. But there is
a way around this which doesn't appear to have a significant impact
on query time, and that is to use a query such as the
|
||||
following:</p>
|
||||
<pre>
|
||||
SELECT intindex, strTopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
||||
AND strMessage ~* '.*men are created equal.*';
|
||||
intindex | strtopic
|
||||
----------+------------------------------
|
||||
6 | Gettysburg address quotation
|
||||
(1 row)
|
||||
SELECT intindex, strTopic FROM tblmessages
|
||||
WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
|
||||
AND strMessage ~* '.*something that does not exist.*';
|
||||
intindex | strtopic
|
||||
----------+----------
|
||||
(0 rows)
|
||||
</pre>
|
||||
|
||||
<p>Of course if you're indexing both strTopic and strMessage, and
want to search for this phrase on both, then you will have to
get out the brackets and extend this query a little more.</p>
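<p>A sketch of such an extended query (using the same sample data
and phrase as above, and simply checking the regular expression
against both columns) might look like this:</p>
<pre>
SELECT intindex, strTopic FROM tblmessages
    WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
    AND (strTopic ~* '.*men are created equal.*'
         OR strMessage ~* '.*men are created equal.*');
</pre>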
|
||||
|
||||
<h3>TSEARCH2 CONFIGURATION</h3>
|
||||
|
||||
<p>Some words such as "and", "the", and "who" are automatically
|
||||
not indexed, since they belong to a pre-existing dictionary of
|
||||
"Stop Words" which tsearch2 does not perform indexing on. If
|
||||
someone needs to search for "The Who" in your database, they
|
||||
are going to have a tough time coming up with any results,
|
||||
since both are ignored in the indexes. But there is a
|
||||
solution.</p>
|
||||
|
||||
<p>Let's say we want to add a word to the stop word list for
English stemming. We could edit the file
'/usr/local/pgsql/share/english.stop' and add a word to the
list. I edited mine to exclude my name from indexing:</p>
|
||||
<pre>
|
||||
- Edit /usr/local/pgsql/share/english.stop
|
||||
- Add 'andy' to the list
|
||||
- Save the file.
|
||||
</pre>
|
||||
|
||||
<p>When you connect to the database, the dict_init procedure is
|
||||
run during initialization. And in my configuration it will read
|
||||
the stop words from the file I just edited. If you were
|
||||
connected to the DB while editing the stop words, you will need
|
||||
to end the current session and re-connect. When you re-connect
|
||||
to the database, 'andy' is no longer indexed:</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Andy');
|
||||
to_tsvector
|
||||
------------
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>Originally I would get the result :</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Andy');
|
||||
to_tsvector
|
||||
------------
|
||||
'andi':1
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>But since I added it as a stop word, it is ignored during
indexing. The stop word was added for the dictionary
"en_stem". If I were to use a different configuration such as
'simple', the results would be different. There are no stop
words for the simple dictionary. It will just convert to lower
case, and index every unique word.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('simple', 'Andy andy The the in out');
|
||||
to_tsvector
|
||||
-------------------------------------
|
||||
'in':5 'out':6 'the':3,4 'andy':1,2
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>All this talk about which configuration to use is leading us
|
||||
into the actual configuration of tsearch2. In the examples in
|
||||
this document the configuration has always been specified when
|
||||
using the tsearch2 functions:</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default', 'Testing the default config');
|
||||
SELECT to_tsvector('simple', 'Example of simple Config');
|
||||
</pre>
|
||||
|
||||
<p>The pg_ts_cfg table holds each configuration you can use
with the tsearch2 functions. As you can see, the ts_name column
contains the locale-based 'default' configurations as well as
the 'simple' configuration, which is not based on
any locale.</p>
|
||||
<pre>
|
||||
SELECT * from pg_ts_cfg;
|
||||
ts_name | prs_name | locale
|
||||
-----------------+----------+--------------
|
||||
default | default | C
|
||||
default_russian | default | ru_RU.KOI8-R
|
||||
simple | default |
|
||||
(3 rows)
|
||||
</pre>
|
||||
|
||||
<p>Each row in the pg_ts_cfg table contains the name of the
|
||||
tsearch2 configuration, the name of the parser to use, and the
|
||||
locale mapped to the configuration. There is only one parser to
|
||||
choose from the table pg_ts_parser called 'default'. More
|
||||
parsers could be written, but for our needs we will use the
|
||||
default.</p>
|
||||
|
||||
<p>There are 3 configurations installed by tsearch2 initially.
|
||||
If your locale is set to 'en_US' for example (like my laptop),
then as you can see there is currently no configuration mapped
to that locale. You can either set up a new
|
||||
configuration or just use one that already exists. If I do not
|
||||
specify which configuration to use in the to_tsvector function,
|
||||
I receive the following error.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('learning tsearch is like going to school');
|
||||
ERROR: Can't find tsearch config by locale
|
||||
</pre>
|
||||
|
||||
<p>We will create a new configuration for use with the server
locale 'en_US'. The first step is to add a new configuration
|
||||
into the pg_ts_cfg table. We will call the configuration
|
||||
'default_english', with the default parser and use the locale
|
||||
'en_US'.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
|
||||
VALUES ('default_english', 'default', 'en_US');
|
||||
</pre>
|
||||
|
||||
<p>We have only declared that there is a configuration called
|
||||
'default_english'. We need to set the configuration of how
|
||||
'default_english' will work. The next step is creating a new
|
||||
dictionary to use. The configuration of the dictionary is
completely different in tsearch2. In prior versions, to
make changes you would have to re-compile your changes into
the tsearch.so. All of the configuration has now been moved
into the system tables created by executing the SQL code from
tsearch2.sql.</p>
|
||||
|
||||
<p>Let's take a first look at the pg_ts_dict table:</p>
|
||||
<pre>
|
||||
ftstest=# \d pg_ts_dict
|
||||
Table "public.pg_ts_dict"
|
||||
Column | Type | Modifiers
|
||||
-----------------+---------+-----------
|
||||
dict_name | text | not null
|
||||
dict_init | oid |
|
||||
dict_initoption | text |
|
||||
dict_lemmatize | oid | not null
|
||||
dict_comment | text |
|
||||
Indexes: pg_ts_dict_idx unique btree (dict_name)
|
||||
</pre>
|
||||
|
||||
<p>The dict_name column is the name of the dictionary, for
|
||||
example 'simple', 'en_stem' or 'ru_stem'. The dict_init column
|
||||
is an OID of a stored procedure to run for initialization of
|
||||
that dictionary, for example 'snb_en_init' or 'snb_ru_init'.
|
||||
The dict_initoption column holds options passed to the init
function for the stored procedure. In the cases of 'en_stem' or
'ru_stem' it is a path to a stopword file for that dictionary,
for example '/usr/local/pgsql/share/english.stop'. This is
however dictated by the dictionary; ISpell dictionaries may
require different options. The dict_lemmatize column is the
OID of the stored procedure used to lemmatize,
for example 'snb_lemmatize'. The dict_comment column is just a
comment.</p>
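<p>You can inspect these entries yourself, for example (the exact
paths and OIDs will of course differ on your installation):</p>
<pre>
SELECT dict_name, dict_initoption
    FROM pg_ts_dict
    WHERE dict_name IN ('simple', 'en_stem', 'ru_stem');
</pre>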
|
||||
|
||||
<p>Next we will configure the use of a new dictionary based on
ISpell. We will assume you have ISpell installed on your
machine (in /usr/local/lib).</p>
|
||||
|
||||
<p>First let's register the dictionary(ies) to use from ISpell.
We will use the English dictionary from ISpell. We insert the
paths to the relevant ISpell dictionary (*.hash) and affixes
(*.aff) files. There seems to be some question as to which
ISpell files are to be used. I installed ISpell from the latest
sources on my computer. The installation installed the
dictionary files with an extension of *.hash. Some
installations install with an extension of *.dict. As far as I
know the two extensions are equivalent, so *.hash ==
*.dict.</p>
|
||||
|
||||
<p>We will also continue to use the English stop word file that
was installed for the en_stem dictionary. You could use a
|
||||
different one if you like. The ISpell configuration is based on
|
||||
the "ispell_template" dictionary installed by default with
|
||||
tsearch2. We will use the OIDs to the stored procedures from
|
||||
the row where the dict_name = 'ispell_template'.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_dict
|
||||
(SELECT 'en_ispell',
|
||||
dict_init,
|
||||
'DictFile="/usr/local/lib/english.hash",'
|
||||
'AffFile="/usr/local/lib/english.aff",'
|
||||
'StopFile="/usr/local/pgsql/share/english.stop"',
|
||||
dict_lexize
|
||||
FROM pg_ts_dict
|
||||
WHERE dict_name = 'ispell_template');
|
||||
</pre>
|
||||
|
||||
<p>Next we need to set up the configuration for mapping
dictionary use to the lexeme types produced by the parser. This will
be done by altering the pg_ts_cfgmap table. We will insert several rows,
specifying the use of the new dictionary we installed and
configured for use within tsearch2. There are several types of
lexemes for which we want to force the use of the ISpell
dictionary.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
|
||||
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
|
||||
VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
|
||||
</pre>
|
||||
|
||||
<p>We have just inserted 3 records into the configuration
mapping, specifying that the lexeme types "lhword,
lpart_hword and lword" are to be stemmed using the 'en_ispell'
dictionary we added into pg_ts_dict, when using the
configuration 'default_english' which we added to
pg_ts_cfg.</p>
|
||||
|
||||
<p>There are several other lexeme types used that we do not need
to map to the ISpell dictionary. We can simply insert
values using the 'simple' dictionary.</p>
|
||||
<pre>
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'url', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'host', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'sfloat', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'uri', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'int', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'float', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'email', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'word', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlpart_hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'part_hword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'nlhword', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'file', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'uint', '{simple}');
|
||||
INSERT INTO pg_ts_cfgmap
|
||||
VALUES ('default_english', 'version', '{simple}');
|
||||
</pre>
|
||||
|
||||
<p>Our addition of a configuration for 'default_english' is now
|
||||
complete. We have successfully created a new tsearch2
|
||||
configuration. At the same time we have also set the new
|
||||
configuration to be our default for the en_US locale.</p>
|
||||
<pre>
|
||||
SELECT to_tsvector('default_english',
|
||||
'learning tsearch is like going to school');
|
||||
to_tsvector
|
||||
--------------------------------------------------
|
||||
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
||||
SELECT to_tsvector('learning tsearch is like going to school');
|
||||
to_tsvector
|
||||
--------------------------------------------------
|
||||
'go':5 'like':4 'learn':1 'school':7 'tsearch':2
|
||||
(1 row)
|
||||
</pre>
|
||||
|
||||
<p>If you already have a configuration set for
the locale, and you are changing it to your new dictionary
configuration, you will have to set the old locale to NULL. If
we are using the 'C' locale then we would do this:</p>
|
||||
<pre>
|
||||
UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
|
||||
</pre>
|
||||
|
||||
<p>That about wraps up the configuration of tsearch2. There is
|
||||
much more you can do with the tables provided. This was just an
|
||||
introduction to get things working rather quickly.</p>
|
||||
|
||||
<h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
|
||||
|
||||
<p>To aid in the addition of new dictionaries to the tsearch2
|
||||
module you can use another additional module in combination
|
||||
with tsearch2. The gendict module is included in the tsearch2
distribution and is available in the gendict/ subdirectory.</p>
|
||||
|
||||
<p>I will not go into detail about installation and
|
||||
instructions on how to use gendict to its fullest extent right
|
||||
now. You can read the README.gendict ... it has all of the
|
||||
instructions and information you will need.</p>
|
||||
|
||||
<h3>BACKING UP AND RESTORING DATABASES THAT FEATURE
|
||||
TSEARCH2</h3>
|
||||
|
||||
<p>Believe it or not, this isn't as straightforward as it
should be, and you will have problems trying to back up and
restore any database which uses tsearch2 unless you take the
steps shown below. And before you ask, using pg_dumpall will
result in failure every time. These took a lot of trial and
|
||||
error to get working, but the process as laid down below has
|
||||
been used a dozen times now in live production environments so
|
||||
it should work fine.</p>
|
||||
|
||||
<p>HOWEVER never rely on anyone else's instructions to back up
and restore a database system; always develop and understand
|
||||
your own methodology, and test it numerous times before you
|
||||
need to do it for real.</p>
|
||||
|
||||
<p>To Backup a PostgreSQL database that uses the tsearch2
|
||||
module:</p>
|
||||
|
||||
<p>1) Backup any global database objects such as users and
|
||||
groups (this step is usually only necessary when you will be
|
||||
restoring to a virgin system)</p>
|
||||
<pre>
|
||||
pg_dumpall -g > GLOBALobjects.sql
|
||||
</pre>
|
||||
|
||||
<p>2) Backup the full database schema using pg_dump</p>
|
||||
<pre>
|
||||
pg_dump -s DATABASE > DATABASEschema.sql
|
||||
</pre>
|
||||
|
||||
<p>3) Backup the full database using pg_dump</p>
|
||||
<pre>
|
||||
pg_dump -Fc DATABASE > DATABASEdata.tar
|
||||
</pre>
|
||||
|
||||
<p>To Restore a PostgreSQL database that uses the tsearch2
|
||||
module:</p>
|
||||
|
||||
<p>1) Create the blank database</p>
|
||||
<pre>
|
||||
createdb DATABASE
|
||||
</pre>
|
||||
|
||||
<p>2) Restore any global database objects such as users and
|
||||
groups (this step is usually only necessary when you will be
|
||||
restoring to a virgin system)</p>
|
||||
<pre>
|
||||
psql DATABASE < GLOBALobjects.sql
|
||||
</pre>
|
||||
|
||||
<p>3) Create the tsearch2 objects, functions and operators</p>
|
||||
<pre>
|
||||
psql DATABASE < tsearch2.sql
|
||||
</pre>
|
||||
|
||||
<p>4) Edit the backed up database schema and delete all SQL
|
||||
commands which create tsearch2 related functions, operators and
|
||||
data types, BUT NOT fields in table definitions that specify
|
||||
tsvector types. If you're not sure what these are, they are the
|
||||
ones listed in tsearch2.sql. Then restore the edited schema to
|
||||
the database</p>
|
||||
<pre>
|
||||
psql DATABASE < DATABASEschema.sql
|
||||
</pre>
|
||||
|
||||
<p>5) Restore the data for the database</p>
|
||||
<pre>
|
||||
pg_restore -N -a -d DATABASE DATABASEdata.tar
|
||||
</pre>
|
||||
|
||||
<p>If you get any errors in step 4, it will most likely be
|
||||
because you forgot to remove an object that was created in
|
||||
tsearch2.sql. Any errors in step 5 will mean the database
|
||||
schema was probably restored wrongly.</p>
|
||||
</div>
|
||||
</body>
|
||||
</html>
<!-- contrib/tsearch2/docs/tsearch2-guide.html: diff suppressed (file too large) -->
<!-- contrib/tsearch2/docs/tsearch2-ref.html -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
||||
<html>
|
||||
<head>
|
||||
<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
|
||||
<title>tsearch2 reference</title>
|
||||
</head>
|
||||
<body>
|
||||
<h1 align=center>The tsearch2 Reference</h1>
|
||||
|
||||
<p align=center>
|
||||
Brandon Craig Rhodes<br>30 June 2003
|
||||
<p>
|
||||
This Reference documents the user types and functions
|
||||
of the tsearch2 module for PostgreSQL.
|
||||
An introduction to the module is provided
|
||||
by the <a href="tsearch2-guide.html">tsearch2 Guide</a>,
|
||||
a companion document to this one.
|
||||
You can retrieve a beta copy of the tsearch2 module from the
|
||||
<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
|
||||
page — look under the section entitled <i>Development History</i>
|
||||
for the current version.
|
||||
|
||||
<h2><a name="vq">Vectors and Queries</h2>
|
||||
|
||||
Vectors and queries both store lexemes,
|
||||
but for different purposes.
|
||||
A <tt>tsvector</tt> stores the lexemes
|
||||
of the words that are parsed out of a document,
|
||||
and can also remember the position of each word.
|
||||
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
|
||||
<p>
|
||||
Any of the following functions with a <tt><i>configuration</i></tt> argument
|
||||
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
|
||||
to select a configuration;
|
||||
if the option is omitted, then the current configuration is used.
|
||||
For more information on the current configuration,
|
||||
read the next section on Configurations.
|
||||
|
||||
<h3>Vector Operations</h3>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
|
||||
<i>document</i> TEXT) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Parses a document into tokens,
|
||||
reduces the tokens to lexemes,
|
||||
and returns a <tt>tsvector</tt> which lists the lexemes
|
||||
together with their positions in the document.
|
||||
For the best description of this process,
|
||||
see the section on <a href="tsearch2-guide.html#ps">Parsing and Stemming</a>
|
||||
in the accompanying tsearch2 Guide.
|
||||
<dt>
|
||||
<tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Return a vector which lists the same lexemes
|
||||
as the given <tt><i>vector</i></tt>,
|
||||
but which lacks any information
|
||||
about where in the document each lexeme appeared.
|
||||
While the returned vector is thus useless for relevance ranking,
|
||||
it will usually be much smaller.
|
||||
<dt>
|
||||
<tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
|
||||
<dd>
|
||||
This function returns a copy of the input vector
|
||||
in which every location has been labelled
|
||||
with either the <tt><i>letter</i></tt>
|
||||
<tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
|
||||
or the default label <tt>'D'</tt>
|
||||
(which is the default with which new vectors are created,
|
||||
and as such is usually not displayed).
|
||||
These labels are retained when vectors are concatenated,
|
||||
allowing words from different parts of a document
|
||||
to be weighted differently by ranking functions.
|
||||
<dt>
|
||||
<tt><i>vector1</i> || <i>vector2</i></tt>
|
||||
<dt class=br>
|
||||
<tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
|
||||
RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Returns a vector which combines the lexemes and position information
|
||||
in the two vectors given as arguments.
|
||||
Position weight labels (described in the previous paragraph)
|
||||
are retained intact during the concatenation.
|
||||
This has at least two uses.
|
||||
First,
|
||||
if some sections of your document
|
||||
need be parsed with different configurations than others,
|
||||
you can parse them separately
|
||||
and concatenate the resulting vectors into one.
|
||||
Second,
|
||||
you can weight words from some sections of your document
|
||||
more heavily than those from others by:
|
||||
parsing the sections into separate vectors;
|
||||
assigning the vectors different position labels
|
||||
with the <tt>setweight()</tt> function;
|
||||
concatenating them into a single vector;
|
||||
and then providing a <tt><i>weights</i></tt> argument
|
||||
to the <tt>rank()</tt> function
|
||||
that assigns different weights to positions with different labels.
|
||||
<dt>
|
||||
<tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
|
||||
<dd>
|
||||
Returns the number of lexemes stored in the vector.
|
||||
<dt>
|
||||
<tt><i>text</i>::tsvector RETURNS tsvector</tt>
|
||||
<dd>
|
||||
Directly casting text to a <tt>tsvector</tt>
|
||||
allows you to directly inject lexemes into a vector,
|
||||
with whatever positions and position weights you choose to specify.
|
||||
The <tt><i>text</i></tt> should be formatted
|
||||
like the vector would be printed by the output of a <tt>SELECT</tt>.
|
||||
See the <a href="tsearch2-guide.html#casting">Casting</a>
|
||||
section in the Guide for details.
|
||||
</dl>
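<p>
A small illustration of these vector operations (an example of our
own, not taken from the Guide; it assumes the 'default' configuration
is installed):
<pre>
SELECT setweight(to_tsvector('default', 'fat cats ate rats'), 'A') ||
       strip(to_tsvector('default', 'thin rats hid'));
</pre>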
|
||||
|
||||
<h3>Query Operations</h3>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
|
||||
<i>querytext</i> text) RETURNS tsquery</tt>
|
||||
<dd>
|
||||
Parses a query,
|
||||
which should be single words separated by the boolean operators
|
||||
“<tt>&</tt>” and,
|
||||
“<tt>|</tt>” or,
|
||||
and “<tt>!</tt>” not,
|
||||
which can be grouped using parentheses.
|
||||
Each word is reduced to a lexeme using the current
|
||||
or specified configuration.
|
||||
|
||||
<dt>
|
||||
<tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
|
||||
<dd>
|
||||
This might return a textual representation of the given query.
|
||||
<dt>
|
||||
<tt><i>text</i>::tsquery RETURNS tsquery</tt>
|
||||
<dd>
|
||||
Directly casting text to a <tt>tsquery</tt>
|
||||
allows you to directly inject lexemes into a query,
|
||||
with whatever positions and position weight flags you choose to specify.
|
||||
The <tt><i>text</i></tt> should be formatted
|
||||
like the query would be printed by the output of a <tt>SELECT</tt>.
|
||||
See the <a href="tsearch2-guide.html#casting">Casting</a>
|
||||
section in the Guide for details.
|
||||
</dl>
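<p>
For example, assuming the 'default' configuration exists, a query
value can be built and examined like this (an illustration of our own):
<pre>
SELECT to_tsquery('default', 'cats & (rats | !mice)');
SELECT querytree(to_tsquery('default', 'cats & rats'));
</pre>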
|
||||
|
||||
<h2><a name="configurations">Configurations</a></h2>
|
||||
|
||||
A configuration specifies all of the equipment necessary
|
||||
to transform a document into a <tt>tsvector</tt>:
|
||||
the parser that breaks its text into tokens,
|
||||
and the dictionaries which then transform each token into a lexeme.
|
||||
Every call to <tt>to_tsvector()</tt> (described above)
|
||||
uses a configuration to perform its processing.
|
||||
Three configurations come with tsearch2:
|
||||
|
||||
<ul>
|
||||
<li><b>default</b> — Indexes words and numbers,
|
||||
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
||||
and the <i>simple</i> dictionary for all others.
|
||||
<li><b>default_russian</b> — Indexes words and numbers,
|
||||
using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
|
||||
and the <i>ru_stem</i> Russian Snowball dictionary for all others.
|
||||
<li><b>simple</b> — Processes both words and numbers
|
||||
with the <i>simple</i> dictionary,
|
||||
which neither discards any stop words nor alters them.
|
||||
</ul>
|
||||
|
||||
The tsearch2 module initially chooses your current configuration
|
||||
by looking for your current locale in the <tt>locale</tt> field
|
||||
of the <tt>pg_ts_cfg</tt> table described below.
|
||||
You can manipulate the current configuration yourself with these functions:
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
|
||||
) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Set the current configuration used by <tt>to_tsvector</tt>
|
||||
and <tt>to_tsquery</tt>.
|
||||
<dt>
|
||||
<tt>show_curcfg() RETURNS INT4</tt>
|
||||
<dd>
|
||||
Returns the integer <tt>id</tt> of the current configuration.
|
||||
</dl>
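<p>
For example (assuming the configurations listed above are installed):
<pre>
SELECT set_curcfg('default');
SELECT show_curcfg();
</pre>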
|
||||
|
||||
<p>
|
||||
Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:
|
||||
|
||||
<pre>create table pg_ts_cfg (
|
||||
id int not null primary key,
|
||||
ts_name text not null,
|
||||
prs_name text not null,
|
||||
locale text
|
||||
);</pre>
|
||||
|
||||
The <tt>id</tt> and <tt>ts_name</tt> are unique values
|
||||
which identify the configuration;
|
||||
the <tt>prs_name</tt> specifies which parser the configuration uses.
|
||||
Once this parser has split document text into tokens,
|
||||
the type of each resulting token —
|
||||
or, more specifically, the type's <tt>lex_alias</tt>
|
||||
as specified in the parser's <tt>lexem_type()</tt> table —
|
||||
is searched for together with the configuration's <tt>ts_name</tt>
|
||||
in the <tt>pg_ts_cfgmap</tt> table:
|
||||
|
||||
<pre>create table pg_ts_cfgmap (
|
||||
ts_name text not null,
|
||||
lex_alias text not null,
|
||||
dict_name text[],
|
||||
primary key (ts_name,lex_alias)
|
||||
);</pre>
|
||||
|
||||
Those tokens whose types are not listed are discarded.
|
||||
The remaining tokens are assigned integer positions,
|
||||
starting with 1 for the first token in the document,
|
||||
and turned into lexemes with the help of the dictionaries
|
||||
whose names are given in the <tt>dict_name</tt> array for their type.
|
||||
These dictionaries are tried in order,
|
||||
stopping either with the first one to return a lexeme for the token,
|
||||
or discarding the token if no dictionary returns a lexeme for it.
|
||||
|
||||
<h2><a name="dictionaries">Parsers</a></h2>
|
||||
|
||||
Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:
|
||||
|
||||
<pre>create table pg_ts_parser (
|
||||
prs_id int not null primary key,
|
||||
prs_name text not null,
|
||||
prs_start oid not null,
|
||||
prs_getlexem oid not null,
|
||||
prs_end oid not null,
|
||||
prs_headline oid not null,
|
||||
prs_lextype oid not null,
|
||||
prs_comment text
|
||||
);</pre>
|
||||
|
||||
The <tt>prs_id</tt> and <tt>prs_name</tt> uniquely identify the parser,
|
||||
while <tt>prs_comment</tt> usually describes its name and version
|
||||
for the reference of users.
|
||||
The other items identify the low-level functions
|
||||
which make the parser operate,
|
||||
and are only of interest to someone writing a parser of their own.
|
||||
<p>
|
||||
The tsearch2 module comes with one parser named <tt>default</tt>
|
||||
which is suitable for parsing most plain text and HTML documents.
|
||||
<p>
|
||||
Each <tt><i>parser</i></tt> argument below
|
||||
must designate a parser with either an integer <tt><i>prs_id</i></tt>
|
||||
or a textual <tt><i>prs_name</i></tt>;
|
||||
the current parser is used when this argument is omitted.
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Selects a current parser
|
||||
which will be used when any of the following functions
|
||||
are called without a parser as an argument.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION lexem_type(
|
||||
<em>[</em> <i>parser</i> <em>]</em>
|
||||
) RETURNS SETOF lexemtype</tt>
|
||||
<dd>
|
||||
Returns a table which defines and describes
|
||||
each kind of token the parser may produce as output.
|
||||
For each token type the table gives the <tt>lexid</tt>
|
||||
which the parser will label each token of that type,
|
||||
the <tt>alias</tt> which names the token type,
|
||||
and a short description <tt>descr</tt> for the user to read.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION parse(
|
||||
<em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
|
||||
) RETURNS SETOF lexemtype</tt>
|
||||
<dd>
|
||||
Parses the given document and returns a series of records,
|
||||
one for each token produced by parsing.
|
||||
Each token includes a <tt>lexid</tt> giving its type
|
||||
and a <tt>lexem</tt> which gives its content.
|
||||
</dl>
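<p>
A brief illustration of the parser functions (an example of our own;
output is omitted, since the token numbering depends on the parser
version):
<pre>
SELECT * FROM lexem_type('default');
SELECT * FROM parse('default', 'A simple sentence to parse.');
</pre>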
|
||||
|
||||
<h2><a name="dictionaries">Dictionaries</a></h2>
|
||||
|
||||
Dictionaries take textual tokens as input,
|
||||
usually those produced by a parser,
|
||||
and return lexemes which are usually some reduced form of the token.
|
||||
Among the dictionaries which come installed with tsearch2 are:
|
||||
|
||||
<ul>
|
||||
<li><b>simple</b> simply folds uppercase letters to lowercase
|
||||
before returning the word.
|
||||
<li><b>en_stem</b> runs an English Snowball stemmer on each word
|
||||
that attempts to reduce the various forms of a verb or noun
|
||||
to a single recognizable form.
|
||||
<li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
|
||||
</ul>
|
||||
|
||||
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
|
||||
|
||||
<pre>CREATE TABLE pg_ts_dict (
|
||||
dict_id int not null primary key,
|
||||
dict_name text not null,
|
||||
dict_init oid,
|
||||
dict_initoption text,
|
||||
dict_lemmatize oid not null,
|
||||
dict_comment text
|
||||
);</pre>
|
||||
|
||||
The <tt>dict_id</tt> and <tt>dict_name</tt>
|
||||
serve as unique identifiers for the dictionary.
|
||||
The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
|
||||
but for the built-in Snowball dictionaries
|
||||
it specifies a file from which stop words should be read.
|
||||
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
|
||||
The other fields are internal function identifiers
|
||||
useful only to developers trying to implement their own dictionaries.
|
||||
<p>
|
||||
The argument named <tt><i>dictionary</i></tt>
|
||||
in each of the following functions
|
||||
should be either an integer <tt>dict_id</tt> or a textual <tt>dict_name</tt>
|
||||
identifying which dictionary should be used for the operation;
|
||||
if omitted then the current dictionary is used.
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
|
||||
<dd>
|
||||
Selects a current dictionary for use by functions
|
||||
that do not select a dictionary explicitly.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION lexize(
|
||||
<em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
|
||||
RETURNS TEXT[]</tt>
|
||||
<dd>
|
||||
Reduces a single word to a lexeme.
|
||||
Note that lexemes are arrays of zero or more strings,
|
||||
since in some languages there might be several base words
|
||||
from which an inflected form could arise.
|
||||
</dl>
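<p>
For example, reducing a single word with two of the dictionaries
listed above (an illustration of our own):
<pre>
SELECT lexize('en_stem', 'stars');
SELECT lexize('simple', 'Stars');
</pre>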
|
||||
|
||||
<h2><a name="ranking">Ranking</a></h2>
|
||||
|
||||
Ranking attempts to measure how relevant documents are to particular queries
|
||||
by inspecting the number of times each search word appears in the document,
|
||||
and whether different search terms occur near each other.
|
||||
Note that this information is only available in unstripped vectors —
|
||||
ranking functions will only return a useful result
|
||||
for a <tt>tsvector</tt> which still has position information!
|
||||
<p>
|
||||
Both of these ranking functions
|
||||
take an integer <i>normalization</i> option
|
||||
that specifies whether a document's length should impact its rank.
|
||||
This is often desirable,
|
||||
since a hundred-word document with five instances of a search word
|
||||
is probably more relevant than a thousand-word document with five instances.
|
||||
The option can have the values:
|
||||
|
||||
<ul>
|
||||
<li><tt>0</tt> (the default) ignores document length.
|
||||
<li><tt>1</tt> divides the rank by the logarithm of the length.
|
||||
<li><tt>2</tt> divides the rank by the length itself.
|
||||
</ul>
|
||||
|
||||
The two ranking functions currently available are:
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION rank(<br>
|
||||
<em>[</em> <i>weights</i> float4[], <em>]</em>
|
||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||
) RETURNS float4</tt>
|
||||
<dd>
|
||||
This is the ranking function from the old version of OpenFTS,
|
||||
and offers the ability to weight word instances more heavily
|
||||
depending on how you have classified them.
|
||||
The <i>weights</i> specify how heavily to weight each category of word:
|
||||
<pre
|
||||
>{<i>D-weight</i>, <i>A-weight</i>, <i>B-weight</i>, <i>C-weight</i>}</pre>
|
||||
If no weights are provided, then these defaults are used:
|
||||
<pre>{0.1, 0.2, 0.4, 1.0}</pre>
|
||||
Often weights are used to mark words from special areas of the document,
|
||||
like the title or an initial abstract,
|
||||
and make them more or less important than words in the document body.
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION rank_cd(<br>
|
||||
<em>[</em> <i>K</i> int4, <em>]</em>
|
||||
<i>vector</i> tsvector, <i>query</i> tsquery,
|
||||
<em>[</em> <i>normalization</i> int4 <em>]</em><br>
|
||||
) RETURNS float4</tt>
|
||||
<dd>
|
||||
This function computes the cover density ranking
|
||||
for the given document <i>vector</i> and <i>query</i>,
|
||||
as described in Clarke, Cormack, and Tudhope's
|
||||
“<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
|
||||
>Relevance Ranking for One to Three Term Queries</a>”
|
||||
in the 1999 <i>Information Processing and Management</i>.
|
||||
The value <i>K</i> is one of the values from their formula,
|
||||
and defaults to <i>K</i>=4.
|
||||
The examples in their paper use <i>K</i>=16;
|
||||
we can roughly describe the term
|
||||
as stating how far apart two search terms can fall
|
||||
before the formula begins penalizing them for lack of proximity.
|
||||
</dl>
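<p>
A small illustration of ranking (an example of our own; the vector is
built on the fly, so it still carries the position information the
ranking functions need):
<pre>
SELECT rank(to_tsvector('default', 'the fat cat sat on a mat'),
            to_tsquery('default', 'cat & mat'));
SELECT rank_cd(to_tsvector('default', 'the fat cat sat on a mat'),
               to_tsquery('default', 'cat & mat'));
</pre>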
|
||||
|
||||
<h2><a name="headlines">Headlines</a></h2>
|
||||
|
||||
<dl>
|
||||
<dt>
|
||||
<tt>CREATE FUNCTION headline(<br>
|
||||
<em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
|
||||
<i>document</i> text, <i>query</i> tsquery,
|
||||
<em>[</em> <i>options</i> text <em>]</em><br>
|
||||
) RETURNS text</tt>
|
||||
<dd>
|
||||
Every form of the <tt>headline()</tt> function
|
||||
accepts a <tt>document</tt> along with a <tt>query</tt>,
|
||||
and returns one or more ellipsis-separated excerpts from the document
|
||||
in which terms from the query are highlighted.
|
||||
The configuration with which to parse the document
|
||||
can be specified by either its <i>id</i> or <i>ts_name</i>;
|
||||
if none is specified then the current configuration is used instead.
|
||||
<p>
|
||||
An <i>options</i> string if provided should be a comma-separated list
|
||||
of one or more ‘<i>option</i><tt>=</tt><i>value</i>’ pairs.
|
||||
The available options are:
|
||||
<ul>
|
||||
<li><tt>StartSel</tt>, <tt>StopSel</tt> —
|
||||
the strings with which query words appearing in the document
|
||||
should be delimited to distinguish them from other excerpted words.
|
||||
<li><tt>MaxWords</tt>, <tt>MinWords</tt> —
|
||||
limits on the shortest and longest headlines you will accept.
|
||||
<li><tt>ShortWord</tt> —
|
||||
this prevents your headline from beginning or ending
|
||||
with a word which has this many characters or less.
|
||||
The default value of <tt>3</tt> should eliminate most English
|
||||
conjunctions and articles.
|
||||
</ul>
|
||||
Any unspecified options receive these defaults:
|
||||
<pre>
|
||||
StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
|
||||
</pre>
|
||||
</dl>
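<p>
For example, to excerpt a short document with mostly default options
(an illustration of our own, using the 'default' configuration):
<pre>
SELECT headline('default',
                'The quick brown fox jumped over the lazy dog',
                to_tsquery('default', 'fox & dog'),
                'MaxWords=10, MinWords=5');
</pre>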
|
||||
|
||||
</body>
|
||||
</html>