mirror of
				https://github.com/postgres/postgres.git
				synced 2025-10-25 13:17:41 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			1073 lines
		
	
	
		
			48 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
			1073 lines
		
	
	
		
			48 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 | |
| <html>
 | |
| <head>
 | |
|   <title>tsearch-v2-intro</title>
 | |
| </head>
 | |
| <body class="content">
 | |
| <div class="content">
 | |
| <h2>Tsearch2 - Introduction</h2>
 | |
| <p><a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html">
 | |
| [Online version]</a> of this document is available.</p>
 | |
| 
 | |
| <p>The tsearch2 module is available to add as an extension to the
 | |
| PostgreSQL database to allow for Full Text Indexing. This document
 | |
| is an introduction to installing, configuring, using and
 | |
| maintaining the database with the tsearch2 module activated.</p>
 | |
| <p>Please, note, tsearch2 module is fully incompatible with old
 | |
| tsearch, which is deprecated in 7.4 and will be obsoleted in
 | |
| 8.0.</p>
 | |
| <h3>USING TSEARCH2 AND POSTGRESQL FOR A WEB BASED SEARCH
 | |
| ENGINE</h3>
 | |
| <p>This documentation is provided as a short guide on how to
 | |
| quickly get up and running with tsearch2 and PostgreSQL, for those
 | |
| who want to implement a full text indexed based search engine. It
 | |
| is not meant to be a complete in-depth guide into the full ins and
 | |
| outs of the contrib/tsearch2 module, and is primarily aimed at
 | |
| beginners who want to speed up searching of large text fields, or
 | |
| those migrating from other database systems such as MS-SQL.</p>
 | |
| <p>The README.tsearch2 file included in the contrib/tsearch2
 | |
| directory contains a brief overview and history behind tsearch.
 | |
| This can also be found online <a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[right
 | |
| here]</a>.</p>
 | |
| <p>Further in depth documentation such as a full function
 | |
| reference, and user guide can be found online at the <a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/">[tsearch
 | |
| documentation home]</a>.</p>
 | |
| <h3>ACKNOWLEDGEMENTS</h3>
 | |
| 
 | |
| <p>Robert John Shepherd originally wrote this documentation for the
 | |
| previous version of tsearch module (v1) included with the postgres
 | |
| release. I took his documentation and updated it to comply with the
 | |
| tsearch2 modifications.</p>
 | |
| <p>Robert's original acknowledgements:</p>
 | |
| <p>"Thanks to Oleg Bartunov for taking the time to answer many of
 | |
| my questions regarding this module, and also to Teodor Sigaev for
 | |
| clearing up the process of making your own dictionaries. Plus of
 | |
| course a big thanks to the pair of them for writing this module in
 | |
| the first place!"</p>
 | |
| <p>I would also like to extend my thanks to the developers, and
 | |
| Oleg Bartunov for all of his direction and help with the new
 | |
| features of tsearch2.</p>
 | |
| <h3>OVERVIEW</h3>
 | |
| <p>MS-SQL provides a full text indexing (FTI) system which enables
 | |
| the fast searching of text based fields, very useful for websites
 | |
| (and other applications) that require a results set based on key
 | |
| words. PostgreSQL ships with a contributed module called tsearch2,
 | |
| which implements a special type of index that can also be used for
 | |
| full text indexing. Further more, unlike MS' offering which
 | |
| requires regular incremental rebuilds of the text indexes
 | |
| themselves, tsearch2 indexes are always up-to-date and keeping them
 | |
| so induces very little overhead.</p>
 | |
| <p>Before we get into the details, it is recommended that you have
 | |
| installed and tested PostgreSQL, are reasonably familiar with
 | |
| databases, the SQL query language and also understand the basics of
 | |
| connecting to PostgreSQL from the local shell. This document isn't
 | |
| intended for the complete PostgreSQL newbie, but anyone with a
 | |
| reasonable grasp of the basics should be able to follow it.</p>
 | |
| <h3>INSTALLATION</h3>
 | |
| <p>Starting with PostgreSQL version 7.4 tsearch2 is now included in
 | |
| the contrib directory with the PostgreSQL sources. contrib/tsearch2
 | |
| is where you will find everything needed to install and use
 | |
| tsearch2. Please note that tsearch2 will also work with PostgreSQL
 | |
| version 7.3.x, but it is not the module included with the source
 | |
| distribution. You will have to download the module separately and
 | |
| install it in the same fashion.</p>
 | |
| 
 | |
| <p>I installed the tsearch2 module to a PostgreSQL 7.3 database
 | |
| from the contrib directory without squashing the original (old)
 | |
| tsearch module. What I did was move the modules tsearch src
 | |
| driectory into the contrib tree under the name tsearchV2.</p>
 | |
| <p>Step one is to download the tsearch V2 module :</p>
 | |
| <p><a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">[http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/]</a>
 | |
| (check Development History for latest stable version !)</p>
 | |
| <pre>
 | |
|         tar -zxvf tsearch-v2.tar.gz
 | |
|         mv tsearch2 $PGSQL_SRC/contrib/
 | |
|         cd $PGSQL_SRC/contrib/tsearch2
 | |
| </pre>
 | |
| <p>If you are installing from PostgreSQL version 7.4 or higher, you
 | |
| can skip those steps and just change to the contrib/tsearch2
 | |
| directory in the source tree and continue from there.</p>
 | |
| <p>As of May 9, 2004 there is a source patch available for
 | |
| tsearch2. The patch provides changes to the pg_ts_ configuration
 | |
| tables to allow for easy dump and restore of a database containing
 | |
| tsearch2. The patch is available here : <a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz">
 | |
| [http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_7.4.patch.gz]</a></p>
 | |
| 
 | |
| <p>To apply this patch, download the mentioned file and place it in
 | |
| your postgreSQL source tree ($PGSQL_SRC). This patch is not
 | |
| required for tsearch2 to work. I would however, highly recommend it
 | |
| as it makes the backup and restore procedures very simple.</p>
 | |
| <pre>
 | |
|       cd $PGSQL_SRC
 | |
|         gunzip regprocedure_7.4.patch.gz
 | |
|         patch -b -p1 < regprocedure_7.4.patch
 | |
| </pre>
 | |
| <p>If you have a working version of tsearch2 in your database, you
 | |
| do not need to re-install the tsearch2 module. Just apply the patch
 | |
| and run make. This patch only affects the tsearch2.sql file. You
 | |
| can run the SQL script found : <a href="
 | |
| http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/regprocedure_update.sql">
 | |
| [right here]</a> This script will make the modifications found in
 | |
| the patch, and update the fields from the existing data. From this
 | |
| point on, you can dump and restore the database in a normal
 | |
| fashion. Without this patch, you must follow the instructions later
 | |
| in this document for backup and restore.</p>
 | |
| <p>This patch is only needed for tsearch2 in PostgreSQL versions
 | |
| 7.3.x and 7.4.x. The patch has been applied to the sources for
 | |
| 8.0.x.</p>
 | |
| <p>When you have your source tree for tsearch2 ready, you can
 | |
| continue with the regular building and installation process</p>
 | |
| 
 | |
| <pre>
 | |
|         gmake
 | |
|         gmake install
 | |
|         gmake installcheck
 | |
| </pre>
 | |
| <p>That is pretty much all you have to do, unless of course you get
 | |
| errors. However if you get those, you better go check with the
 | |
| mailing lists over at <a href="
 | |
| http://www.postgresql.org">http://www.postgresql.org</a> or
 | |
| <a href="
 | |
| http://openfts.sourceforge.net/">http://openfts.sourceforge.net/</a>
 | |
| since its never failed for me.</p>
 | |
| <p>If you ever need to revert this patch, and go back to the
 | |
| unpatched version of tsearch2, it is simple if you followed the
 | |
| above patch command. The -b option creates a backup of the original
 | |
| file, so we can just copy it back.</p>
 | |
| <pre>
 | |
|      cd $PGSQL_SRC/contrib/tsearch2
 | |
|         cp tsearch.sql.in.orig tsearch.sql.in
 | |
|         make
 | |
| 
 | |
| </pre>
 | |
| <p>If you need the patched version again, just follow the patch
 | |
| instructions again.</p>
 | |
| <p>The directory in the contib/ and the directory from the archive
 | |
| is called tsearch2. Tsearch2 is completely incompatible with the
 | |
| previous version of tsearch. This means that both versions can be
 | |
| installed into a single database, and migration the new version may
 | |
| be much easier.</p>
 | |
| <p>NOTE: the previous version of tsearch found in the
 | |
| contrib/tsearch directory is depricated. Although it is still
 | |
| available and included within PostgreSQL version 7.4. It will be
 | |
| removed in version 8.0.</p>
 | |
| <h3>ADDING TSEARCH2 FUNCTIONALITY TO A DATABASE</h3>
 | |
| <p>We should create a database to use as an example for the
 | |
| remainder of this file. We can call the database "ftstest". You can
 | |
| create it from the command line like this:</p>
 | |
| <pre>
 | |
|         #createdb ftstest
 | |
| </pre>
 | |
| <p>If you thought installation was easy, this next bit is even
 | |
| easier. Change to the PGSQL_SRC/contrib/tsearch2 directory and
 | |
| type:</p>
 | |
| 
 | |
| <pre>
 | |
|         psql ftstest < tsearch2.sql
 | |
| </pre>
 | |
| <p>The file "tsearch2.sql" holds all the wonderful little goodies
 | |
| you need to do full text indexing. It defines numerous functions
 | |
| and operators, and creates the needed tables in the database. There
 | |
| will be 4 new tables created after running the tsearch2.sql file :
 | |
| pg_ts_dict, pg_ts_parser, pg_ts_cfg, pg_ts_cfgmap are added.</p>
 | |
| <p>You can check out the tables if you like:</p>
 | |
| <pre>
 | |
|         #psql ftstest
 | |
|         ftstest=# \d
 | |
|                     List of relations
 | |
|          Schema |     Name     | Type  |  Owner
 | |
|         --------+--------------+-------+----------
 | |
|          public | pg_ts_cfg    | table | kopciuch
 | |
|          public | pg_ts_cfgmap | table | kopciuch
 | |
|          public | pg_ts_dict   | table | kopciuch
 | |
|          public | pg_ts_parser | table | kopciuch
 | |
|         (4 rows)
 | |
| </pre>
 | |
| <h3>TYPES AND FUNCTIONS PROVIDED BY TSEARCH2</h3>
 | |
| <p>The first thing we can do is try out some of the types that are
 | |
| provided for us. Lets look at the tsvector type provided for
 | |
| us:</p>
 | |
| 
 | |
| <pre>
 | |
|         SELECT 'Our first string used today'::tsvector;
 | |
|                         tsvector
 | |
|         ---------------------------------------
 | |
|          'Our' 'used' 'first' 'today' 'string'
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>The results are the words used within our string. Notice they
 | |
| are not in any particular order. The tsvector type returns a string
 | |
| of space separated words.</p>
 | |
| <pre>
 | |
|         SELECT 'Our first string used today first string'::tsvector;
 | |
|                             tsvector
 | |
|         -----------------------------------------------
 | |
|          'Our' 'used' 'first' 'today' 'string'
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Notice the results string has each unique word ('first' and
 | |
| 'string' only appear once in the tsvector value). Which of course
 | |
| makes sense if you are searching the full text ... you only need to
 | |
| know each unique word in the text.</p>
 | |
| <p>Those examples were just casting a text field to that of type
 | |
| tsvector. Lets check out one of the new functions created by the
 | |
| tsearch2 module.</p>
 | |
| <p>The function to_tsvector has 3 possible signatures:</p>
 | |
| <pre>
 | |
| 
 | |
|         to_tsvector(oid, text);
 | |
|         to_tsvector(text, text);
 | |
|         to_tsvector(text);
 | |
| </pre>
 | |
| <p>We will use the second method using two text fields. The
 | |
| overloaded methods provide us with a way to specifiy the way the
 | |
| searchable text is broken up into words (Stemming process). Right
 | |
| now we will specify the 'default' configuration. See the section on
 | |
| TSEARCH2 CONFIGURATION to learn more about this.</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('default',
 | |
|                            'Our first string used today first string');
 | |
|                         to_tsvector
 | |
|         --------------------------------------------
 | |
|          'use':4 'first':2,6 'today':5 'string':3,7
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>The result returned from this function is of type tsvector. The
 | |
| results came about by this reasoning: All of the words in the text
 | |
| passed in are stemmed, or not used because they are stop words
 | |
| defined in our configuration. Each lower case morphed word is
 | |
| returned with all of the positons in the text.</p>
 | |
| <p>In this case the word "Our" is a stop word in the default
 | |
| configuration. That means it will not be included in the result.
 | |
| The word "first" is found at positions 2 and 6 (although "Our" is a
 | |
| stop word, it's position is maintained). The word(s) positioning is
 | |
| maintained exactly as in the original string. The word "used" is
 | |
| morphed to the word "use" based on the default configuration for
 | |
| word stemming, and is found at position 4. The rest of the results
 | |
| follow the same logic. Just a reminder again ... the order of the
 | |
| 'word' position in the output is not in any kind of order. (ie
 | |
| 'use':4 appears first)</p>
 | |
| <p>If you want to view the output of the tsvector fields without
 | |
| their positions, you can do so with the function
 | |
| "strip(tsvector)".</p>
 | |
| <pre>
 | |
|         SELECT strip(to_tsvector('default',
 | |
|                      'Our first string used today first string'));
 | |
|                     strip
 | |
|         --------------------------------
 | |
|          'use' 'first' 'today' 'string'
 | |
| 
 | |
| </pre>
 | |
| <p>If you wish to know the number of unique words returned in the
 | |
| tsvector you can do so by using the function "length(tsvector)"</p>
 | |
| <pre>
 | |
|         SELECT length(to_tsvector('default',
 | |
|                       'Our first string used today first string'));
 | |
|          length
 | |
|         --------
 | |
|               4
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Lets take a look at the function to_tsquery. It also has 3
 | |
| signatures which follow the same rational as the to_tsvector
 | |
| function:</p>
 | |
| <pre>
 | |
|         to_tsquery(oid, text);
 | |
|         to_tsquery(text, text);
 | |
|         to_tsquery(text);
 | |
| </pre>
 | |
| <p>Lets try using the function with a single word :</p>
 | |
| <pre>
 | |
|         SELECT to_tsquery('default', 'word');
 | |
|          to_tsquery
 | |
|         -----------
 | |
|          'word'
 | |
|          (1 row)
 | |
| 
 | |
| </pre>
 | |
| <p>I call the function the same way I would a to_tsvector function,
 | |
| specifying the 'default' configuration for morphing, and the result
 | |
| is the stemmed output 'word'.</p>
 | |
| <p>Lets attempt to use the function with a string of multiple
 | |
| words:</p>
 | |
| <pre>
 | |
|         SELECT to_tsquery('default', 'this is many words');
 | |
|         ERROR:  Syntax error
 | |
| </pre>
 | |
| <p>The function can not accept a space separated string. The
 | |
| intention of the to_tsquery function is to return a type of
 | |
| "tsquery" used for searching a tsvector field. What we need to do
 | |
| is search for one to many words with some kind of logic (for now
 | |
| simple boolean).</p>
 | |
| <pre>
 | |
|         SELECT to_tsquery('default', 'searching|sentence');
 | |
|               to_tsquery
 | |
|         ----------------------
 | |
|          'search' | 'sentenc'
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Notice that the words are separated by the boolean logic "OR",
 | |
| the text could contain boolean operators &,|,!,() with their
 | |
| usual meaning.</p>
 | |
| 
 | |
| <p>You can not use words defined as being a stop word in your
 | |
| configuration. The function will not fail ... you will just get no
 | |
| result, and a NOTICE like this:</p>
 | |
| <pre>
 | |
|         SELECT to_tsquery('default', 'a|is&not|!the');
 | |
|         NOTICE:  Query contains only stopword(s)
 | |
|                  or doesn't contain lexem(s), ignored
 | |
|          to_tsquery
 | |
|         -----------
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>That is a beginning to using the types, and functions defined in
 | |
| the tsearch2 module. There are numerous more functions that I have
 | |
| not touched on. You can read through the tsearch2.sql file built
 | |
| when compiling to get more familiar with what is included.</p>
 | |
| <h3>INDEXING FIELDS IN A TABLE</h3>
 | |
| <p>The next stage is to add a full text index to an existing table.
 | |
| In this example we already have a table defined as follows:</p>
 | |
| <pre>
 | |
|         CREATE TABLE tblMessages
 | |
|         (
 | |
|                 intIndex        int4,
 | |
|                 strTopic        varchar(100),
 | |
|                 strMessage      text
 | |
|         );
 | |
| </pre>
 | |
| 
 | |
| <p>We are assuming there are several rows with some kind of data in
 | |
| them. Any data will do, just do several inserts with test strings
 | |
| for a topic, and a message. here is some test data I inserted. (yes
 | |
| I know it's completely useless stuff ;-) but it will serve our
 | |
| purpose right now).</p>
 | |
| <pre>
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('1', 'Testing Topic', 'Testing message data input');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('2', 'Movie', 'Breakfast at Tiffany\'s');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('3', 'Famous Author', 'Stephen King');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('4', 'Political Topic',
 | |
|                             'Nelson Mandella is released from prison');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('5', 'Nursery rhyme phrase',
 | |
|                             'Little jack horner sat in a corner');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('6', 'Gettysburg address quotation',
 | |
|                             'Four score and seven years ago'
 | |
|                             ' our fathers brought forth on this'
 | |
|                             ' continent a new nation, conceived in'
 | |
|                             ' liberty and dedicated to the proposition'
 | |
|                             ' that all men are created equal');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('7', 'Classic Rock Bands',
 | |
|                             'Led Zeppelin Grateful Dead and The Sex Pistols');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('8', 'My birth address',
 | |
|                             '18 Sommervile road, Regina, Saskatchewan');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('9', 'Joke', 'knock knock : who\'s there?'
 | |
|                                     ' I will not finish this joke');
 | |
|         INSERT INTO tblMessages
 | |
|                VALUES ('10', 'Computer information',
 | |
|                              'My computer is a pentium III 400 mHz'
 | |
|                              ' with 192 megabytes of RAM');
 | |
| </pre>
 | |
| <p>The next stage is to create a special text index which we will
 | |
| use for FTI, so we can search our table of messages for words or a
 | |
| phrase. We do this using the SQL command:</p>
 | |
| <pre>
 | |
|         ALTER TABLE tblMessages ADD COLUMN idxFTI tsvector;
 | |
| </pre>
 | |
| <p>Note that unlike traditional indexes, this is actually a new
 | |
| field in the same table, which is then used (through the magic of
 | |
| the tsearch2 operators and functions) by a special index we will
 | |
| create in a moment.</p>
 | |
| <p>The general rule for the initial insertion of data will follow
 | |
| four steps:</p>
 | |
| <pre>
 | |
| 
 | |
|     1. update table
 | |
|     2. vacuum full analyze
 | |
|     3. create index
 | |
|     4. vacuum full analyze
 | |
| </pre>
 | |
| <p>The data can be updated into the table, the vacuum full analyze
 | |
| will reclaim unused space. The index can be created on the table
 | |
| after the data has been inserted. Having the index created prior to
 | |
| the update will slow down the process. It can be done in that
 | |
| manner, this way is just more efficient. After the index has been
 | |
| created on the table, vacuum full analyze is run again to update
 | |
| postgres's statistics (ie having the index take effect).</p>
 | |
| <pre>
 | |
|         UPDATE tblMessages SET idxFTI=to_tsvector('default', strMessage);
 | |
|         VACUUM FULL ANALYZE;
 | |
| </pre>
 | |
| <p>Note that this only inserts the field strMessage as a tsvector,
 | |
| so if you want to also add strTopic to the information stored, you
 | |
| should instead do the following, which effectively concatenates the
 | |
| two fields into one before being inserted into the table:</p>
 | |
| <pre>
 | |
|         UPDATE tblMessages
 | |
|             SET idxFTI=to_tsvector('default',coalesce(strTopic,'') ||' '|| coalesce(strMessage,''));
 | |
|         VACUUM FULL ANALYZE;
 | |
| </pre>
 | |
| <p><strong>Using the coalesce function makes sure this
 | |
| concatenation also works with NULL fields.</strong></p>
 | |
| 
 | |
| <p>We need to create the index on the column idxFTI. Keep in mind
 | |
| that the database will update the index when some action is taken.
 | |
| In this case we _need_ the index (The whole point of Full Text
 | |
| INDEXINGi ;-)), so don't worry about any indexing overhead. We will
 | |
| create an index based on the gist function. GiST is an index
 | |
| structure for Generalized Search Tree.</p>
 | |
| <pre>
 | |
|         CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);
 | |
|         VACUUM FULL ANALYZE;
 | |
| </pre>
 | |
| <p>After you have converted all of your data and indexed the
 | |
| column, you can select some rows to see what actually happened. I
 | |
| will not display output here but you can play around yourselves and
 | |
| see what happened.</p>
 | |
| <p>The last thing to do is set up a trigger so every time a row in
 | |
| this table is changed, the text index is automatically updated.
 | |
| This is easily done using:</p>
 | |
| <pre>
 | |
|         CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
 | |
|             FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, strMessage);
 | |
| </pre>
 | |
| <p>Or if you are indexing both strMessage and strTopic you should
 | |
| instead do:</p>
 | |
| <pre>
 | |
| 
 | |
|         CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
 | |
|             FOR EACH ROW EXECUTE PROCEDURE
 | |
|                 tsearch2(idxFTI, strTopic, strMessage);
 | |
| </pre>
 | |
| <p>Before you ask, the tsearch2 function accepts multiple fields as
 | |
| arguments so there is no need to concatenate the two into one like
 | |
| we did before.</p>
 | |
| <p>If you want to do something specific with columns, you may write
 | |
| your very own trigger function using plpgsql or other procedural
 | |
| languages (but not SQL, unfortunately) and use it instead of
 | |
| <em>tsearch2</em> trigger.</p>
 | |
| <p>You could however call other stored procedures from within the
 | |
| tsearch2 function. Lets say we want to create a function to remove
 | |
| certain characters (like the @ symbol from all text).</p>
 | |
| <pre>
 | |
|        CREATE FUNCTION dropatsymbol(text) 
 | |
|                      RETURNS text AS 'select replace($1, \'@\', \' \');' LANGUAGE SQL;
 | |
| </pre>
 | |
| <p>Now we can use this function within the tsearch2 function on the
 | |
| trigger.</p>
 | |
| 
 | |
| <pre>
 | |
|       DROP TRIGGER tsvectorupdate ON tblmessages;
 | |
|         CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT ON tblMessages
 | |
|             FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxFTI, dropatsymbol, strMessage);
 | |
|         INSERT INTO tblmessages VALUES (69, 'Attempt for dropatsymbol', 'Test@test.com');
 | |
| </pre>
 | |
| <p>If at this point you receive an error stating: ERROR: Can't find
 | |
| tsearch config by locale</p>
 | |
| <p>Do not worry. You have done nothing wrong. And tsearch2 is not
 | |
| broken. All that has happened here is that the configuration is
 | |
| setup to use a configuration based on the locale of the server. All
 | |
| you have to do is change your default configuration, or add a new
 | |
| one for your specific locale. See the section on TSEARCH2
 | |
| CONFIGURATION.</p>
 | |
| <pre class="real">
 | |
|    SELECT * FROM tblmessages WHERE intindex = 69;
 | |
| 
 | |
|          intindex |         strtopic         |  strmessage   |        idxfti
 | |
|         ----------+--------------------------+---------------+-----------------------   
 | |
|                 69 | Attempt for dropatsymbol | Test@test.com | 'test':1 'test.com':2
 | |
|         (1 row)
 | |
| </pre>
 | |
| Notice that the string content was passed throught the stored
 | |
| procedure dropatsymbol. The '@' character was replaced with a
 | |
| single space ... and the output from the procedure was then stored
 | |
| in the tsvector column.
 | |
| <p>This could be useful for removing other characters from indexed
 | |
| text, or any kind of preprocessing needed to be done on the text
 | |
| prior to insertion into the index.</p>
 | |
| <h3>QUERYING A TABLE</h3>
 | |
| 
 | |
| <p>There are some examples in the README.tsearch2 file for querying
 | |
| a table. One major difference between tsearch and tsearch2 is the
 | |
| operator ## is no longer available. Only the operator @@ is
 | |
| defined, using the types tsvector on one side and tsquery on the
 | |
| other side.</p>
 | |
| <p>Lets search the indexed data for the word "Test". I indexed
 | |
| based on the the concatenation of the strTopic, and the
 | |
| strMessage:</p>
 | |
| <pre>
 | |
|         SELECT intindex, strtopic FROM tblmessages
 | |
|                                   WHERE idxfti @@ 'test'::tsquery;
 | |
|          intindex |   strtopic
 | |
|         ----------+---------------
 | |
|                 1 | Testing Topic
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>The only result that matched was the row with a topic "Testing
 | |
| Topic". Notice that the word I search for was all lowercase. Let's
 | |
| see what happens when I query for uppercase "Test".</p>
 | |
| <pre>
 | |
|         SELECT intindex, strtopic FROM tblmessages
 | |
|                                   WHERE idxfti @@ 'Test'::tsquery;
 | |
|          intindex | strtopic
 | |
|         ----------+----------
 | |
|         (0 rows)
 | |
| </pre>
 | |
| <p>We get zero rows returned. The reason is because when the text
 | |
| was inserted, it was morphed to my default configuration (because
 | |
| of the call to to_tsvector in the UPDATE statement). If there was
 | |
| no morphing done, and the tsvector field(s) contained the word
 | |
| 'Text', a match would have been found.</p>
 | |
| <p>Most likely the best way to query the field is to use the
 | |
| to_tsquery function on the right hand side of the @@ operator like
 | |
| this:</p>
 | |
| 
 | |
| <pre>
 | |
|         SELECT intindex, strtopic FROM tblmessages
 | |
|                WHERE idxfti @@ to_tsquery('default', 'Test | Zeppelin');
 | |
|          intindex |      strtopic
 | |
|         ----------+--------------------
 | |
|                 1 | Testing Topic
 | |
|                 7 | Classic Rock Bands
 | |
|         (2 rows)
 | |
| </pre>
 | |
| <p>That query searched for all instances of "Test" OR "Zeppelin".
 | |
| It returned two rows: the "Testing Topic" row, and the "Classic
 | |
| Rock Bands" row. The to_tsquery function performed the correct
 | |
| morphology upon the parameters, and searched the tsvector field
 | |
| appropriately.</p>
 | |
| <p>The last example here relates to searching for a phrase, for
 | |
| example "minority report". This poses a problem with regard to
 | |
| tsearch2, as it doesn't index phrases, only words. But there is a
 | |
| way around which doesn't appear to have a significant impact on
 | |
| query time, and that is to use a query such as the following:</p>
 | |
| <pre>
 | |
|         SELECT intindex, strTopic FROM tblmessages
 | |
|                 WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
 | |
|                 AND strMessage ~* '.*men are created equal.*';
 | |
|          intindex |           strtopic
 | |
|         ----------+------------------------------
 | |
|                 6 | Gettysburg address quotation
 | |
|         (1 row)
 | |
|         SELECT intindex, strTopic FROM tblmessages
 | |
|                 WHERE idxfti @@ to_tsquery('default', 'gettysburg & address')
 | |
|                 AND strMessage ~* '.*something that does not exist.*';
 | |
|          intindex | strtopic
 | |
|         ----------+----------
 | |
|         (0 rows)
 | |
| </pre>
 | |
| <p>Of course if your indexing both strTopic and strMessage, and
 | |
| want to search for this phrase on both, then you will have to get
 | |
| out the brackets and extend this query a little more.</p>
 | |
| 
 | |
| <h3>TSEARCH2 CONFIGURATION</h3>
 | |
| <p>Some words such as "and", "the", and "who" are automatically not
 | |
| indexed, since they belong to a pre-existing dictionary of "Stop
 | |
| Words" which tsearch2 does not perform indexing on. If someone
 | |
| needs to search for "The Who" in your database, they are going to
 | |
| have a tough time coming up with any results, since both are
 | |
| ignored in the indexes. But there is a solution.</p>
 | |
| <p>Lets say we want to add a word into the stop word list for
 | |
| english stemming. We could edit the file
 | |
| :'/usr/local/pgsql/share/english.stop' and add a word to the list.
 | |
| I edited mine to exclude my name from indexing:</p>
 | |
| <pre>
 | |
|     - Edit /usr/local/pgsql/share/english.stop
 | |
|     - Add 'andy' to the list
 | |
|     - Save the file.
 | |
| </pre>
 | |
| <p>When you connect to the database, the dict_init procedure is run
 | |
| during initialization. And in my configuration it will read the
 | |
| stop words from the file I just edited. If you were connected to
 | |
| the DB while editing the stop words, you will need to end the
 | |
| current session and re-connect. When you re-connect to the
 | |
| database, 'andy' is no longer indexed:</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('default', 'Andy');
 | |
|          to_tsvector
 | |
|         ------------
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Originally I would get the result :</p>
 | |
| 
 | |
| <pre>
 | |
|         SELECT to_tsvector('default', 'Andy');
 | |
|          to_tsvector
 | |
|         ------------
 | |
|          'andi':1
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>But since I added it as a stop word, it would be ingnored on the
 | |
| indexing. The stop word added was used in the dictionary "en_stem".
 | |
| If I were to use a different configuration such as 'simple', the
 | |
| results would be different. There are no stop words for the simple
 | |
| dictionary. It will just convert to lower case, and index every
 | |
| unique word.</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('simple', 'Andy andy The the in out');
 | |
|                      to_tsvector
 | |
|         -------------------------------------
 | |
|          'in':5 'out':6 'the':3,4 'andy':1,2
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>All this talk about which configuration to use is leading us
 | |
| into the actual configuration of tsearch2. In the examples in this
 | |
| document the configuration has always been specified when using the
 | |
| tsearch2 functions:</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('default', 'Testing the default config');
 | |
|         SELECT to_tsvector('simple', 'Example of simple Config');
 | |
| </pre>
 | |
| <p>The pg_ts_cfg table holds each configuration you can use with
 | |
| the tsearch2 functions. As you can see the ts_name column contains
 | |
| both the 'default' configurations based on the 'C' locale. And the
 | |
| 'simple' configuration which is not based on any locale.</p>
 | |
| 
 | |
| <pre>
 | |
|         SELECT * from pg_ts_cfg;
 | |
|              ts_name     | prs_name |    locale
 | |
|         -----------------+----------+--------------
 | |
|          default         | default  | C
 | |
|          default_russian | default  | ru_RU.KOI8-R
 | |
|          simple          | default  |
 | |
|         (3 rows)
 | |
| </pre>
 | |
| <p>Each row in the pg_ts_cfg table contains the name of the
 | |
| tsearch2 configuration, the name of the parser to use, and the
 | |
| locale mapped to the configuration. There is only one parser to
 | |
| choose from the table pg_ts_parser called 'default'. More parsers
 | |
| could be written, but for our needs we will use the default.</p>
 | |
| <p>There are 3 configurations installed by tsearch2 initially. If
 | |
| your locale is set to 'en_US' for example (like my laptop), then as
 | |
| you can see there is currently no dictionary configured to use with
 | |
| that locale. You can either set up a new configuration or just use
 | |
| one that already exists. If I do not specify which configuration to
 | |
| use in the to_tsvector function, I receive the following error.</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('learning tsearch is like going to school');
 | |
|         ERROR:  Can't find tsearch config by locale
 | |
| </pre>
 | |
| <p>We will create a new configuration for use with the server
 | |
| encoding 'en_US'. The first step is to add a new configuration into
 | |
| the pg_ts_cfg table. We will call the configuration
 | |
| 'default_english', with the default parser and use the locale
 | |
| 'en_US'.</p>
 | |
| <pre>
 | |
|         INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
 | |
|                VALUES ('default_english', 'default', 'en_US');
 | |
| 
 | |
| </pre>
 | |
| <p>We have only declared that there is a configuration called
 | |
| 'default_english'. We need to set the configuration of how
 | |
| 'default_english' will work. The next step is creating a new
 | |
| dictionary to use. The configuration of the dictionary is
 | |
| completlely different in tsearch2. In the prior versions to make
 | |
| changes, you would have to re-compile your changes into the
 | |
| tsearch.so. All of the configuration has now been moved into the
 | |
| system tables created by executing the SQL code from
 | |
| tsearch2.sql</p>
 | |
| <p>Lets take a first look at the pg_ts_dict table</p>
 | |
| <pre>
 | |
|   ftstest=# \d pg_ts_dict
 | |
|          Table "public.pg_ts_dict"
 | |
|              Column      |     Type     | Modifiers
 | |
|         -----------------+--------------+-----------
 | |
|          dict_name       | text         | not null
 | |
|          dict_init       | regprocedure |
 | |
|          dict_initoption | text         |
 | |
|          dict_lexize     | regprocedure | not null
 | |
|          dict_comment    | text         |
 | |
|         Indexes: pg_ts_dict_pkey primary key btree (dict_name)
 | |
| </pre>
 | |
| <p>The dict_name column is the name of the dictionary, for example
 | |
| 'simple', 'en_stem' or 'ru_stem'. The dict_init column is a text
 | |
| representation of a stored procedure to run for initialization of
 | |
| that dictionary, for example 'snb_en_init(text)' or
 | |
| 'snb_ru_init(text)'. The initial configuration of tsearch2 had the
 | |
| dict_init and dict_lexize columns as type oid. The patch mentioned
 | |
| in the Installation Notes changes these types to regprocedure. The
 | |
| data inserted, or updated can still be the oid of the stored
 | |
| procedure. The representation is just different. This makes backup
 | |
| and restore procedures much easier for tsearch2. The dict_init
 | |
| option is used for options passed to the init function for the
 | |
| stored procedure. In the cases of 'en_stem' or 'ru_stem' it is a
 | |
| path to a stopword file for that dictionary, for example
 | |
| '/usr/local/pgsql/share/english.stop'. This is however dictated by
 | |
| the dictionary. ISpell dictionaries may require different options.
 | |
| The dict_lexize column is another OID of a stored procedure to the
 | |
| function used to lexize, for example 'snb_lexize(internal,
 | |
| internal, integer)'. The dict_comment column is just a comment.</p>
 | |
| <p>Next we will configure the use of a new dictionary based on
 | |
| ISpell. We will assume you have ISpell installed on you machine.
 | |
| (in /usr/local/lib)</p>
 | |
| <p>There has been some confusion in the past as to which files are
 | |
| used from ISpell. ISpell operates using a hash file. This is a
 | |
| binary file created by the ISpell command line utility "buildhash".
 | |
| This utility accepts a file containing the words from the
 | |
| dictionary, and the affixes file and the output is the hash file.
 | |
| The default installation of ISPell installs the english hash file
 | |
| english.hash, which is the exact same file as american.hash. ISpell
 | |
| uses this as the fallback dictionary to use.</p>
 | |
| <p>This hash file is not what tsearch2 requires as the ISpell
 | |
| interface. The file(s) needed are those used to create the hash.
 | |
| Tsearch uses the dictionary words for morphology, so the listing is
 | |
| needed not spellchecking. Regardless, these files are included in
 | |
| the ISpell sources, and you can use them to integrate into
 | |
| tsearch2. This is not complicated, but is not very obvious to begin
 | |
| with. The tsearch2 ISpell interface needs only the listing of
 | |
| dictionary words, it will parse and load those words, and use the
 | |
| ISpell dictionary for lexem processing.</p>
 | |
| 
 | |
| <p>I found the ISPell make system to be very finicky. Their
 | |
| documentation actually states this to be the case. So I just did
 | |
| things the command line way. In the ISpell source tree under
 | |
| languages/english there are several files in this directory. For a
 | |
| complete description, please read the ISpell README. Basically for
 | |
| the english dictionary there is the option to create the small,
 | |
| medium, large and extra large dictionaries. The medium dictionary
 | |
| is recommended. If the make system is configured correctly, it
 | |
| would build and install the english.has file from the medium size
 | |
| dictionary. Since we are only concerned with the dictionary word
 | |
| listing ... it can be created from the /languages/english directory
 | |
| with the following command:</p>
 | |
| <pre>
 | |
|        sort -u -t/ +0f -1 +0 -T /usr/tmp -o english.med english.0 english.1
 | |
| </pre>
 | |
| <p>This will create a file called english.med. You can copy this
 | |
| file to whever you like. I placed mine in /usr/local/lib so it
 | |
| coincides with the ISpell hash files. You can now add the tsearch2
 | |
| configuration entry for the ISpell english dictionary. We will also
 | |
| continue to use the english word stop file that was installed for
 | |
| the en_stem dictionary. You could use a different one if you like.
 | |
| The ISpell configuration is based on the "ispell_template"
 | |
| dictionary installed by default with tsearch2. We will use the OIDs
 | |
| to the stored procedures from the row where the dict_name =
 | |
| 'ispell_template'.</p>
 | |
| <pre>
 | |
|         INSERT INTO pg_ts_dict
 | |
|                (SELECT 'en_ispell',
 | |
|                        dict_init,
 | |
|                        'DictFile="/usr/local/lib/english.med",'
 | |
|                        'AffFile="/usr/local/lib/english.aff",'
 | |
|                        'StopFile="/usr/local/pgsql/share/contrib/english.stop"',
 | |
|                        dict_lexize
 | |
|                 FROM pg_ts_dict
 | |
|                 WHERE dict_name = 'ispell_template');
 | |
| </pre>
 | |
| <p>Now that we have a dictionary we can specify it's use in a query
 | |
| to get a lexem. For this we will use the lexize function. The
 | |
| lexize function takes the name of the dictionary to use as an
 | |
| argument. Just as the other tsearch2 functions operate. You will
 | |
| need to stop your psql session and start it again in order for this
 | |
| modification to take place.</p>
 | |
| <pre>
 | |
|   SELECT lexize('en_ispell', 'program');
 | |
|           lexize
 | |
|         -----------
 | |
|          {program}
 | |
|         (1 row)
 | |
| 
 | |
| </pre>
 | |
| <p>If you wanted to always use the ISpell english dictionary you
 | |
| have installed, you can configure tsearch2 to always use a specific
 | |
| dictionary.</p>
 | |
| <pre>
 | |
|   SELECT set_curdict('en_ispell');
 | |
| </pre>
 | |
| <p>Lexize is meant to turn a word into a lexem. It is possible to
 | |
| receive more than one lexem returned for a single word.</p>
 | |
| <pre>
 | |
|  SELECT lexize('en_ispell', 'conditionally');
 | |
|            lexize
 | |
|         -----------------------------
 | |
|          {conditionally,conditional}
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>The lexize function is not meant to take a full string as an
 | |
| argument to return lexems for. If you passed in an entire sentence,
 | |
| it attempts to find that entire sentence in the dictionary. Since
 | |
| the dictionary contains only words, you will receive an empty
 | |
| result set back.</p>
 | |
| <pre>
 | |
|       SELECT lexize('en_ispell', 'This is a senctece to lexize');
 | |
|          lexize
 | |
|         --------
 | |
|         
 | |
|         (1 row)
 | |
|         
 | |
| If you parse a lexem from a word not in the dictionary, then you will receive an empty result. This makes sense because the word "tsearch" is not in the english dictionary. You can create your own additions to the dictionary if you like. This may be useful for scientific or technical glossaries that need to be indexed. SELECT lexize('en_ispell', 'tsearch'); lexize -------- (1 row)
 | |
| 
 | |
| </pre>
 | |
| <p>This is not to say that tsearch will be ignored when adding text
 | |
| information to the the tsvector index column. This will be
 | |
| explained in greater detail with the table pg_ts_cfgmap.</p>
 | |
| <p>Next we need to set up the configuration for mapping the
 | |
| dictionay use to the lexxem parsings. This will be done by altering
 | |
| the pg_ts_cfgmap table. We will insert several rows, specifying to
 | |
| use the new dictionary we installed and configured for lexizing
 | |
| within tsearch2. There are several type of lexims we would be
 | |
| concerned with forcing the use of the ISpell dictionary.</p>
 | |
| <pre>
 | |
|         INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
 | |
|                VALUES ('default_english', 'lhword', '{en_ispell,en_stem}');
 | |
|         INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
 | |
|                VALUES ('default_english', 'lpart_hword', '{en_ispell,en_stem}');
 | |
|         INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
 | |
|                VALUES ('default_english', 'lword', '{en_ispell,en_stem}');
 | |
| </pre>
 | |
| <p>We have just inserted 3 records to the configuration mapping,
 | |
| specifying that the lexem types for "lhword, lpart_hword and lword"
 | |
| are to be stemmed using the 'en_ispell' dictionary we added into
 | |
| pg_ts_dict, when using the configuration ' default_english' which
 | |
| we added to pg_ts_cfg.</p>
 | |
| <p>There are several other lexem types used that we do not need to
 | |
| specify as using the ISpell dictionary. We can simply insert values
 | |
| using the 'simple' stemming process dictionary.</p>
 | |
| <pre>
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'url', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'host', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'sfloat', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'uri', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'int', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'float', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'email', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'word', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'hword', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'nlword', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'nlpart_hword', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'part_hword', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'nlhword', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'file', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'uint', '{simple}');
 | |
|         INSERT INTO pg_ts_cfgmap
 | |
|                VALUES ('default_english', 'version', '{simple}');
 | |
| </pre>
 | |
| 
 | |
| <p>Our addition of a configuration for 'default_english' is now
 | |
| complete. We have successfully created a new tsearch2
 | |
| configuration. At the same time we have also set the new
 | |
| configuration to be our default for en_US locale.</p>
 | |
| <pre>
 | |
|         SELECT to_tsvector('default_english',
 | |
|                            'learning tsearch is like going to school');
 | |
|                            to_tsvector
 | |
|         --------------------------------------------------
 | |
|          'go':5 'like':4 'learn':1 'school':7 'tsearch':2
 | |
|         SELECT to_tsvector('learning tsearch is like going to school');
 | |
|                             to_tsvector
 | |
|         --------------------------------------------------
 | |
|          'go':5 'like':4 'learn':1 'school':7 'tsearch':2
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Notice here that words like "tsearch" are still parsed and
 | |
| indexed in the tsvector column. There is a lexem returned for the
 | |
| word becuase in the configuration mapping table, we specify words
 | |
| to be used from the 'en_ispell' dictionary first, but as a fallback
 | |
| to use the 'en_stem' dictionary. Therefore a lexem is not returned
 | |
| from en_ispell, but is returned from en_stem, and added to the
 | |
| tsvector.</p>
 | |
| <pre>
 | |
|  SELECT to_tsvector('learning tsearch is like going to computer school');
 | |
|                                 to_tsvector
 | |
|         ---------------------------------------------------------------------------
 | |
|          'go':5 'like':4 'learn':1 'school':8 'compute':7 'tsearch':2 'computer':7
 | |
|         (1 row)
 | |
| </pre>
 | |
| <p>Notice in this last example I added the word "computer" to the
 | |
| text to be converted into a tsvector. Because we have setup our
 | |
| default configuration to use the ISpell english dictionary, the
 | |
| words are lexized, and computer returns 2 lexems at the same
 | |
| position. 'compute':7 and 'computer':7 are now both indexed for the
 | |
| word computer.</p>
 | |
| <p>You can create additional dictionary lists, or use the extra
 | |
| large dictionary from ISpell. You can read through the ISpell
 | |
| documents, and source tree to make modifications as you see
 | |
| fit.</p>
 | |
| <p>In the case that you already have a configuration set for the
 | |
| locale, and you are changing it to your new dictionary
 | |
| configuration. You will have to set the old locale to NULL. If we
 | |
| are using the 'C' locale then we would do this:</p>
 | |
| 
 | |
| <pre>
 | |
|         UPDATE pg_ts_cfg SET locale=NULL WHERE locale = 'C';
 | |
| </pre>
 | |
| <p>That about wraps up the configuration of tsearch2. There is much
 | |
| more you can do with the tables provided. This was just an
 | |
| introduction to get things working rather quickly.</p>
 | |
| <h3>ADDING NEW DICTIONARIES TO TSEARCH2</h3>
 | |
| <p>To aid in the addition of new dictionaries to the tsearch2
 | |
| module you can use another additional module in combination with
 | |
| tsearch2. The gendict module is included into tsearch2 distribution
 | |
| and is available from gendict/ subdirectory.</p>
 | |
| <p>I will not go into detail about installation and instructions on
 | |
| how to use gendict to it's fullest extent right now. You can read
 | |
| the README.gendict ... it has all of the instructions and
 | |
| information you will need.</p>
 | |
| <h3>BACKING UP AND RESTORING DATABASES THAT FEATURE TSEARCH2</h3>
 | |
| <p><strong>Never rely on anyone elses instructions to backup and
 | |
| restore a database system, always develop and understand your own
 | |
| methodology, and test it numerous times before you need to do it
 | |
| for real.</strong></p>
 | |
| <p>The backup and restore procedure has changed over time. This is
 | |
| not meant to be the bible for tsearch2 back up and restore. Please
 | |
| read all sections so you have a complete understanding of some
 | |
| backup and restore issues. Please test your own procedures, and do
 | |
| not rely on these instructions solely.</p>
 | |
| 
 | |
| <p>If you come accross some issues in your own procedures, please
 | |
| feel free to bring the question up on the Open-FTS, and PostgreSQL
 | |
| mailing lists.</p>
 | |
| <h3>ORIGINAL BACKUP PROCEDURES</h3>
 | |
| <p>Originally, tsearch2 had problems when using the pg_dump, and or
 | |
| the pg_dumpall utilities. The problem lies within the original use
 | |
| of OIDs for column types. Since OIDs are not consistent accross
 | |
| pg_dumps, when you reload the data values into the pg_ts_dict
 | |
| table, for example, those oids no longer point to anything. You
 | |
| would then end up trying to use a "broken" tsearch2
 | |
| configuration.</p>
 | |
| <p>The solution was to backup and restore a database using the
 | |
| tsearch2 module into small unique parts, and then load them in the
 | |
| correct order. You would have to edit the schema and remove the
 | |
| tsearch stored procedure references in the sql file. You would have
 | |
| to load your global objects, then the tsearch2 objects. You had to
 | |
| re-create the tsearch module before restoring your schema so no
 | |
| conflicts would arise. Then you could restore your data (all
 | |
| schemas, and types needed for the data were now available).</p>
 | |
| <p><strong>The original backup instructions were as
 | |
| follows</strong></p>
 | |
| <p>1) Backup any global database objects such as users and groups
 | |
| (this step is usually only necessary when you will be restoring to
 | |
| a virgin system)</p>
 | |
| <pre>
 | |
|         pg_dumpall -g > GLOBALobjects.sql
 | |
| 
 | |
| </pre>
 | |
| <p>2) Backup the full database schema using pg_dump</p>
 | |
| <pre>
 | |
|         pg_dump -s DATABASE > DATABASEschema.sql
 | |
| </pre>
 | |
| <p>3) Backup the full database using pg_dump</p>
 | |
| <pre>
 | |
|         pg_dump -Fc DATABASE > DATABASEdata.tar
 | |
| </pre>
 | |
| 
 | |
| <p><strong>The original restore procedures were as
 | |
| follows</strong></p>
 | |
| <p>1) Create the blank database</p>
 | |
| <pre>
 | |
|         createdb DATABASE
 | |
| </pre>
 | |
| <p>2) Restore any global database objects such as users and groups
 | |
| (this step is usually only necessary when you will be restoring to
 | |
| a virgin system)</p>
 | |
| <pre>
 | |
|         psql DATABASE < GLOBALobjects.sql
 | |
| </pre>
 | |
| <p>3) Create the tsearch2 objects, functions and operators</p>
 | |
| 
 | |
| <pre>
 | |
|         psql DATABASE < tsearch2.sql
 | |
| </pre>
 | |
| <p>4) Edit the backed up database schema and delete all SQL
 | |
| commands which create tsearch2 related functions, operators and
 | |
| data types, BUT NOT fields in table definitions that specify
 | |
| tsvector types. If your not sure what these are, they are the ones
 | |
| listed in tsearch2.sql. Then restore the edited schema to the
 | |
| database</p>
 | |
| <pre>
 | |
|         psql DATABASE < DATABASEschema.sql
 | |
| </pre>
 | |
| <p>5) Restore the data for the database</p>
 | |
| <pre>
 | |
| 
 | |
|         pg_restore -N -a -d DATABASE DATABASEdata.tar
 | |
| </pre>
 | |
| <p>If you get any errors in step 4, it will most likely be because
 | |
| you forgot to remove an object that was created in tsearch2.sql.
 | |
| Any errors in step 5 will mean the database schema was probably
 | |
| restored wrongly.</p>
 | |
| <p><strong>Issues with this procedure</strong></p>
 | |
| <p>As I mentioned before, it is vital that you test out your own
 | |
| backup and restore procedures. These procedures were originally
 | |
| adopted from this document's orignal author. Robert John Shepherd.
 | |
| It makes use of the pg_dump custom archive functionality. I am not
 | |
| that familiar with the formatting output of pg_dump, and using
 | |
| pg_restore. I have always had the luxury of using text files
 | |
| (Everything is DATABASE.sql).</p>
 | |
| <p>One issue not forseen in the case of using a binary dump is the
 | |
| when you have added more than the default tsearch2 configurations.
 | |
| Upon reload of the data it will fail due to duplicate primary keys.
 | |
| If you load the tsearch2 module, and then delete the data loaded by
 | |
| tsearch2 into the configuration tables, the data will restore. The
 | |
| configurations are incorrect because you can not remove the data
 | |
| using OID references from the custom archive.</p>
 | |
| <p>It would be very simple to fix this problem if the data was not
 | |
| in an archive format. I do believe all of your data would have been
 | |
| restored properly and you can get things working fairly easy. All
 | |
| one would have to do is create the configurations as in the
 | |
| tsearch2.sql file. And then create your custom configurations
 | |
| again.</p>
 | |
| <p>I have read in the pg_dump man page that if the tar archive
 | |
| format is used, it is possible to limit which data is restored
 | |
| using pg_restore. If anyone has more experience with pg_dump
 | |
| archives, and pg_restore. Please feel free to test and contribute
 | |
| your procedure(s).</p>
 | |
| <h3>CURRENT BACKUP AND RESTORE PROCEDURES</h3>
 | |
| 
 | |
| <p>Currently a backup and restore of a database using the tsearch2
 | |
| module can be quite simple. If you have applied the patch mentioned
 | |
| in the installation instructions prior to tsearch2 installation.
 | |
| This patch removes the use of the oid column. The text
 | |
| representation for the stored procedures used are dumped with the
 | |
| data and the restoration of the data works seemlessly.</p>
 | |
| <p>1) to backup the database</p>
 | |
| <pre>
 | |
|   pg_dump DATABASE > DATABASE.sql
 | |
| </pre>
 | |
| <p>1) to restore the database</p>
 | |
| <pre>
 | |
|      createdb DATABASE
 | |
|         psql -d DATABASE -f DATABASE.sql
 | |
| </pre>
 | |
| <p>This procedure is now like any normal backup and restore
 | |
| procedure. I can say whether this has been proven using the pg_dump
 | |
| archive, and restoring with pg_restore. In theory there should be
 | |
| no problems with any format after the patch is applied.</p>
 | |
| 
 | |
| <p>This restoration procedure should never be an issue with the
 | |
| patch applied to version 8.0 of PostgreSQL. Only versions 7.3 and
 | |
| 7.4 are affected. You can avoid any troubles by applying the patch
 | |
| prior to installation, or running the SQL script provided to live
 | |
| database before backup and restoring is done.</p>
 | |
| </div>
 | |
| </body>
 | |
| </html>
 |