<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>

<title>tsearch2 reference</title></head>

<body>
<h1 align="center">The tsearch2 Reference</h1>

<p align="center">
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
<br>Massive update for 8.2 release by Oleg Bartunov, October 2006
</p>
<p>
This Reference documents the user types and functions
of the tsearch2 module for PostgreSQL.
An introduction to the module is provided
by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
a companion document to this one.
</p>

<h2>Table of Contents</h2>
<blockquote>
<a href="#vq">Vectors and Queries</a><br>
<a href="#vqo">Vector Operations</a><br>
<a href="#qo">Query Operations</a><br>
<a href="#fts">Full Text Search Operator</a><br>
<a href="#configurations">Configurations</a><br>
<a href="#testing">Testing</a><br>
<a href="#parsers">Parsers</a><br>
<a href="#dictionaries">Dictionaries</a><br>
<a href="#ranking">Ranking</a><br>
<a href="#headlines">Headlines</a><br>
<a href="#indexes">Indexes</a><br>
<a href="#tz">Thesaurus dictionary</a><br>
</blockquote>


<h2><a name="vq">Vectors and Queries</a></h2>

Vectors and queries both store lexemes,
but for different purposes.
A <tt>tsvector</tt> stores the lexemes
of the words that are parsed out of a document,
and can also remember the position of each word.
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
<p>
Any of the following functions with a <tt><i>configuration</i></tt> argument
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
to select a configuration;
if the argument is omitted, then the current configuration is used.
For more information on the current configuration,
see the section on <a href="#configurations">Configurations</a> below.
</p>

<h3><a name="vqo">Vector Operations</a></h3>

<dl><dt>
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
 <i>document</i> TEXT) RETURNS TSVECTOR</tt>
</dt><dd>
 Parses a document into tokens,
 reduces the tokens to lexemes,
 and returns a <tt>tsvector</tt> which lists the lexemes
 together with their positions in the document.
 For the best description of this process,
 see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
 in the accompanying tsearch2 Guide.
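<p>
As a minimal sketch of the two calling forms (the sample sentence is made up for illustration,
and the lexemes actually returned depend on the installed dictionaries and stop-word files):
</p>
<pre>
-- Use the current configuration
SELECT to_tsvector('A fat cat sat on a mat');

-- Name the configuration explicitly; 'simple' is one of the
-- configurations that come with tsearch2
SELECT to_tsvector('simple', 'A fat cat sat on a mat');
</pre>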
</dd><dt>
 <tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt>
</dt><dd>
 Returns a vector which lists the same lexemes
 as the given <tt><i>vector</i></tt>,
 but which lacks any information
 about where in the document each lexeme appeared.
 While the returned vector is thus useless for relevance ranking,
 it will usually be much smaller.
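<p>
A short sketch of the effect, using a made-up sentence: both columns should list the same lexemes,
but only the first is expected to carry position lists.
</p>
<pre>
SELECT to_tsvector('simple', 'a fat cat and a fat rat')        AS with_positions,
       strip(to_tsvector('simple', 'a fat cat and a fat rat')) AS stripped;
</pre>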
</dd><dt>
 <tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt>
</dt><dd>
 This function returns a copy of the input vector
 in which every position has been labeled
 with the given <tt><i>letter</i></tt>:
 <tt>'A'</tt>, <tt>'B'</tt>, <tt>'C'</tt>,
 or <tt>'D'</tt>
 (the label with which new vectors are created,
 and as such usually not displayed).
 These labels are retained when vectors are concatenated,
 allowing words from different parts of a document
 to be weighted differently by ranking functions.
</dd>
<dt>
 <tt><i>vector1</i> || <i>vector2</i></tt><BR>
 <tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR)
 RETURNS TSVECTOR</tt>
</dt><dd>
 Returns a vector which combines the lexemes and position information
 in the two vectors given as arguments.
 Position weight labels (described in the previous paragraph)
 are retained intact during the concatenation.
 This has at least two uses.
 First,
 if some sections of your document
 need to be parsed with different configurations than others,
 you can parse them separately
 and concatenate the resulting vectors into one.
 Second,
 you can weight words from some sections of your document
 more heavily than those from others by:
 parsing the sections into separate vectors;
 assigning the vectors different position labels
 with the <tt>setweight()</tt> function;
 concatenating them into a single vector;
 and then providing a <tt><i>weights</i></tt> argument
 to the <tt>rank()</tt> function
 that assigns different weights to positions with different labels.
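<p>
The second use can be sketched as follows. The table <tt>docs</tt> and its columns
<tt>title</tt> and <tt>body</tt> are hypothetical, and the weights array is only an illustration;
its order is <i>D</i>, <i>C</i>, <i>B</i>, <i>A</i>, as described for <tt>rank()</tt>.
</p>
<pre>
-- Hypothetical table: docs(title text, body text)
-- Title words are labeled 'A', body words keep the default label 'D'
SELECT title,
       rank('{0.1, 0.2, 0.4, 1.0}'::float4[],
            setweight(to_tsvector('default', title), 'A') ||
            setweight(to_tsvector('default', body),  'D'),
            to_tsquery('default', 'cat & rat')) AS relevance
FROM docs
ORDER BY relevance DESC;
</pre>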
</dd><dt>
 <tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt>
</dt><dd>
 Returns the number of lexemes stored in the vector.
</dd><dt>
 <tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt>
</dt><dd>
 Directly casting text to a <tt>tsvector</tt>
 allows you to directly inject lexemes into a vector,
 with whatever positions and position weights you choose to specify.
 The <tt><i>text</i></tt> should be formatted
 the way a vector is printed in the output of a <tt>SELECT</tt>.
 See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
 section in the Guide for details.
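<p>
A small sketch, assuming the <tt>lexeme:position</tt> notation (with an optional weight letter after the position)
that a <tt>SELECT</tt> of a vector prints; the lexemes themselves are made up:
</p>
<pre>
-- The lexemes are taken as written: the cast does no parsing,
-- stemming or stop-word removal
SELECT 'fat:2A cat:3 rat:12B'::tsvector;
</pre>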
</dd><dt>
 <tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] ], <i>text_column_nameN</i>)</tt>
 </dt><dd>
The <tt>tsearch2()</tt> trigger automatically updates <i>vector_column_name</i>;
<i>my_filter_name</i> is the name of a function used to preprocess <i>text_column_name</i>.
Any number of functions and text columns can be specified in the <tt>tsearch2()</tt> trigger.
The following rule applies:
a function is applied to all subsequent text columns until the next function occurs.
For example, the function <tt>dropatsymbol</tt> replaces every occurrence of the <tt>@</tt>
sign with a space.
<pre>
-- Filter function: replace every @ sign with a space
CREATE FUNCTION dropatsymbol(text) RETURNS text 
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

-- Keep tsvector_column up to date, passing strMessage through dropatsymbol first
CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT 
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE 
tsearch2(tsvector_column,dropatsymbol, strMessage);
</pre>
</dd>

<dt>
<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
</dt><dd>
Here <tt>statinfo</tt> is a type, defined as 
<tt>
CREATE TYPE statinfo as (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
</tt> and <i>sqlquery</i> is a query that returns a <tt>tsvector</tt> column.
<P>
Returns statistics about that <tt>tsvector</tt> column: for each <i>word</i>,
the number of documents <i>ndoc</i> in which it occurs and its total number of occurrences <i>nentry</i>
in the collection.
This is useful for checking how good your configuration is and
for finding stop-word candidates. For example, to find the 10 most frequent words:
<pre>
=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
</pre>
Optionally, one can specify <i>weight</i> to obtain statistics about words with a specific weight.
<pre>
=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
</pre>

</dd>
<dt>
<tt>TSVECTOR < TSVECTOR</tt><BR>
<tt>TSVECTOR <= TSVECTOR</tt><BR>
<tt>TSVECTOR = TSVECTOR</tt><BR>
<tt>TSVECTOR >= TSVECTOR</tt><BR>
<tt>TSVECTOR > TSVECTOR</tt>
</dt><dd>
All btree operations are defined for the <tt>tsvector</tt> type. <tt>tsvector</tt> values are compared
with each other in lexicographical order.
</dd>
</dl>

<h3><a name="qo">Query Operations</a></h3>

<dl>
<dt>
 <tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
 <i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
 Parses a query,
 which should consist of single words separated by the boolean operators
 "<tt>&</tt>" (and),
 "<tt>|</tt>" (or),
 and "<tt>!</tt>" (not),
 which can be grouped using parentheses.
 Each word is reduced to a lexeme using the current
 or specified configuration. 
 A weight class can be assigned to each lexeme entry
 to restrict the search region
 (see <tt>setweight</tt> for an explanation), for example 
 "<tt>fat:a & rats</tt>".
</dd>
<dt>
 <tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
 <i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
Transforms unformatted text into a <tt>tsquery</tt>. It is the same as <tt>to_tsquery</tt>,
but assumes the "<tt>&</tt>" boolean operator between words and does not
recognize weight classes.
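<p>
The difference between the two functions can be sketched like this (the query words are made up;
the lexemes returned depend on the configuration). The last two statements are expected to describe
the same search, since <tt>plainto_tsquery</tt> assumes "<tt>&</tt>" between the words:
</p>
<pre>
-- Operators, grouping and a weight class are written out explicitly
SELECT to_tsquery('default', 'fat:a & (rats | cats) & !mice');

-- Plain text: no operators needed
SELECT plainto_tsquery('default', 'fat rats');
SELECT to_tsquery('default', 'fat & rats');
</pre>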
</dd><dt>

 <tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
</dt><dd>
Returns the query that is actually used when searching a GiST index.
</dd><dt>
 <tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
 Directly casting text to a <tt>tsquery</tt>
 allows you to directly inject lexemes into a query,
 with whatever positions and position weight flags you choose to specify.
 The <tt><i>text</i></tt> should be formatted
 the way a query is printed in the output of a <tt>SELECT</tt>.
 See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
 section in the Guide for details.
</dd>
<dt>
 <tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
</dt><dd>
Returns the number of nodes in the query tree.
</dd><dt>
 <tt>TSQUERY && TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
Returns the AND of the two queries.
</dd><dt>
 <tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
 Returns the OR of the two queries.
</dd><dt>
 <tt>!! TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
 Returns the negation of the query.
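<p>
A brief sketch of the three operators, with <tt>numnode()</tt> applied to a combined query;
the lexemes are made up and are cast directly, so no configuration is involved:
</p>
<pre>
SELECT 'fat'::tsquery && 'cat'::tsquery;   -- both lexemes required
SELECT 'fat'::tsquery || 'cat'::tsquery;   -- either lexeme suffices
SELECT !! 'cat'::tsquery;                  -- lexeme must be absent

-- Count the nodes of a query built with the operators
SELECT numnode('fat'::tsquery && 'cat'::tsquery);
</pre>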
</dd>
<dt>
<tt>TSQUERY < TSQUERY</tt><BR>
<tt>TSQUERY <= TSQUERY</tt><BR>
<tt>TSQUERY = TSQUERY</tt><BR>
<tt>TSQUERY >= TSQUERY</tt><BR>
<tt>TSQUERY > TSQUERY</tt>
</dt><dd>
All btree operations are defined for the <tt>tsquery</tt> type. <tt>tsquery</tt> values are compared
with each other in lexicographical order.
</dd>
</dl>

<h3>Query rewriting</h3>
Query rewriting is a set of functions and operators for the <tt>tsquery</tt> type.
It allows you to control a search at query time without reindexing (in contrast to a thesaurus), for example,
to expand a search using synonyms (new york, big apple, nyc, gotham).
<P>
The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i> with <i>sample</i>.
There are three ways to use the <tt>rewrite()</tt> function. Note that the arguments of <tt>rewrite()</tt>
can be column names of type <tt>tsquery</tt>.
<pre>
create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
insert into rw values('a & b','a', 'c');
</pre>
<dl>
<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY])  RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (<i>query</i> TSQUERY,'select <i>target</i> ,<i>sample</i> from test'::text)  RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
</dl>
Two operators are defined for the <tt>tsquery</tt> type:
<dl>
<dt><tt>TSQUERY @ TSQUERY</tt></dt>
<dd>
  Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
 </dd>
 <dt><tt>TSQUERY ~ TSQUERY</tt></dt>
 <dd> 
  Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
 </dd>
</dl>
To speed up these operators, one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
<pre>
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
</pre>
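<p>
One plausible way to combine these operators with <tt>rewrite()</tt> is sketched below, using the
<tt>rw</tt> table created above. The index name <tt>rw_t_idx</tt> is made up for illustration, and whether the
planner actually uses the index depends on the size of the table.
</p>
<pre>
-- Index the rewrite targets (rw_t_idx is a hypothetical name)
CREATE INDEX rw_t_idx ON rw USING gist (t gist_tp_tsquery_ops);

-- Pass rewrite() only those rules whose target might be
-- contained in the query being rewritten
SELECT rewrite('a & b'::tsquery,
               'select t, s from rw where ''a & b''::tsquery @ t');
</pre>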

<h2><a name="fts">Full Text Search Operator</a></h2>

<dl><dt>
<tt>TSQUERY @@ TSVECTOR</tt><br>
<tt>TSVECTOR @@ TSQUERY</tt>
</dt>
<dd>
Returns <tt>TRUE</tt> if the <tt>TSVECTOR</tt> satisfies the <tt>TSQUERY</tt>, and 
<tt>FALSE</tt> otherwise.
<pre>
=# select 'cat & rat'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
 ----------
  t
=# select 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
 ----------
  f
</pre>
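<p>
In practice the operator usually appears in a <tt>WHERE</tt> clause. The sketch below assumes a
hypothetical table <tt>docs</tt> with a <tt>tsvector</tt> column named <tt>vector</tt>:
</p>
<pre>
-- Hypothetical table: docs(title text, vector tsvector)
SELECT title
FROM docs
WHERE vector @@ to_tsquery('default', 'cat & rat');
</pre>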
</dd>
</dl>

<h2><a name="configurations">Configurations</a></h2>

A configuration specifies all of the equipment necessary
to transform a document into a <tt>tsvector</tt>:
the parser that breaks its text into tokens,
and the dictionaries which then transform each token into a lexeme.
Every call to <tt>to_tsvector()</tt> or <tt>to_tsquery()</tt> (described above)
uses a configuration to perform its processing.
The following configurations come with tsearch2:

<ul>
<li><b>default</b> -- Indexes words and numbers,
 using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
 and the <i>simple</i> dictionary for all others.
</li><li><b>default_russian</b> -- Indexes words and numbers,
 using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
 and the <i>ru_stem</i> Russian Snowball dictionary for all others. It is the default
 for the <tt>ru_RU.KOI8-R</tt> locale.
</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b> but 
for the <tt>ru_RU.UTF-8</tt> locale.
</li><li><b>simple</b> -- Processes both words and numbers
 with the <i>simple</i> dictionary,
 which neither discards any stop words nor alters them.
</li></ul>

The tsearch2 module initially chooses your current configuration
by looking for your current locale in the <tt>locale</tt> field
of the <tt>pg_ts_cfg</tt> table described below.
You can manipulate the current configuration yourself with these functions:

<dl><dt>
 <tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
  ) RETURNS VOID</tt>
</dt><dd>
 Sets the current configuration used by <tt>to_tsvector</tt>
 and <tt>to_tsquery</tt>.
</dd><dt>
 <tt>show_curcfg() RETURNS INT4</tt>
</dt><dd>
 Returns the integer <tt>id</tt> of the current configuration.
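<p>
A minimal sketch of switching the configuration and checking the result. The lookup in
<tt>pg_ts_cfg</tt> is just one way to turn the returned <tt>id</tt> back into a name:
</p>
<pre>
-- Switch the session to the 'simple' configuration
SELECT set_curcfg('simple');

-- Numeric id of the configuration now in effect
SELECT show_curcfg();

-- The same, resolved to its name
SELECT ts_name FROM pg_ts_cfg WHERE id = show_curcfg();
</pre>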
</dd></dl>

<p>
Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:

</p><pre>create table pg_ts_cfg (
	id		int not  null primary key,
	ts_name		text not null,
	prs_name	text not null,
	locale		text
);</pre>

The <tt>id</tt> and <tt>ts_name</tt> are unique values
which identify the configuration;
the <tt>prs_name</tt> specifies which parser the configuration uses.
Once this parser has split document text into tokens,
the type of each resulting token --
or, more specifically, the type's <tt>tok_alias</tt>
as listed by the parser's <tt>token_type()</tt> function --
is searched for together with the configuration's <tt>ts_name</tt>
in the <tt>pg_ts_cfgmap</tt> table:

<pre>create table pg_ts_cfgmap (
	ts_name		text not null,
	tok_alias	text not null,
	dict_name	text[],
	primary key (ts_name,tok_alias)
);</pre>

Those tokens whose types are not listed are discarded.
The remaining tokens are assigned integer positions,
starting with 1 for the first token in the document,
and turned into lexemes with the help of the dictionaries
whose names are given in the <tt>dict_name</tt> array for their type.
These dictionaries are tried in order,
stopping either with the first one to return a lexeme for the token,
or discarding the token if no dictionary returns a lexeme for it.
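<p>
The mapping can be inspected, and changed, with ordinary SQL. In the sketch below the dictionary name
<tt>en_ispell</tt> is an assumption: it must already exist in <tt>pg_ts_dict</tt>
(for instance, created from <tt>ispell_template</tt>) before the update makes sense.
</p>
<pre>
-- Which dictionaries handle which token types in the 'default' configuration?
SELECT tok_alias, dict_name
FROM pg_ts_cfgmap
WHERE ts_name = 'default'
ORDER BY tok_alias;

-- Try en_ispell first and fall back to en_stem for Latin words
-- (en_ispell is assumed to be defined already)
UPDATE pg_ts_cfgmap
SET dict_name = '{en_ispell,en_stem}'
WHERE ts_name = 'default' AND tok_alias = 'lword';
</pre>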

<h2><a name="testing">Testing</a></h2>

The <tt>ts_debug</tt> function allows easy testing of your <b>current</b> configuration. 
You can test another configuration by first selecting it with the <tt>set_curcfg</tt> function.
<p>
Example:
</p><pre>apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |  tsvector    
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} | 
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'
</pre>
Here: 
<br>
<ul>
<li>ts_name - configuration name 
</li><li>tok_type - token type 
</li><li>description - human readable name of tok_type 
</li><li>token - parser's token 
</li><li>dict_name - dictionary used for the token 
</li><li>tsvector - final result</li>
</ul>


<h2><a name="parsers">Parsers</a></h2>

Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:

<pre>create table pg_ts_parser (
	prs_name	text not null,
	prs_start	regprocedure not null,
	prs_nexttoken	regprocedure not null,
	prs_end		regprocedure not null,
	prs_headline	regprocedure not null,
	prs_lextype	regprocedure not null,
	prs_comment	text
);</pre>

The <tt>prs_name</tt> uniquely identifies the parser,
while <tt>prs_comment</tt> usually describes its name and version
for the reference of users.
The other items identify the low-level functions
which make the parser operate,
and are only of interest to someone writing a parser of their own.
<p>
The tsearch2 module comes with one parser named <tt>default</tt>
which is suitable for parsing most plain text and HTML documents.
</p><p>
Each <tt><i>parser</i></tt> argument below
must designate a parser with <tt><i>prs_name</i></tt>;
the current parser is used when this argument is omitted.

</p><dl><dt>
 <tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
</dt><dd>
 Selects a current parser
 which will be used when any of the following functions
 are called without a parser as an argument.
</dd><dt>
 <tt>CREATE FUNCTION token_type(
  <em>[</em> <i>parser</i> <em>]</em>
  ) RETURNS SETOF tokentype</tt>
</dt><dd>
 Returns a table which defines and describes
 each kind of token the parser may produce as output.
 For each token type the table gives the <tt>tokid</tt>
 with which the parser labels each token of that type,
 the <tt>alias</tt> which names the token type,
 and a short description <tt>descr</tt> for the user to read.
</dd><dt>
 <tt>CREATE FUNCTION parse(
  <em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
  ) RETURNS SETOF tokenout</tt>
</dt><dd>
 Parses the given document and returns a series of records,
 one for each token produced by parsing.
 Each token includes a <tt>tokid</tt> giving its type
 and a <tt>lexem</tt> which gives its content.
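<p>
The two set-returning functions can be tried directly. This sketch names the <tt>default</tt> parser
explicitly and reuses the sample text from the Testing section:
</p>
<pre>
-- List the token types the default parser can produce
SELECT * FROM token_type('default');

-- Show the tokens the default parser finds in a document
SELECT * FROM parse('default', 'Tsearch module for PostgreSQL 7.3.3');
</pre>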
</dd></dl>

<h2><a name="dictionaries">Dictionaries</a></h2>

A dictionary is a program which accepts lexeme(s), usually those produced by a parser,
as input and returns:
<ul>
<li>an array of lexeme(s) if the input lexeme is known to the dictionary</li>
<li>an empty array if the dictionary knows the lexeme, but it is a stop word</li>
<li>NULL if the dictionary does not recognize the input lexeme</li>
</ul>
Usually dictionaries are used for normalization of words (ispell and stemmer dictionaries),
but see, for example, the <tt>intdict</tt> dictionary (available from the
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page),
which controls indexing of integers.

<P>
Among the dictionaries which come installed with tsearch2 are:

<ul>
<li><b>simple</b> simply folds uppercase letters to lowercase
 before returning the word.
</li>
<li><b>ispell_template</b> - template for ispell dictionaries.
</li>
<li><b>en_stem</b> runs an English Snowball stemmer on each word
 that attempts to reduce the various forms of a verb or noun
 to a single recognizable form.
</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> run a Russian Snowball stemmer on each word.
</li>
<li><b>synonym</b> - simple lexeme-to-lexeme replacement
</li>
<li><b>thesaurus_template</b> - template for the <a href="#tz">thesaurus dictionary</a>, which performs
phrase-to-phrase replacement
</li>
</ul>

<P>
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:

<pre>CREATE TABLE pg_ts_dict (
	dict_name	text not null,
	dict_init	regprocedure,
	dict_initoption	text,
	dict_lexize	regprocedure not null,
	dict_comment	text
);</pre>

The <tt>dict_name</tt>
serves as a unique identifier for the dictionary.
The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
but for the built-in Snowball dictionaries
it specifies a file from which stop words should be read.
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
The other fields are internal function identifiers
useful only to developers trying to implement their own dictionaries.

<blockquote>
<b>WARNING:</b> Data files used by dictionaries should be in <tt>server_encoding</tt> to 
avoid possible problems!
</blockquote>

<p>
The argument named <tt><i>dictionary</i></tt>
in each of the following functions
should be the <tt>dict_name</tt>
identifying which dictionary should be used for the operation;
if omitted then the current dictionary is used.

</p><dl><dt>
 <tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
</dt><dd>
 Selects a current dictionary for use by functions
 that do not select a dictionary explicitly.
</dd><dt>
 <tt>CREATE FUNCTION lexize(
 <em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
 RETURNS TEXT[]</tt>
</dt><dd>
 Reduces a single word to a lexeme.
 Note that lexemes are arrays of zero or more strings,
 since in some languages there might be several base words
 from which an inflected form could arise.
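<p>
A few calls that illustrate the possible kinds of result described at the start of this section.
The words are made up, and what each call returns depends on the dictionary and its stop-word file,
so no output is shown:
</p>
<pre>
-- An ordinary word: expected to come back as an array of lexeme(s)
SELECT lexize('en_stem', 'modules');

-- A likely stop word: an empty array would mean "known, but ignored"
SELECT lexize('en_stem', 'for');

-- The simple dictionary only folds case
SELECT lexize('simple', 'PostgreSQL');
</pre>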
</dd></dl>

<h3>Using dictionary templates</h3>
Templates are used to define new dictionaries, for example:
<pre>
-- Create the en_ispell dictionary by copying the init and lexize
-- functions of ispell_template and supplying its own file options
INSERT INTO pg_ts_dict
               (SELECT 'en_ispell', dict_init,
                       'DictFile="/usr/local/share/dicts/ispell/english.dict",'
                       'AffFile="/usr/local/share/dicts/ispell/english.aff",'
                       'StopFile="/usr/local/share/dicts/english.stop"',
                       dict_lexize
               FROM pg_ts_dict
               WHERE dict_name = 'ispell_template');
</pre>

<h3>Working with stop words</h3>
Ispell and Snowball stemmers treat stop words differently:
<ul>
<li>ispell normalizes the word first and then looks up the normalized form in the stop-word file</li>
<li>the Snowball stemmer first looks up the word in the stop-word file and only then does its job.
The reason is to minimize possible 'noise'.</li>
</ul>

<h2><a name="ranking">Ranking</a></h2>

Ranking attempts to measure how relevant documents are to particular queries
by inspecting the number of times each search word appears in the document,
and whether different search terms occur near each other.
Note that this information is only available in unstripped vectors --
ranking functions will only return a useful result
for a <tt>tsvector</tt> which still has position information!
<p>
Note that the ranking functions supplied are just examples and
do not belong to the tsearch2 core; you can
write your very own ranking function and/or combine additional
factors to fit your specific interest.
</p>

The two ranking functions currently available are:

<dl><dt>
 <tt>CREATE FUNCTION rank(<br>
  <em>[</em> <i>weights</i> float4[], <em>]</em>
  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
  <em>[</em> <i>normalization</i> int4 <em>]</em><br>
  ) RETURNS float4</tt>
</dt><dd>
 This is the ranking function from the old version of OpenFTS,
 and offers the ability to weight word instances more heavily
 depending on how you have classified them.
 The <i>weights</i> specify how heavily to weight each category of word:
 <pre>{<i>D-weight</i>, <i>C-weight</i>, <i>B-weight</i>, <i>A-weight</i>}</pre>
 If no weights are provided, then these defaults are used:
 <pre>{0.1, 0.2, 0.4, 1.0}</pre>
 Often weights are used to mark words from special areas of the document,
 like the title or an initial abstract,
 and make them more or less important than words in the document body.
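 <p>
 A sketch of a ranked search with an explicit weights array. The table <tt>docs</tt> and its
 columns are hypothetical, and the weight values are only an illustration; note that the array
 starts with the <i>D</i> weight and ends with the <i>A</i> weight.
 </p>
 <pre>
-- Hypothetical table: docs(title text, vector tsvector)
SELECT title,
       rank('{0.05, 0.1, 0.5, 1.0}'::float4[], vector, q) AS relevance
FROM docs, to_tsquery('default', 'cat & rat') AS q
WHERE vector @@ q
ORDER BY relevance DESC
LIMIT 10;
</pre>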
</dd><dt>
 <tt>CREATE FUNCTION rank_cd(<br>
  <em>[</em> <i>weights</i> float4[], <em>]</em>
  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
  <em>[</em> <i>normalization</i> int4 <em>]</em><br>
  ) RETURNS float4</tt>
</dt><dd>
 This function computes the cover density ranking
 for the given document <i>vector</i> and <i>query</i>,
 as described in Clarke, Cormack, and Tudhope's
 "<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
 in the 1999 <i>Information Processing and Management</i>.
</dd>
<dt>
 <tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
 </dt>
 <dd>
 Returns <tt>extents</tt>, which are the shortest non-nested sequences of words that satisfy a query. 
 Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
 In the example below there are two extents - <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
 <pre>
=# select get_covers('1:1,2,10 2:4'::tsvector,'1& 2');
get_covers
----------------------
1 {1 1 {2 2 }1 1 }2
</pre>
 </dd>

</dl>

<p>
 | 
						|
Both of these (<tt>rank(), rank_cd()</tt>) ranking functions
 | 
						|
take an integer <i>normalization</i> option
 | 
						|
that specifies whether a document's length should impact its rank.
 | 
						|
This is often desirable,
 | 
						|
since a hundred-word document with five instances of a search word
 | 
						|
is probably more relevant than a thousand-word document with five instances.
 | 
						|
The option can take the following values, which can be combined using "|" (for example, <tt>2|4</tt>) to
take several factors into account:
 | 
						|
 | 
						|
</p>
 | 
						|
<ul>
 | 
						|
<li><tt>0</tt> (the default) ignores document length.</li>
 | 
						|
<li><tt>1</tt> divides the rank by 1 + the logarithm of the length.</li>
<li><tt>2</tt> divides the rank by the length itself.</li>
<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.</li>
 | 
						|
</ul>
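<p>
A rough sketch of the normalization option in use, assuming the <tt>apod</tt> table
with a <tt>tsvector</tt> column <tt>fts</tt> used elsewhere in this Reference,
plus a hypothetical <tt>title</tt> column:
</p>
<pre>
select title,
       rank(fts, q)         as rank_raw,        -- normalization 0 (the default)
       rank(fts, q, 2)      as rank_by_length,  -- divide by document length
       rank_cd(fts, q, 2|4) as cd_combined      -- combine options 2 and 4
from apod, to_tsquery('supernovae &amp; star') as q
where fts @@ q
order by rank_by_length desc
limit 10;
</pre>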
 | 
						|
 | 
						|
<h2><a name="headlines">Headlines</a></h2>
 | 
						|
 | 
						|
<dl><dt>
 | 
						|
 <tt>CREATE FUNCTION headline(<br>
 | 
						|
  <em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
 | 
						|
  <i>document</i> text, <i>query</i> TSQUERY,
 | 
						|
  <em>[</em> <i>options</i> text <em>]</em><br>
 | 
						|
  ) RETURNS text</tt>
 | 
						|
</dt><dd>
 | 
						|
 Every form of the <tt>headline()</tt> function
 accepts a <tt>document</tt> along with a <tt>query</tt>,
 and returns one or more ellipsis-separated excerpts from the document
 in which terms from the query are highlighted.
 The configuration with which to parse the document
 can be specified by either its <i>id</i> or <i>ts_name</i>;
 if neither is specified, the current configuration is used instead.
 | 
						|
 <p>
 | 
						|
 An <i>options</i> string, if provided, should be a comma-separated list
 | 
						|
 of one or more '<i>option</i><tt>=</tt><i>value</i>' pairs.
 | 
						|
 The available options are:
 | 
						|
 </p><ul>
 | 
						|
  <li><tt>StartSel</tt>, <tt>StopSel</tt> --
 | 
						|
   the strings with which query words appearing in the document
 | 
						|
   should be delimited to distinguish them from other excerpted words.
 | 
						|
  </li><li><tt>MaxWords</tt>, <tt>MinWords</tt> --
 | 
						|
   limits on the longest and shortest headlines you will accept.
 | 
						|
  </li><li><tt>ShortWord</tt> --
 | 
						|
   this prevents your headline from beginning or ending
 | 
						|
   with a word that has this many characters or fewer.
 | 
						|
   The default value of <tt>3</tt> should eliminate most English
 | 
						|
   conjunctions and articles.
 | 
						|
  </li><li><tt>HighlightAll</tt> -- 
 | 
						|
   boolean flag; if TRUE, then the whole document will be highlighted.
 | 
						|
 </li></ul>
 | 
						|
 Any unspecified options receive these defaults:
 | 
						|
 <pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
 | 
						|
 </pre>
 | 
						|
</dd></dl>
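<p>
For illustration, a sketch of <tt>headline()</tt> called with an explicit options string.
The document text is invented, and the current configuration is assumed to handle English:
</p>
<pre>
select headline('The quick brown fox jumps over the lazy dog'::text,
                to_tsquery('fox'),
                'StartSel=&lt;, StopSel=&gt;, MaxWords=10, MinWords=5');
</pre>
<p>
Options left out of the string (here <tt>ShortWord</tt> and <tt>HighlightAll</tt>) keep their defaults.
</p>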
 | 
						|
 | 
						|
 | 
						|
<h2><a name="indexes">Indexes</a></h2>
 | 
						|
Tsearch2 supports indexed access to tsvector in order to further speed up FTS. Note that indexes are not mandatory for FTS!
 | 
						|
<ul>
 | 
						|
<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree) 
 | 
						|
<pre>    
 | 
						|
    =# create index fts_idx on apod using gist(fts);
 | 
						|
</pre>    
 | 
						|
<li>GIN - Generalized Inverted Index 
 | 
						|
<pre>       
 | 
						|
        =# create index fts_idx on apod using gin(fts);
 | 
						|
</pre>
 | 
						|
</ul>  
 | 
						|
The <b>GiST</b> index is very good for online updates, but is not as scalable as the <b>GIN</b> index,
which, in turn, is not well suited to updates. Both indexes support concurrency and recovery.
 | 
						|
 | 
						|
<h2><a name="tz">Thesaurus dictionary</a></h2>
 | 
						|
 | 
						|
<P>
 | 
						|
A thesaurus is a collection of words that includes information about the relationships between words and phrases,
i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
 | 
						|
<p>Basically, a thesaurus dictionary replaces all non-preferred terms with one preferred term and, optionally,
preserves them for indexing. The thesaurus is used during indexing, so any change to the thesaurus requires reindexing.
Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary
with <b>phrase</b> support. A thesaurus is a plain file of the following format:
 | 
						|
<pre>
 | 
						|
# this is a comment 
 | 
						|
sample word(s) : indexed word(s)
 | 
						|
...............................
 | 
						|
</pre>
 | 
						|
<ul>
 | 
						|
<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
The <tt>sample words</tt> must still be known to the subdictionary.</li>
<li>The thesaurus dictionary looks for the longest match.</li></ul>
 | 
						|
<P>
 | 
						|
TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration)
to normalize thesaurus text. Only <strong>one dictionary</strong> can be defined.
Note that the subdictionary produces an error if it cannot recognize a word.
In that case, either remove the definition line containing this word or teach the subdictionary to recognize it.
 | 
						|
</p>
 | 
						|
<p>Stop-words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e.,
only their position matters.
To break possible ties, the thesaurus applies the last definition. For example, consider
thesaurus rules (with a simple subdictionary) that share the pattern 'swsw'
('s' designates a stop-word and 'w' a known word): </p>
 | 
						|
<pre>
 | 
						|
a one the two : swsw
 | 
						|
the one a two : swsw2
 | 
						|
</pre>
 | 
						|
<p>The words 'a' and 'the' are stop-words defined in the configuration of the subdictionary.
The thesaurus treats the texts 'the one the two' and 'that one then two' as equal and will use the definition
'swsw2'.</p>
 | 
						|
<p>Like a normal dictionary, it should be assigned to specific lexeme types.
Since TZ can recognize phrases, it must remember its state and interact with the parser.
TZ uses these assignments to decide whether it should handle the next word or stop accumulating.
The author of a thesaurus should take care to configure it properly to avoid confusion.
For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like
' one 1:11' will not work, since the lexeme type <tt>digit</tt> is not assigned to the TZ.</p>
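<p>
A hedged sketch of one way to cover such a rule: list the parser's token types, then
assign the thesaurus to the numeric type as well. The alias <tt>uint</tt> and the
fallback dictionary <tt>simple</tt> are assumptions here; check the output of
<tt>token_type()</tt> for the alias your parser actually reports. The names
<tt>tz_simple</tt> and <tt>default_russian</tt> are the ones used in the Configuration examples of this section.
</p>
<pre>
-- show the token aliases of the current parser
select * from token_type();

-- assumption: unsigned integers are reported under the alias 'uint'
update pg_ts_cfgmap set dict_name='{tz_simple,simple}'
 where ts_name = 'default_russian' and tok_alias = 'uint';
</pre>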
 | 
						|
 | 
						|
<h3>Configuration</h3>
 | 
						|
 | 
						|
<p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
 | 
						|
<pre class="real">INSERT INTO pg_ts_dict
 | 
						|
               (SELECT 'tz_simple', dict_init,
 | 
						|
                        'DictFile="/path/to/tz_simple.txt",'
 | 
						|
                        'Dictionary="en_stem"',
 | 
						|
                       dict_lexize
 | 
						|
                FROM pg_ts_dict
 | 
						|
                WHERE dict_name = 'thesaurus_template');
 | 
						|
 | 
						|
</pre>
 | 
						|
<p>Here: </p>
 | 
						|
<ul>
 | 
						|
<li><tt>tz_simple</tt> is the dictionary name.</li>
<li><tt>DictFile="/path/to/tz_simple.txt"</tt> is the location of the thesaurus file.</li>
<li><tt>Dictionary="en_stem"</tt> defines the dictionary (the Snowball English stemmer) to use for thesaurus normalization. Note that the <em>en_stem</em> dictionary has its own configuration (stop-words, for example).</li>
 | 
						|
</ul>
 | 
						|
<p>Now it is possible to use <tt>tz_simple</tt> in <tt>pg_ts_cfgmap</tt>, for example: </p>
 | 
						|
<pre>
 | 
						|
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
 | 
						|
tok_alias in ('lhword', 'lword', 'lpart_hword');
 | 
						|
</pre>
 | 
						|
<h3>Examples</h3>
 | 
						|
<p>tz_simple: </p>
 | 
						|
<pre>
 | 
						|
one : 1
 | 
						|
two : 2
 | 
						|
one two : 12
 | 
						|
the one : 1
 | 
						|
one 1 : 11
 | 
						|
</pre>
 | 
						|
<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p><pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
 | 
						|
    plainto_tsquery
 | 
						|
------------------------
 | 
						|
 '1' & 'day' & 'oneday'
 | 
						|
 | 
						|
=# select plainto_tsquery('default_russian','one two day is oneday');
 | 
						|
     plainto_tsquery
 | 
						|
-------------------------
 | 
						|
 '12' & 'day' & 'oneday'
 | 
						|
 | 
						|
=# select plainto_tsquery('default_russian','the one');
 | 
						|
NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
 | 
						|
 plainto_tsquery
 | 
						|
-----------------
 | 
						|
 '1'
 | 
						|
</pre>
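<p>
The same rules apply at indexing time. As a sketch, assuming <tt>tz_simple</tt> has been
configured for <tt>default_russian</tt> as in the Configuration section, the phrase rule
<tt>one two : 12</tt> can be checked with <tt>to_tsvector</tt> (output not shown):
</p>
<pre class="real">=# select to_tsvector('default_russian', 'one two day is oneday');
</pre>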
 | 
						|
 | 
						|
Additional information about the thesaurus dictionary is available on the
<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
 | 
						|
</body></html>
 |