mirror of
https://github.com/postgres/postgres.git
synced 2025-07-28 23:42:10 +03:00
Move most /contrib README files into SGML. Some still need conversion
or will never be converted.
This commit is contained in:
@ -1,144 +0,0 @@
|
||||
trgm - Trigram matching for PostgreSQL
|
||||
--------------------------------------
|
||||
|
||||
Introduction
|
||||
|
||||
This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
|
||||
|
||||
The pg_trgm contrib module provides functions and index classes
|
||||
for determining the similarity of text based on trigram
|
||||
matching.
|
||||
|
||||
Definitions
|
||||
|
||||
Trigram (or Trigraph)
|
||||
|
||||
A trigram is a set of three consecutive characters taken
|
||||
from a string. A string is considered to have two spaces
|
||||
prefixed and one space suffixed when determining the set
|
||||
of trigrams that comprise the string.
|
||||
|
||||
eg. The set of trigrams in the word "cat" is " c", " ca",
|
||||
"at " and "cat".
|
||||
|
||||
Public Functions
|
||||
|
||||
real similarity(text, text)
|
||||
|
||||
Returns a number that indicates how closely matches the two
|
||||
arguments are. A zero result indicates that the two words
|
||||
are completely dissimilar, and a result of one indicates that
|
||||
the two words are identical.
|
||||
|
||||
real show_limit()
|
||||
|
||||
Returns the current similarity threshold used by the '%'
|
||||
operator. This in effect sets the minimum similarity between
|
||||
two words in order that they be considered similar enough to
|
||||
be misspellings of each other, for example.
|
||||
|
||||
real set_limit(real)
|
||||
|
||||
Sets the current similarity threshold that is used by the '%'
|
||||
operator, and is returned by the show_limit() function.
|
||||
|
||||
text[] show_trgm(text)
|
||||
|
||||
Returns an array of all the trigrams of the supplied text
|
||||
parameter.
|
||||
|
||||
Public Operators
|
||||
|
||||
text % text (returns boolean)
|
||||
|
||||
The '%' operator returns TRUE if its two arguments have a similarity
|
||||
that is greater than the similarity threshold set by set_limit(). It
|
||||
will return FALSE if the similarity is less than the current
|
||||
threshold.
|
||||
|
||||
Public Index Operator Classes
|
||||
|
||||
gist_trgm_ops
|
||||
|
||||
The pg_trgm module comes with an index operator class that allows a
|
||||
developer to create an index over a text column for the purpose
|
||||
of very fast similarity searches.
|
||||
|
||||
To use this index, the '%' operator must be used and an appropriate
|
||||
similarity threshold for the application must be set.
|
||||
|
||||
eg.
|
||||
|
||||
CREATE TABLE test_trgm (t text);
|
||||
CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
|
||||
|
||||
At this point, you will have an index on the t text column that you
|
||||
can use for similarity searching.
|
||||
|
||||
eg.
|
||||
|
||||
SELECT
|
||||
t,
|
||||
similarity(t, 'word') AS sml
|
||||
FROM
|
||||
test_trgm
|
||||
WHERE
|
||||
t % 'word'
|
||||
ORDER BY
|
||||
sml DESC, t;
|
||||
|
||||
This will return all values in the text column that are sufficiently
|
||||
similar to 'word', sorted from best match to worst. The index will
|
||||
be used to make this a fast operation over very large data sets.
|
||||
|
||||
Tsearch2 Integration
|
||||
|
||||
Trigram matching is a very useful tool when used in conjunction
|
||||
with a text index created by the Tsearch2 contrib module. (See
|
||||
contrib/tsearch2)
|
||||
|
||||
The first step is to generate an auxiliary table containing all
|
||||
the unique words in the Tsearch2 index:
|
||||
|
||||
CREATE TABLE words AS SELECT word FROM
|
||||
stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
|
||||
|
||||
Where 'documents' is a table that has a text field 'bodytext'
|
||||
that TSearch2 is used to search. The use of the 'simple' dictionary
|
||||
with the to_tsvector function, instead of just using the already
|
||||
existing vector is to avoid creating a list of already stemmed
|
||||
words. This way, only the original, unstemmed words are added
|
||||
to the word list.
|
||||
|
||||
Next, create a trigram index on the word column:
|
||||
|
||||
CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
|
||||
or
|
||||
CREATE INDEX words_idx ON words USING gin(word gist_trgm_ops);
|
||||
|
||||
Now, a SELECT query similar to the example above can be used to
|
||||
suggest spellings for misspelled words in user search terms. A
|
||||
useful extra clause is to ensure that the similar words are also
|
||||
of similar length to the misspelled word.
|
||||
|
||||
Note: Since the 'words' table has been generated as a separate,
|
||||
static table, it will need to be periodically regenerated so that
|
||||
it remains up to date with the word list in the Tsearch2 index.
|
||||
|
||||
Authors
|
||||
|
||||
Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
|
||||
Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
|
||||
|
||||
Contributors
|
||||
|
||||
Christopher Kings-Lynne wrote this README file
|
||||
|
||||
References
|
||||
|
||||
Tsearch2 Development Site
|
||||
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
|
||||
|
||||
GiST Development Site
|
||||
http://www.sai.msu.su/~megera/postgres/gist/
|
||||
|
Reference in New Issue
Block a user