1
0
mirror of https://github.com/postgres/postgres.git synced 2025-07-30 11:03:19 +03:00

Support indexing of regular-expression searches in contrib/pg_trgm.

This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.

Currently, only GIN indexes are supported.  We might be able to make
it work with GiST indexes later.

The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.

Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
This commit is contained in:
Tom Lane
2013-04-09 01:05:55 -04:00
parent e60d20a35e
commit 3ccae48f44
17 changed files with 2865 additions and 43 deletions

View File

@ -9,7 +9,7 @@
<para>
The <filename>pg_trgm</filename> module provides functions and operators
for determining the similarity of <acronym>ASCII</>
for determining the similarity of
alphanumeric text based on trigram matching, as
well as index operator classes that support fast searching for similar
strings.
@ -28,7 +28,9 @@
<note>
<para>
A string is considered to have two spaces
<filename>pg_trgm</filename> ignores non-word characters
(non-alphanumerics) when extracting trigrams from a string.
Each word is considered to have two spaces
prefixed and one space suffixed when determining the set
of trigrams contained in the string.
For example, the set of trigrams in the string
@ -37,6 +39,16 @@
<quote><literal> ca</literal></quote>,
<quote><literal>cat</literal></quote>, and
<quote><literal>at </literal></quote>.
The set of trigrams in the string
<quote><literal>foo|bar</literal></quote> is
<quote><literal> f</literal></quote>,
<quote><literal> fo</literal></quote>,
<quote><literal>foo</literal></quote>,
<quote><literal>oo </literal></quote>,
<quote><literal> b</literal></quote>,
<quote><literal> ba</literal></quote>,
<quote><literal>bar</literal></quote>, and
<quote><literal>ar </literal></quote>.
</para>
</note>
</sect2>
@ -145,9 +157,10 @@
operator classes that allow you to create an index over a text column for
the purpose of very fast similarity searches. These index types support
the above-described similarity operators, and additionally support
trigram-based index searches for <literal>LIKE</> and <literal>ILIKE</>
queries. (These indexes do not support equality nor simple comparison
operators, so you may need a regular B-tree index too.)
trigram-based index searches for <literal>LIKE</>, <literal>ILIKE</>,
<literal>~</> and <literal>~*</> queries. (These indexes do not
support equality nor simple comparison operators, so you may need a
regular B-tree index too.)
</para>
<para>
@ -202,6 +215,26 @@ SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
searches, the search string need not be left-anchored.
</para>
<para>
Beginning in <productname>PostgreSQL</> 9.3, <filename>pg_trgm</filename>
GIN indexes also support index searches for regular-expression matches
(<literal>~</> and <literal>~*</> operators), for example
<programlisting>
SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
</programlisting>
The index search works by extracting trigrams from the regular expression
and then looking these up in the index. The more trigrams that can be
extracted from the regular expression, the more effective the index search
is. Unlike B-tree based searches, the search string need not be
left-anchored.
</para>
<para>
For both <literal>LIKE</> and regular-expression searches, keep in mind
that a pattern with no extractable trigrams will degenerate to a full-index
scan.
</para>
<para>
The choice between GiST and GIN indexing depends on the relative
performance characteristics of GiST and GIN, which are discussed elsewhere.