mirror of
https://github.com/postgres/postgres.git
synced 2025-07-28 23:42:10 +03:00
Phrase full text search.
The patch introduces a new text search operator (<-> or <DISTANCE>) into tsquery. The on-disk and binary in/out formats of tsquery are backward compatible. It has two side effects:
- the sort order of tsquery changes, so users who have a btree index over tsquery should reindex it
- tsquery output contains fewer parentheses, which makes it more readable

Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
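As a rough illustration of the behavior described in the commit message (a sketch, not part of the patch itself): the two `@@` queries and their results mirror the documentation example added by this commit, while the index name `tsq_btree_idx` in the `REINDEX` statement is a hypothetical placeholder.

```sql
-- Phrase match: 'fatal' must be directly followed by 'error'.
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');         -- t
SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');  -- f

-- The sort order of tsquery changes with this patch, so a btree index
-- over a tsquery column should be rebuilt after upgrading.
-- tsq_btree_idx is a hypothetical index name used only for illustration.
REINDEX INDEX tsq_btree_idx;
```

Only btree indexes whose key is a tsquery value depend on the changed ordering; the commit message does not mention any need to rebuild other index types.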
@@ -3924,8 +3924,9 @@ SELECT to_tsvector('english', 'The Fat Rats');
 <para>
 A <type>tsquery</type> value stores lexemes that are to be
 searched for, and combines them honoring the Boolean operators
-<literal>&</literal> (AND), <literal>|</literal> (OR), and
-<literal>!</> (NOT). Parentheses can be used to enforce grouping
+<literal>&</literal> (AND), <literal>|</literal> (OR),
+<literal>!</> (NOT) and <literal><-></> (FOLLOWED BY) phrase search
+operator. Parentheses can be used to enforce grouping
 of the operators:
 
 <programlisting>
@@ -3946,8 +3947,8 @@ SELECT 'fat & rat & ! cat'::tsquery;
 </programlisting>
 
 In the absence of parentheses, <literal>!</> (NOT) binds most tightly,
-and <literal>&</literal> (AND) binds more tightly than
-<literal>|</literal> (OR).
+and <literal>&</literal> (AND) and <literal><-></literal> (FOLLOWED BY)
+both bind more tightly than <literal>|</literal> (OR).
 </para>
 
 <para>
@@ -9127,6 +9127,12 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 <entry><literal>!! 'cat'::tsquery</literal></entry>
 <entry><literal>!'cat'</literal></entry>
 </row>
+<row>
+<entry> <literal><-></literal> </entry>
+<entry><type>tsquery</> followed by <type>tsquery</></entry>
+<entry><literal>to_tsquery('fat') <-> to_tsquery('rat')</literal></entry>
+<entry><literal>'fat' <-> 'rat'</literal></entry>
+</row>
 <row>
 <entry> <literal>@></literal> </entry>
 <entry><type>tsquery</> contains another ?</entry>
@@ -9219,6 +9225,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 <entry><literal>plainto_tsquery('english', 'The Fat Rats')</literal></entry>
 <entry><literal>'fat' & 'rat'</literal></entry>
 </row>
+<row>
+<entry>
+<indexterm>
+<primary>phraseto_tsquery</primary>
+</indexterm>
+<literal><function>phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</> <type>regconfig</> , </optional> <replaceable class="PARAMETER">query</> <type>text</type>)</function></literal>
+</entry>
+<entry><type>tsquery</type></entry>
+<entry>produce <type>tsquery</> ignoring punctuation</entry>
+<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
+<entry><literal>'fat' <-> 'rat'</literal></entry>
+</row>
 <row>
 <entry>
 <indexterm>
@@ -9421,6 +9439,27 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
 <entry><literal>SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases')</literal></entry>
 <entry><literal>'b' & ( 'foo' | 'bar' )</literal></entry>
 </row>
+<row>
+<entry>
+<indexterm>
+<primary>tsquery_phrase</primary>
+</indexterm>
+<literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>)</function></literal>
+</entry>
+<entry><type>tsquery</type></entry>
+<entry>implementation of <literal><-></> (FOLLOWED BY) operator</entry>
+<entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'))</literal></entry>
+<entry><literal>'fat' <-> 'cat'</literal></entry>
+</row>
+<row>
+<entry>
+<literal><function>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">distance</replaceable> <type>integer</>)</function></literal>
+</entry>
+<entry><type>tsquery</type></entry>
+<entry>phrase-concatenate with distance</entry>
+<entry><literal>tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10)</literal></entry>
+<entry><literal>'fat' <10> 'cat'</literal></entry>
+</row>
 <row>
 <entry>
 <indexterm>
@@ -263,9 +263,10 @@ SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t
 As the above example suggests, a <type>tsquery</type> is not just raw
 text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
 contains search terms, which must be already-normalized lexemes, and
-may combine multiple terms using AND, OR, and NOT operators.
+may combine multiple terms using AND, OR, NOT and FOLLOWED BY operators.
 (For details see <xref linkend="datatype-textsearch">.) There are
-functions <function>to_tsquery</> and <function>plainto_tsquery</>
+functions <function>to_tsquery</>, <function>plainto_tsquery</>
+and <function>phraseto_tsquery</>
 that are helpful in converting user-written text into a proper
 <type>tsquery</type>, for example by normalizing words appearing in
 the text. Similarly, <function>to_tsvector</> is used to parse and
@@ -293,6 +294,35 @@ SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
 already normalized, so <literal>rats</> does not match <literal>rat</>.
 </para>
 
+<para>
+Phrase search is made possible with the help of the <literal><-></>
+(FOLLOWED BY) operator, which enforces lexeme order. This allows you
+to discard strings not containing the desired phrase, for example:
+
+<programlisting>
+SELECT q @@ to_tsquery('fatal <-> error')
+FROM unnest(array[to_tsvector('fatal error'),
+to_tsvector('error is not fatal')]) AS q;
+?column?
+----------
+t
+f
+</programlisting>
+
+A more generic version of the FOLLOWED BY operator takes form of
+<literal><N></>, where N stands for the greatest allowed distance
+between the specified lexemes. The <literal>phraseto_tsquery</>
+function makes use of this behavior in order to construct a
+<literal>tsquery</> capable of matching the provided phrase:
+
+<programlisting>
+SELECT phraseto_tsquery('cat ate some rats');
+phraseto_tsquery
+-------------------------------
+( 'cat' <-> 'ate' ) <2> 'rat'
+</programlisting>
+</para>
+
 <para>
 The <literal>@@</literal> operator also
 supports <type>text</type> input, allowing explicit conversion of a text
@@ -709,11 +739,14 @@ UPDATE tt SET ti =
 
 <para>
 <productname>PostgreSQL</productname> provides the
-functions <function>to_tsquery</function> and
-<function>plainto_tsquery</function> for converting a query to
-the <type>tsquery</type> data type. <function>to_tsquery</function>
-offers access to more features than <function>plainto_tsquery</function>,
-but is less forgiving about its input.
+functions <function>to_tsquery</function>,
+<function>plainto_tsquery</function> and
+<function>phraseto_tsquery</function>
+for converting a query to the <type>tsquery</type> data type.
+<function>to_tsquery</function> offers access to more features
+than both <function>plainto_tsquery</function> and
+<function>phraseto_tsquery</function>, but is less forgiving
+about its input.
 </para>
 
 <indexterm>
@@ -728,7 +761,8 @@ to_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>
 <function>to_tsquery</function> creates a <type>tsquery</> value from
 <replaceable>querytext</replaceable>, which must consist of single tokens
 separated by the Boolean operators <literal>&</literal> (AND),
-<literal>|</literal> (OR) and <literal>!</literal> (NOT). These operators
+<literal>|</literal> (OR), <literal>!</literal> (NOT), and also the
+<literal><-></literal> (FOLLOWED BY) phrase search operator. These operators
 can be grouped using parentheses. In other words, the input to
 <function>to_tsquery</function> must already follow the general rules for
 <type>tsquery</> input, as described in <xref
@@ -814,8 +848,8 @@ SELECT plainto_tsquery('english', 'The Fat Rats');
 </screen>
 
 Note that <function>plainto_tsquery</> cannot
-recognize Boolean operators, weight labels, or prefix-match labels
-in its input:
+recognize Boolean and phrase search operators, weight labels,
+or prefix-match labels in its input:
 
 <screen>
 SELECT plainto_tsquery('english', 'The Fat & Rats:C');
@@ -827,6 +861,57 @@ SELECT plainto_tsquery('english', 'The Fat & Rats:C');
 Here, all the input punctuation was discarded as being space symbols.
 </para>
 
+<indexterm>
+<primary>phraseto_tsquery</primary>
+</indexterm>
+
+<synopsis>
+phraseto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
+</synopsis>
+
+<para>
+<function>phraseto_tsquery</> behaves much like
+<function>plainto_tsquery</>, with the exception
+that it utilizes the <literal><-></literal> (FOLLOWED BY) phrase search
+operator instead of the <literal>&</literal> (AND) Boolean operator.
+This is particularly useful when searching for exact lexeme sequences,
+since the phrase search operator helps to maintain lexeme order.
+</para>
+
+<para>
+Example:
+
+<screen>
+SELECT phraseto_tsquery('english', 'The Fat Rats');
+phraseto_tsquery
+------------------
+'fat' <-> 'rat'
+</screen>
+
+Just like the <function>plainto_tsquery</>, the
+<function>phraseto_tsquery</> function cannot
+recognize Boolean and phrase search operators, weight labels,
+or prefix-match labels in its input:
+
+<screen>
+SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
+phraseto_tsquery
+-----------------------------
+( 'fat' <-> 'rat' ) <-> 'c'
+</screen>
+
+It is possible to specify the configuration to be used to parse the document,
+for example, we could create a new one using the hunspell dictionary
+(namely 'eng_hunspell') in order to match phrases with different word forms:
+
+<screen>
+SELECT phraseto_tsquery('eng_hunspell', 'developer of the building which collapsed');
+phraseto_tsquery
+--------------------------------------------------------------------------------------------
+( 'developer' <3> 'building' ) <2> 'collapse' | ( 'developer' <3> 'build' ) <2> 'collapse'
+</screen>
+</para>
+
 </sect2>
 
 <sect2 id="textsearch-ranking">
@@ -1387,6 +1472,81 @@ FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
 
 </varlistentry>
 
+<varlistentry>
+
+<term>
+<literal><type>tsquery</> <-> <type>tsquery</></literal>
+</term>
+
+<listitem>
+<para>
+Returns the phrase-concatenation of the two given queries.
+
+<screen>
+SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
+?column?
+-----------------------------------
+'fat' <-> 'cat' | 'fat' <-> 'rat'
+</screen>
+</para>
+</listitem>
+
+</varlistentry>
+
+<varlistentry>
+
+<term>
+<indexterm>
+<primary>tsquery_phrase</primary>
+</indexterm>
+
+<literal>tsquery_phrase(<replaceable class="PARAMETER">query1</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">query2</replaceable> <type>tsquery</> [, <replaceable class="PARAMETER">distance</replaceable> <type>integer</> ]) returns <type>tsquery</></literal>
+</term>
+
+<listitem>
+<para>
+Returns the distanced phrase-concatenation of the two given queries.
+This function lies in the implementation of the <literal><-></> operator.
+
+<screen>
+SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
+tsquery_phrase
+------------------
+'fat' <10> 'cat'
+</screen>
+</para>
+</listitem>
+
+</varlistentry>
+
+<varlistentry>
+
+<term>
+<indexterm>
+<primary>setweight</primary>
+</indexterm>
+
+<literal>setweight(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>, <replaceable class="PARAMETER">weight</replaceable> <type>"char"</>) returns <type>tsquery</></literal>
+</term>
+
+<listitem>
+<para>
+<function>setweight</> returns a copy of the input query in which every
+position has been labeled with the given <replaceable>weight</>(s), either
+<literal>A</literal>, <literal>B</literal>, <literal>C</literal>,
+<literal>D</literal> or their combination. These labels are retained when
+queries are concatenated, allowing words from different parts of a document
+to be weighted differently by ranking functions.
+</para>
+
+<para>
+Note that weight labels apply to <emphasis>positions</>, not
+<emphasis>lexemes</>. If the input query has been stripped of
+positions then <function>setweight</> does nothing.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 
 <term>
@@ -2428,7 +2588,7 @@ more sample word(s) : more indexed word(s)
 
 <para>
 Specific stop words recognized by the subdictionary cannot be
-specified; instead use <literal>?</> to mark the location where any
+specified; instead use <literal><-></> to mark the location where any
 stop word can appear. For example, assuming that <literal>a</> and
 <literal>the</> are stop words according to the subdictionary:
 