mirror of
https://github.com/postgres/postgres.git
synced 2025-07-30 11:03:19 +03:00
Add strict_word_similarity to pg_trgm module
strict_word_similarity is similar to existing word_similarity function but it takes into account word boundaries to compute similarity. Author: Alexander Korotkov Review by: David Steele, Liudmila Mantrova, me Discussion: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
This commit is contained in:
@ -105,6 +105,17 @@
|
||||
the explanation below.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<function>strict_word_similarity(text, text)</function>
|
||||
<indexterm><primary>strict_word_similarity</primary></indexterm>
|
||||
</entry>
|
||||
<entry><type>real</type></entry>
|
||||
<entry>
|
||||
Same as <function>word_similarity(text, text)</function>, but forces
|
||||
extent boundaries to match word boundaries.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><function>show_limit()</function><indexterm><primary>show_limit</primary></indexterm></entry>
|
||||
<entry><type>real</type></entry>
|
||||
@ -157,6 +168,29 @@
|
||||
a part of the word.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
At the same time, <function>strict_word_similarity(text, text)</function>
|
||||
has to select an extent that matches word boundaries. In the example above,
|
||||
<function>strict_word_similarity(text, text)</function> would select the
|
||||
extent <literal>{" w"," wo","wor","ord","rds", ds "}</literal>, which
|
||||
corresponds to the whole word <literal>'words'</literal>.
|
||||
|
||||
<programlisting>
|
||||
# SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');
|
||||
strict_word_similarity | similarity
|
||||
------------------------+------------
|
||||
0.571429 | 0.571429
|
||||
(1 row)
|
||||
</programlisting>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Thus, the <function>strict_word_similarity(text, text)</function> function
|
||||
is useful for finding similar subsets of whole words, while
|
||||
<function>word_similarity(text, text)</function> is more suitable for
|
||||
searching similar parts of words.
|
||||
</para>
|
||||
|
||||
<table id="pgtrgm-op-table">
|
||||
<title><filename>pg_trgm</filename> Operators</title>
|
||||
<tgroup cols="3">
|
||||
@ -196,6 +230,24 @@
|
||||
Commutator of the <literal><%</literal> operator.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><type>text</type> <literal><<%</literal> <type>text</type></entry>
|
||||
<entry><type>boolean</type></entry>
|
||||
<entry>
|
||||
Returns <literal>true</literal> if its second argument has a continuous
|
||||
extent of an ordered trigram set that matches word boundaries,
|
||||
and its similarity to the trigram set of the first argument is greater
|
||||
than the current strict word similarity threshold set by the
|
||||
<varname>pg_trgm.strict_word_similarity_threshold</varname> parameter.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><type>text</type> <literal>%>></literal> <type>text</type></entry>
|
||||
<entry><type>boolean</type></entry>
|
||||
<entry>
|
||||
Commutator of the <literal><<%</literal> operator.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><type>text</type> <literal><-></literal> <type>text</type></entry>
|
||||
<entry><type>real</type></entry>
|
||||
@ -223,6 +275,25 @@
|
||||
Commutator of the <literal><<-></literal> operator.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<type>text</type> <literal><<<-></literal> <type>text</type>
|
||||
</entry>
|
||||
<entry><type>real</type></entry>
|
||||
<entry>
|
||||
Returns the <quote>distance</quote> between the arguments, that is
|
||||
one minus the <function>strict_word_similarity()</function> value.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>
|
||||
<type>text</type> <literal><->>></literal> <type>text</type>
|
||||
</entry>
|
||||
<entry><type>real</type></entry>
|
||||
<entry>
|
||||
Commutator of the <literal><<<-></literal> operator.
|
||||
</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
@ -322,12 +393,19 @@ SELECT t, t <-> '<replaceable>word</replaceable>' AS dist
|
||||
|
||||
<para>
|
||||
Also you can use an index on the <structfield>t</structfield> column for word
|
||||
similarity. For example:
|
||||
similarity or strict word similarity. Typical queries are:
|
||||
<programlisting>
|
||||
SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
|
||||
FROM test_trgm
|
||||
WHERE '<replaceable>word</replaceable>' <% t
|
||||
ORDER BY sml DESC, t;
|
||||
</programlisting>
|
||||
and
|
||||
<programlisting>
|
||||
SELECT t, strict_word_similarity('<replaceable>word</replaceable>', t) AS sml
|
||||
FROM test_trgm
|
||||
WHERE '<replaceable>word</replaceable>' <<% t
|
||||
ORDER BY sml DESC, t;
|
||||
</programlisting>
|
||||
This will return all values in the text column for which there is a
|
||||
continuous extent in the corresponding ordered trigram set that is
|
||||
@ -337,11 +415,17 @@ SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
|
||||
</para>
|
||||
|
||||
<para>
|
||||
A variant of the above query is
|
||||
Possible variants of the above queries are:
|
||||
<programlisting>
|
||||
SELECT t, '<replaceable>word</replaceable>' <<-> t AS dist
|
||||
FROM test_trgm
|
||||
ORDER BY dist LIMIT 10;
|
||||
</programlisting>
|
||||
and
|
||||
<programlisting>
|
||||
SELECT t, '<replaceable>word</replaceable>' <<<-> t AS dist
|
||||
FROM test_trgm
|
||||
ORDER BY dist LIMIT 10;
|
||||
</programlisting>
|
||||
This can be implemented quite efficiently by GiST indexes, but not
|
||||
by GIN indexes.
|
||||
|
Reference in New Issue
Block a user