mirror of
https://github.com/postgres/postgres.git
synced 2025-07-27 12:41:57 +03:00
Rework word_similarity documentation, make it close to actual algorithm.
word_similarity was previously documented as returning the similarity of the closest word in the string, but it actually returns the similarity of a substring. Also fix mistyped comments.

Author: Alexander Korotkov
Reviewed by: David Steele, Liudmila Mantrova
Discussion: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com
Discussion: https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 lastpos[trgindex] = i;
 }
 
-/* Adjust lower bound if this trigram is present in required substring */
+/* Adjust upper bound if this trigram is present in required substring */
 if (found[trgindex])
 {
 int prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 
 smlr_cur = CALCSML(count, ulen1, ulen2);
 
-/* Also try to adjust upper bound for greater similarity */
+/* Also try to adjust lower bound for greater similarity */
 tmp_count = count;
 tmp_ulen2 = ulen2;
 prev_lower = lower;
@@ -99,12 +99,10 @@
 </entry>
 <entry><type>real</type></entry>
 <entry>
-Returns a number that indicates how similar the first string
-to the most similar word of the second string. The function searches in
-the second string a most similar word not a most similar substring. The
-range of the result is zero (indicating that the two strings are
-completely dissimilar) to one (indicating that the first string is
-identical to one of the words of the second string).
+Returns a number that indicates the greatest similarity between
+the set of trigrams in the first string and any continuous extent
+of an ordered set of trigrams in the second string. For details, see
+the explanation below.
 </entry>
 </row>
 <row>
@@ -131,6 +129,34 @@
 </tgroup>
 </table>
 
+<para>
+Consider the following example:
+
+<programlisting>
+# SELECT word_similarity('word', 'two words');
+word_similarity
+-----------------
+0.8
+(1 row)
+</programlisting>
+
+In the first string, the set of trigrams is
+<literal>{"  w"," wo","ord","wor","rd "}</literal>.
+In the second string, the ordered set of trigrams is
+<literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
+The most similar extent of an ordered set of trigrams in the second string
+is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
+<literal>0.8</literal>.
+</para>
+
+<para>
+This function returns a value that can be approximately understood as the
+greatest similarity between the first string and any substring of the second
+string. However, this function does not add padding to the boundaries of
+the extent. Thus, a whole word match gets a higher score than a match with
+a part of the word.
+</para>
+
 <table id="pgtrgm-op-table">
 <title><filename>pg_trgm</filename> Operators</title>
 <tgroup cols="3">
@@ -156,10 +182,11 @@
 <entry><type>text</> <literal>&lt;%</literal> <type>text</></entry>
 <entry><type>boolean</type></entry>
 <entry>
-Returns <literal>true</> if its first argument has the similar word in
-the second argument and they have a similarity that is greater than the
-current word similarity threshold set by
-<varname>pg_trgm.word_similarity_threshold</> parameter.
+Returns <literal>true</literal> if the similarity between the trigram
+set in the first argument and a continuous extent of an ordered trigram
+set in the second argument is greater than the current word similarity
+threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
+parameter.
 </entry>
 </row>
 <row>
@@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</>', t) AS sml
 WHERE '<replaceable>word</>' &lt;% t
 ORDER BY sml DESC, t;
 </programlisting>
-This will return all values in the text column that have a word
-which sufficiently similar to <replaceable>word</>, sorted from best
-match to worst. The index will be used to make this a fast operation
-even over very large data sets.
+This will return all values in the text column for which there is a
+continuous extent in the corresponding ordered trigram set that is
+sufficiently similar to the trigram set of <replaceable>word</replaceable>,
+sorted from best match to worst. The index will be used to make this
+a fast operation even over very large data sets.
 </para>
 
 <para>
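
The extent-based definition in the new documentation text can be checked with a brute-force sketch. The Python below is an illustration only, not the optimized loop in iterate_word_similarity: the names trigrams and word_similarity_sketch are hypothetical, the tokenizer assumes ASCII alphanumeric words, and the padding (two spaces before each word, one after) and the shared/union ratio are assumed to mirror pg_trgm's default behavior.

```python
import re


def trigrams(text):
    """Return the ordered list of trigrams for text.

    Assumption: each lowercased ASCII alphanumeric word is padded with
    two spaces in front and one space behind before being split.
    """
    out = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        out.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return out


def word_similarity_sketch(first, second):
    """Greatest similarity between the trigram set of `first` and any
    continuous extent of the ordered trigrams of `second`.

    This tries every extent by brute force; it is a sketch of the
    documented definition, not the algorithm used in trgm_op.c.
    """
    set1 = set(trigrams(first))
    trg2 = trigrams(second)
    best = 0.0
    for lower in range(len(trg2)):
        extent = set()
        for upper in range(lower, len(trg2)):
            extent.add(trg2[upper])
            shared = len(set1 & extent)
            # Shared trigrams divided by the size of the union,
            # the same ratio the CALCSML call in the diff is assumed to compute.
            best = max(best, shared / (len(set1) + len(extent) - shared))
    return best


# The documented example: four shared trigrams, union of five.
print(word_similarity_sketch("word", "two words"))  # 0.8
```

Under the same assumptions, word_similarity_sketch('word', 'two word') returns 1.0, because the extent can then cover all five trigrams of the first string, including the trailing-space trigram. That is the sense in which a whole word match scores higher than a match with part of a word.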