Rework word_similarity documentation, make it close to actual algorithm.

word_similarity before claimed as returning similarity of closest word in string, but, actually it returns similarity of substring. Also fix mistyped comments. Author: Alexander Korotkov Review by: David Steele, Liudmila Mantrova Discussionis: https://www.postgresql.org/message-id/flat/CY4PR17MB13207ED8310F847CF117EED0D85A0@CY4PR17MB1320.namprd17.prod.outlook.com https://www.postgresql.org/message-id/flat/f43b242d-000c-f4c8-cb8b-d37e9752cd93%40postgrespro.ru
2025-07-27 12:41:57 +03:00 · 2018-03-21 14:37:18 +03:00
parent 8bcdba9a20
commit 5b1b7286c9
2 changed files with 44 additions and 16 deletions
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 			lastpos[trgindex] = i;
 		}
-		/* Adjust lower bound if this trigram is present in required substring */
+		/* Adjust upper bound if this trigram is present in required substring */
 		if (found[trgindex])
 		{
 			int			prev_lower,
@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 			smlr_cur = CALCSML(count, ulen1, ulen2);
-			/* Also try to adjust upper bound for greater similarity */
+			/* Also try to adjust lower bound for greater similarity */
 			tmp_count = count;
 			tmp_ulen2 = ulen2;
 			prev_lower = lower;
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@ -99,12 +99,10 @@
      </entry>
      <entry><type>real</type></entry>
      <entry>
-       Returns a number that indicates how similar the first string
+       Returns a number that indicates the greatest similarity between
-       to the most similar word of the second string. The function searches in
+       the set of trigrams in the first string and any continuous extent
-       the second string a most similar word not a most similar substring.  The
+       of an ordered set of trigrams in the second string.  For details, see
-       range of the result is zero (indicating that the two strings are
+       the explanation below.
       completely dissimilar) to one (indicating that the first string is
       identical to one of the words of the second string).
      </entry>
     </row>
     <row>
@ -131,6 +129,34 @@
   </tgroup>
  </table>
  <para>
   Consider the following example:
 <programlisting>
 # SELECT word_similarity('word', 'two words');
 word_similarity
 -----------------
             0.8
 (1 row)
 </programlisting>
   In the first string, the set of trigrams is
   <literal>{"  w"," wo","ord","wor","rd "}</literal>.
   In the second string, the ordered set of trigrams is
   <literal>{"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}</literal>.
   The most similar extent of an ordered set of trigrams in the second string
   is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
   <literal>0.8</literal>.
  </para>
  <para>
   This function returns a value that can be approximately understood as the
   greatest similarity between the first string and any substring of the second
   string.  However, this function does not add padding to the boundaries of
   the extent.  Thus, a whole word match gets a higher score than a match with
   a part of the word.
  </para>
  <table id="pgtrgm-op-table">
   <title><filename>pg_trgm</filename> Operators</title>
   <tgroup cols="3">
@ -156,10 +182,11 @@
      <entry><type>text</> <literal>&lt;%</literal> <type>text</></entry>
      <entry><type>boolean</type></entry>
      <entry>
-       Returns <literal>true</> if its first argument has the similar word in
+       Returns <literal>true</literal> if the similarity between the trigram
-       the second argument and they have a similarity that is greater than the
+       set in the first argument and a continuous extent of an ordered trigram
-       current word similarity threshold set by
+       set in the second argument is greater than the current word similarity
-       <varname>pg_trgm.word_similarity_threshold</> parameter.
+       threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
       parameter.
      </entry>
     </row>
     <row>
@ -302,10 +329,11 @@ SELECT t, word_similarity('<replaceable>word</>', t) AS sml
  WHERE '<replaceable>word</>' &lt;% t
  ORDER BY sml DESC, t;
 </programlisting>
-   This will return all values in the text column that have a word
+   This will return all values in the text column for which there is a
-   which sufficiently similar to <replaceable>word</>, sorted from best
+   continuous extent in the corresponding ordered trigram set that is
-   match to worst.  The index will be used to make this a fast operation
+   sufficiently similar to the trigram set of <replaceable>word</replaceable>,
-   even over very large data sets.
+   sorted from best match to worst.  The index will be used to make this
   a fast operation even over very large data sets.
  </para>
  <para>