Support LIKE with nondeterministic collations

This allows for example using LIKE with case-insensitive collations. There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error. The implementation follows the specification of the SQL standard for this. Unlike with deterministic collations, the LIKE matching cannot go character by character but has to go substring by substring. For example, if we are matching against LIKE 'foo%bar', we can't start by looking for an 'f', then an 'o', but instead with have to find something that matches 'foo'. This is because the collation could consider substrings of different lengths to be equal. This is all internal to MatchText() in like_match.c. The changes in GenericMatchText() in like.c just pass through the locale information to MatchText(), which was previously not needed. This matches exactly Generic_Text_IC_like() below. ILIKE is not affected. (It's unclear whether ILIKE makes sense under nondeterministic collations.) This also updates match_pattern_prefix() in like_support.c to support optimizing the case of an exact pattern with nondeterministic collations. This was already alluded to in the previous code. (includes documentation examples from Daniel Vérité and test cases from Paul A Jungwirth) Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
2025-07-27 12:41:57 +03:00 · 2024-11-27 08:18:35 +01:00
parent 8fcd80258b
commit 85b7efa1cd
7 changed files with 458 additions and 44 deletions
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@ -1197,7 +1197,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
     to a performance penalty.  Note, in particular, that B-tree cannot use
     deduplication with indexes that use a nondeterministic collation.  Also,
     certain operations are not possible with nondeterministic collations,
-     such as pattern matching operations.  Therefore, they should be used
+     such as some pattern matching operations.  Therefore, they should be used
     only in cases where they are specifically wanted.
    </para>

--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@ -5414,9 +5414,10 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
   </caution>

   <para>
-    The pattern matching operators of all three kinds do not support
-    nondeterministic collations.  If required, apply a different collation to
-    the expression to work around this limitation.
+    <function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
+    expressions do not support nondeterministic collations.  If required, use
+    <function>LIKE</function> or apply a different collation to the expression
+    to work around this limitation.
   </para>

  <sect2 id="functions-like">
@ -5462,6 +5463,46 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>
 </programlisting>
   </para>

+   <para>
+    <function>LIKE</function> pattern matching supports nondeterministic
+    collations (see <xref linkend="collation-nondeterministic"/>), such as
+    case-insensitive collations or collations that, say, ignore punctuation.
+    So with a case-insensitive collation, one could have:
+<programlisting>
+'AbC' LIKE 'abc' COLLATE case_insensitive    <lineannotation>true</lineannotation>
+'AbC' LIKE 'a%' COLLATE case_insensitive     <lineannotation>true</lineannotation>
+</programlisting>
+    With collations that ignore certain characters or in general that consider
+    strings of different lengths equal, the semantics can become a bit more
+    complicated.  Consider these examples:
+<programlisting>
+'.foo.' LIKE 'foo' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE 'f_o' COLLATE ign_punct    <lineannotation>true</lineannotation>
+'.foo.' LIKE '_oo' COLLATE ign_punct    <lineannotation>false</lineannotation>
+</programlisting>
+    The way the matching works is that the pattern is partitioned into
+    sequences of wildcards and non-wildcard strings (wildcards being
+    <literal>_</literal> and <literal>%</literal>).  For example, the pattern
+    <literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
+    pattern <literal>_oo</literal> is partitioned into <literal>_,
+    oo</literal>.  The input string matches the pattern if it can be
+    partitioned in such a way that the wildcards match one character or any
+    number of characters respectively and the non-wildcard partitions are
+    equal under the applicable collation.  So for example, <literal>'.foo.'
+    LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
+    <literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
+    <literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
+    matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
+    ign_punct</literal>.  But <literal>'.foo.' LIKE '_oo' COLLATE
+    ign_punct</literal> is false because <literal>.foo.</literal> cannot be
+    partitioned in a way that the first character is any character and the
+    rest of the string compares equal to <literal>oo</literal>.  (Note that
+    the single-character wildcard always matches exactly one character,
+    independent of the collation.  So in this example, the
+    <literal>_</literal> would match <literal>.</literal>, but then the rest
+    of the input string won't match the rest of the pattern.)
+   </para>
+
   <para>
    <function>LIKE</function> pattern matching always covers the entire
    string.  Therefore, if it's desired to match a sequence anywhere within
@ -5503,8 +5544,9 @@ cast(-44 as bit(12))           <lineannotation>111111010100</lineannotation>

   <para>
    The key word <token>ILIKE</token> can be used instead of
-    <token>LIKE</token> to make the match case-insensitive according
-    to the active locale.  This is not in the <acronym>SQL</acronym> standard but is a
+    <token>LIKE</token> to make the match case-insensitive according to the
+    active locale.  (But this does not support nondeterministic collations.)
+    This is not in the <acronym>SQL</acronym> standard but is a
    <productname>PostgreSQL</productname> extension.
   </para>