1
0
mirror of https://github.com/postgres/postgres.git synced 2025-07-27 12:41:57 +03:00

Support LIKE with nondeterministic collations

This allows for example using LIKE with case-insensitive collations.
There was previously no internal implementation of this, so it was met
with a not-supported error.  This adds the internal implementation and
removes the error.  The implementation follows the specification of
the SQL standard for this.

Unlike with deterministic collations, the LIKE matching cannot go
character by character but has to go substring by substring.  For
example, if we are matching against LIKE 'foo%bar', we can't start by
looking for an 'f', then an 'o', but instead with have to find
something that matches 'foo'.  This is because the collation could
consider substrings of different lengths to be equal.  This is all
internal to MatchText() in like_match.c.

The changes in GenericMatchText() in like.c just pass through the
locale information to MatchText(), which was previously not needed.
This matches exactly Generic_Text_IC_like() below.

ILIKE is not affected.  (It's unclear whether ILIKE makes sense under
nondeterministic collations.)

This also updates match_pattern_prefix() in like_support.c to support
optimizing the case of an exact pattern with nondeterministic
collations.  This was already alluded to in the previous code.

(includes documentation examples from Daniel Vérité and test cases
from Paul A Jungwirth)

Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
This commit is contained in:
Peter Eisentraut
2024-11-27 08:18:35 +01:00
parent 8fcd80258b
commit 85b7efa1cd
7 changed files with 458 additions and 44 deletions

View File

@ -1197,7 +1197,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
to a performance penalty. Note, in particular, that B-tree cannot use
deduplication with indexes that use a nondeterministic collation. Also,
certain operations are not possible with nondeterministic collations,
such as pattern matching operations. Therefore, they should be used
such as some pattern matching operations. Therefore, they should be used
only in cases where they are specifically wanted.
</para>

View File

@ -5414,9 +5414,10 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
</caution>
<para>
The pattern matching operators of all three kinds do not support
nondeterministic collations. If required, apply a different collation to
the expression to work around this limitation.
<function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
expressions do not support nondeterministic collations. If required, use
<function>LIKE</function> or apply a different collation to the expression
to work around this limitation.
</para>
<sect2 id="functions-like">
@ -5462,6 +5463,46 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
</programlisting>
</para>
<para>
<function>LIKE</function> pattern matching supports nondeterministic
collations (see <xref linkend="collation-nondeterministic"/>), such as
case-insensitive collations or collations that, say, ignore punctuation.
So with a case-insensitive collation, one could have:
<programlisting>
'AbC' LIKE 'abc' COLLATE case_insensitive <lineannotation>true</lineannotation>
'AbC' LIKE 'a%' COLLATE case_insensitive <lineannotation>true</lineannotation>
</programlisting>
With collations that ignore certain characters or in general that consider
strings of different lengths equal, the semantics can become a bit more
complicated. Consider these examples:
<programlisting>
'.foo.' LIKE 'foo' COLLATE ign_punct <lineannotation>true</lineannotation>
'.foo.' LIKE 'f_o' COLLATE ign_punct <lineannotation>true</lineannotation>
'.foo.' LIKE '_oo' COLLATE ign_punct <lineannotation>false</lineannotation>
</programlisting>
The way the matching works is that the pattern is partitioned into
sequences of wildcards and non-wildcard strings (wildcards being
<literal>_</literal> and <literal>%</literal>). For example, the pattern
<literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
pattern <literal>_oo</literal> is partitioned into <literal>_,
oo</literal>. The input string matches the pattern if it can be
partitioned in such a way that the wildcards match one character or any
number of characters respectively and the non-wildcard partitions are
equal under the applicable collation. So for example, <literal>'.foo.'
LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
<literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
<literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
ign_punct</literal>. But <literal>'.foo.' LIKE '_oo' COLLATE
ign_punct</literal> is false because <literal>.foo.</literal> cannot be
partitioned in a way that the first character is any character and the
rest of the string compares equal to <literal>oo</literal>. (Note that
the single-character wildcard always matches exactly one character,
independent of the collation. So in this example, the
<literal>_</literal> would match <literal>.</literal>, but then the rest
of the input string won't match the rest of the pattern.)
</para>
<para>
<function>LIKE</function> pattern matching always covers the entire
string. Therefore, if it's desired to match a sequence anywhere within
@ -5503,8 +5544,9 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
<para>
The key word <token>ILIKE</token> can be used instead of
<token>LIKE</token> to make the match case-insensitive according
to the active locale. This is not in the <acronym>SQL</acronym> standard but is a
<token>LIKE</token> to make the match case-insensitive according to the
active locale. (But this does not support nondeterministic collations.)
This is not in the <acronym>SQL</acronym> standard but is a
<productname>PostgreSQL</productname> extension.
</para>