mirror of https://github.com/postgres/postgres.git synced 2025-07-27 12:41:57 +03:00

Support C.UTF-8 locale in the new builtin collation provider.

The builtin C.UTF-8 locale has similar semantics to the libc locale of
the same name. That is, code point sort order (fast, memcmp-based)
combined with Unicode semantics for character operations such as
pattern matching, regular expressions, and
LOWER()/INITCAP()/UPPER(). The character semantics are based on
Unicode simple case mappings.
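
As an illustration of what a one-to-one ("simple") case mapping means in practice, here is a minimal Python sketch. It is not the builtin provider's implementation: `simple_upper` is a hypothetical helper that approximates simple mapping by leaving a character unchanged whenever Python's full case mapping would expand it to several characters. That matches the Unicode simple mapping for the characters shown here, but not for every code point.

```python
def simple_upper(text: str) -> str:
    """Approximate one-to-one uppercasing (hypothetical helper).

    Python's str.upper() applies full case mappings, which may expand one
    character into several. Here a character is kept as-is when that happens,
    so the result always has the same number of characters as the input.
    """
    out = []
    for ch in text:
        mapped = ch.upper()
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


full = "straße".upper()            # full mapping expands U+00DF to "SS"
simple = simple_upper("straße")    # one-to-one mapping leaves U+00DF alone
print(full, simple)
```

The length-preserving property is the practical difference: with a one-to-one mapping the output never has more characters than the input.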

The builtin provider's C.UTF-8 offers several important advantages
over libc:

 * faster sorting -- benefits from additional optimizations such as
   abbreviated keys and varstrfastcmp_c
 * faster case conversion, e.g. LOWER(), at least compared with some
   libc implementations
 * available on all platforms with identical semantics, and the
   semantics are stable, testable, and documentable within a given
   Postgres major version

Being based on memcmp, the builtin C.UTF-8 locale does not offer
natural language sort order. But it is an improvement for most use
cases that might otherwise use libc's "C.UTF-8" locale, as well as
many use cases that use libc's "C" locale.
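
The link between memcmp and code point order can be checked with a short Python sketch. It relies only on a property of the UTF-8 encoding itself (byte-wise comparison of well-formed UTF-8 preserves code point order); the sample strings are arbitrary, and none of this is taken from the Postgres sources.

```python
# Arbitrary sample strings mixing ASCII, Latin-1, CJK, and a non-BMP character.
words = ["zebra", "Zebra", "apple", "éclair", "Ünal", "日本", "\U0001F600"]

# Order by raw UTF-8 bytes, which is what a memcmp-based comparison sees.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Order by the sequence of Unicode code points.
by_code_points = sorted(words, key=lambda s: [ord(c) for c in s])

print(by_bytes)
print(by_bytes == by_code_points)
```

Both orderings agree, and the result also shows why this is not a natural language sort: "Zebra" sorts before "apple" because uppercase ASCII letters have lower code points than lowercase ones.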

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
Jeff Davis
2024-03-19 15:24:41 -07:00
parent fd0398fcb0
commit f69319f2f1
17 changed files with 494 additions and 26 deletions


@@ -377,13 +377,21 @@ initdb --locale-provider=icu --icu-locale=en
 <listitem>
 <para>
 The <literal>builtin</literal> provider uses built-in operations. Only
-the <literal>C</literal> locale is supported for this provider.
+the <literal>C</literal> and <literal>C.UTF-8</literal> locales are
+supported for this provider.
 </para>
 <para>
 The <literal>C</literal> locale behavior is identical to the
 <literal>C</literal> locale in the libc provider. When using this
 locale, the behavior may depend on the database encoding.
 </para>
+<para>
+The <literal>C.UTF-8</literal> locale is available only for when the
+database encoding is <literal>UTF-8</literal>, and the behavior is
+based on Unicode. The collation uses the code point values only. The
+regular expression character classes are based on the "POSIX
+Compatible" semantics, and the case mapping is the "simple" variant.
+</para>
 </listitem>
 </varlistentry>
@@ -878,6 +886,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 </listitem>
 </varlistentry>
+<varlistentry>
+<term><literal>pg_c_utf8</literal></term>
+<listitem>
+<para>
+This collation sorts by Unicode code point values rather than natural
+language order. For the functions <function>lower</function>,
+<function>initcap</function>, and <function>upper</function>, it uses
+Unicode simple case mapping. For pattern matching (including regular
+expressions), it uses the POSIX Compatible variant of Unicode <ulink
+url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
+Properties</ulink>. Behavior is efficient and stable within a
+<productname>Postgres</productname> major version. This collation is
+only available for encoding <literal>UTF8</literal>.
+</para>
+</listitem>
+</varlistentry>
 <varlistentry>
 <term><literal>C</literal> (equivalent to <literal>POSIX</literal>)</term>
 <listitem>


@@ -99,7 +99,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <para>
 If <replaceable>provider</replaceable> is <literal>builtin</literal>,
 then <replaceable>locale</replaceable> must be specified and set to
-<literal>C</literal>.
+either <literal>C</literal> or <literal>C.UTF-8</literal>.
 </para>
 </listitem>
 </varlistentry>


@@ -166,8 +166,9 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 </para>
 <para>
 If <xref linkend="create-database-locale-provider"/> is
-<literal>builtin</literal>, then <replaceable>locale</replaceable>
-must be specified and set to <literal>C</literal>.
+<literal>builtin</literal>, then <replaceable>locale</replaceable> or
+<replaceable>builtin_locale</replaceable> must be specified and set to
+either <literal>C</literal> or <literal>C.UTF-8</literal>.
 </para>
 <tip>
 <para>
@@ -228,9 +229,11 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 linkend="create-database-locale-provider">locale provider</link> must
 be <literal>builtin</literal>. The default is the setting of <xref
 linkend="create-database-locale"/> if specified; otherwise the same
-setting as the template database. Currently, the only available
-locale for the <literal>builtin</literal> provider is
-<literal>C</literal>.
+setting as the template database.
+</para>
+<para>
+The locales available for the <literal>builtin</literal> provider are
+<literal>C</literal> and <literal>C.UTF-8</literal>.
 </para>
 </listitem>
 </varlistentry>


@@ -288,8 +288,9 @@ PostgreSQL documentation
 </para>
 <para>
 If <option>--locale-provider</option> is <literal>builtin</literal>,
-<option>--locale</option> must be specified and set to
-<literal>C</literal>.
+<option>--locale</option> or <option>--builtin-locale</option> must be
+specified and set to <literal>C</literal> or
+<literal>C.UTF-8</literal>.
 </para>
 </listitem>
 </varlistentry>