mirror of
https://github.com/postgres/postgres.git
synced 2025-07-30 11:03:19 +03:00
Support PG_UNICODE_FAST locale in the builtin collation provider.

The PG_UNICODE_FAST locale uses code point sort order (fast, memcmp-based) combined with Unicode character semantics. The character semantics are based on Unicode full case mapping.

Full case mapping can map a single codepoint to multiple codepoints, such as "ß" uppercasing to "SS". Additionally, it handles context-sensitive mappings like the "final sigma", and it uses titlecase mappings such as "Dž" when titlecasing (rather than plain uppercase mappings).

Importantly, the uppercasing of "ß" as "SS" is specifically mentioned by the SQL standard. In Postgres, UCS_BASIC uses plain ASCII semantics for case mapping and pattern matching, so if we changed it to use the PG_UNICODE_FAST locale, it would offer better compliance with the standard. For now, though, do not change the behavior of UCS_BASIC.

Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite
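
The case-mapping and sort-order rules described above can be illustrated outside the database. The sketch below is not PostgreSQL code: it assumes Python's built-in str methods as a stand-in, since they also follow Unicode full case mapping, and the variable names are made up for illustration. Behavior inside Postgres should be checked against a server running with the PG_UNICODE_FAST locale.

```python
# Illustration only: Python's str methods stand in for the Unicode rules
# named in the commit message. This is not how PostgreSQL implements them.

# Full case mapping may expand one code point into several.
sharp_s_upper = "\u00df".upper()  # U+00DF LATIN SMALL LETTER SHARP S

# Context-sensitive mapping: a word-final capital sigma (U+03A3) lowercases
# to the final form U+03C2 rather than U+03C3.
final_sigma = "\u039f\u0394\u039f\u03a3".lower()

# Titlecase mapping differs from uppercase mapping for digraphs such as
# U+01C6 (small dz with caron).
dz_title = "\u01c6".title()
dz_upper = "\u01c6".upper()

# Code point order: comparing UTF-8 bytes (what a memcmp-style comparison
# sees) yields the same order as comparing code points.
words = ["zebra", "Zebra", "\u00e4pfel", "apple", "\u03a9mega"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
```

Note that code point order places all ASCII uppercase letters before lowercase ones and non-ASCII letters after both, which is what "rather than natural language order" means in practice.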
@@ -377,8 +377,9 @@ initdb --locale-provider=icu --icu-locale=en
 <listitem>
 <para>
 The <literal>builtin</literal> provider uses built-in operations. Only
-the <literal>C</literal> and <literal>C.UTF-8</literal> locales are
-supported for this provider.
+the <literal>C</literal>, <literal>C.UTF-8</literal>, and
+<literal>PG_UNICODE_FAST</literal> locales are supported for this
+provider.
 </para>
 <para>
 The <literal>C</literal> locale behavior is identical to the
@@ -392,6 +393,13 @@ initdb --locale-provider=icu --icu-locale=en
 regular expression character classes are based on the "POSIX
 Compatible" semantics, and the case mapping is the "simple" variant.
 </para>
+<para>
+The <literal>PG_UNICODE_FAST</literal> locale is available only when
+the database encoding is <literal>UTF-8</literal>, and the behavior is
+based on Unicode. The collation uses the code point values only. The
+regular expression character classes are based on the "Standard"
+semantics, and the case mapping is the "full" variant.
+</para>
 </listitem>
 </varlistentry>
 
@@ -886,6 +894,23 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 </listitem>
 </varlistentry>
 
+<varlistentry>
+<term><literal>pg_unicode_fast</literal></term>
+<listitem>
+<para>
+This collation sorts by Unicode code point values rather than natural
+language order. For the functions <function>lower</function>,
+<function>initcap</function>, and <function>upper</function> it uses
+Unicode full case mapping. For pattern matching (including regular
+expressions), it uses the Standard variant of Unicode <ulink
+url="https://www.unicode.org/reports/tr18/#Compatibility_Properties">Compatibility
+Properties</ulink>. Behavior is efficient and stable within a
+<productname>Postgres</productname> major version. It is only
+available for encoding <literal>UTF8</literal>.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 <term><literal>pg_c_utf8</literal></term>
 <listitem>
@@ -99,7 +99,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 <para>
 If <replaceable>provider</replaceable> is <literal>builtin</literal>,
 then <replaceable>locale</replaceable> must be specified and set to
-either <literal>C</literal> or <literal>C.UTF-8</literal>.
+either <literal>C</literal>, <literal>C.UTF-8</literal> or
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>
@@ -168,7 +168,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 If <xref linkend="create-database-locale-provider"/> is
 <literal>builtin</literal>, then <replaceable>locale</replaceable> or
 <replaceable>builtin_locale</replaceable> must be specified and set to
-either <literal>C</literal> or <literal>C.UTF-8</literal>.
+either <literal>C</literal>, <literal>C.UTF-8</literal>, or
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 <tip>
 <para>
@@ -233,7 +234,8 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
 </para>
 <para>
 The locales available for the <literal>builtin</literal> provider are
-<literal>C</literal> and <literal>C.UTF-8</literal>.
+<literal>C</literal>, <literal>C.UTF-8</literal> and
+<literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>
@@ -295,8 +295,8 @@ PostgreSQL documentation
 <para>
 If <option>--locale-provider</option> is <literal>builtin</literal>,
 <option>--locale</option> or <option>--builtin-locale</option> must be
-specified and set to <literal>C</literal> or
-<literal>C.UTF-8</literal>.
+specified and set to <literal>C</literal>, <literal>C.UTF-8</literal>
+or <literal>PG_UNICODE_FAST</literal>.
 </para>
 </listitem>
 </varlistentry>