mirror of
https://github.com/postgres/postgres.git
synced 2025-07-30 11:03:19 +03:00
ICU support
Add a column collprovider to pg_collation that determines which library provides the collation data. The existing choices are default and libc, and this adds an icu choice, which uses the ICU4C library.

The pg_locale_t type is changed to a union that contains the provider-specific locale handles. Users of locale information are changed to look into that struct for the appropriate handle to use.

Also add a collversion column that records the version of the collation when it is created, and check at run time whether it is still the same. This detects potentially incompatible library upgrades that can corrupt indexes and other structures. This is currently only supported by ICU-provided collations.

initdb initializes the default collation set as before from the `locale -a` output but also adds all available ICU locales with a "-x-icu" appended.

Currently, ICU-provided collations can only be explicitly named collations. The global database locales are still always libc-provided.

ICU support is enabled by configure --with-icu.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
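
As a rough illustration of what the commit message describes (this sketch is not part of the commit), the new catalog columns can be inspected from SQL. It assumes a server built with --with-icu and a cluster in which initdb found ICU locales; the table test1 with a text column a is borrowed from the documentation's existing examples.

```sql
-- ICU-provided collations carry collprovider = 'i' and a recorded collversion.
SELECT collname, collprovider, collversion
FROM pg_collation
WHERE collprovider = 'i'
ORDER BY collname;

-- ICU collations have to be named explicitly in a query or column definition;
-- the database default locale is still libc-provided.
SELECT a FROM test1 ORDER BY a COLLATE "de-x-icu";
```

If the first query returns no rows, the server was probably built without ICU support or the database encoding is one that ICU does not handle.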
@@ -2020,6 +2020,14 @@
 <entry>Owner of the collation</entry>
 </row>
 
+<row>
+<entry><structfield>collprovider</structfield></entry>
+<entry><type>char</type></entry>
+<entry></entry>
+<entry>Provider of the collation: <literal>d</literal> = database
+default, <literal>c</literal> = libc, <literal>i</literal> = icu</entry>
+</row>
+
 <row>
 <entry><structfield>collencoding</structfield></entry>
 <entry><type>int4</type></entry>
@@ -2041,6 +2049,17 @@
 <entry></entry>
 <entry><symbol>LC_CTYPE</> for this collation object</entry>
 </row>
+
+<row>
+<entry><structfield>collversion</structfield></entry>
+<entry><type>text</type></entry>
+<entry></entry>
+<entry>
+Provider-specific version of the collation. This is recorded when the
+collation is created and then checked when it is used, to detect
+changes in the collation definition that could lead to data corruption.
+</entry>
+</row>
 </tbody>
 </tgroup>
 </table>
@@ -500,20 +500,46 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 <title>Managing Collations</title>
 
 <para>
-A collation is an SQL schema object that maps an SQL name to
-operating system locales. In particular, it maps to a combination
-of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>. (As
+A collation is an SQL schema object that maps an SQL name to locales
+provided by libraries installed in the operating system. A collation
+definition has a <firstterm>provider</firstterm> that specifies which
+library supplies the locale data. One standard provider name
+is <literal>libc</literal>, which uses the locales provided by the
+operating system C library. These are the locales that most tools
+provided by the operating system use. Another provider
+is <literal>icu</literal>, which uses the external
+ICU<indexterm><primary>ICU</></> library. Support for ICU has to be
+configured when PostgreSQL is built.
+</para>
+
+<para>
+A collation object provided by <literal>libc</literal> maps to a
+combination of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>
+settings. (As
 the name would suggest, the main purpose of a collation is to set
 <symbol>LC_COLLATE</symbol>, which controls the sort order. But
 it is rarely necessary in practice to have an
 <symbol>LC_CTYPE</symbol> setting that is different from
 <symbol>LC_COLLATE</symbol>, so it is more convenient to collect
 these under one concept than to create another infrastructure for
-setting <symbol>LC_CTYPE</symbol> per expression.) Also, a collation
+setting <symbol>LC_CTYPE</symbol> per expression.) Also,
+a <literal>libc</literal> collation
 is tied to a character set encoding (see <xref linkend="multibyte">).
 The same collation name may exist for different encodings.
 </para>
 
+<para>
+A collation provided by <literal>icu</literal> maps to a named collator
+provided by the ICU library. ICU does not support
+separate <quote>collate</quote> and <quote>ctype</quote> settings, so they
+are always the same. Also, ICU collations are independent of the
+encoding, so there is always only one ICU collation for a given name in a
+database.
+</para>
+
+<sect3>
+<title>Standard Collations</title>
+
 <para>
 On all platforms, the collations named <literal>default</>,
 <literal>C</>, and <literal>POSIX</> are available. Additional
@@ -527,13 +553,37 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 code byte values.
 </para>
 
+<para>
+Additionally, the SQL standard collation name <literal>ucs_basic</literal>
+is available for encoding <literal>UTF8</literal>. It is equivalent
+to <literal>C</literal> and sorts by Unicode code point.
+</para>
+</sect3>
+
+<sect3>
+<title>Predefined Collations</title>
+
 <para>
 If the operating system provides support for using multiple locales
 within a single program (<function>newlocale</> and related functions),
+or support for ICU is configured,
 then when a database cluster is initialized, <command>initdb</command>
 populates the system catalog <literal>pg_collation</literal> with
 collations based on all the locales it finds on the operating
-system at the time. For example, the operating system might
+system at the time.
+</para>
+
+<para>
+To inspect the currently available locales, use the query <literal>SELECT
+* FROM pg_collation</literal>, or the command <command>\dOS+</command>
+in <application>psql</application>.
+</para>
+
+<sect4>
+<title>libc collations</title>
+
+<para>
+For example, the operating system might
 provide a locale named <literal>de_DE.utf8</literal>.
 <command>initdb</command> would then create a collation named
 <literal>de_DE.utf8</literal> for encoding <literal>UTF8</literal>
@@ -548,13 +598,14 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 </para>
 
 <para>
-In case a collation is needed that has different values for
-<symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, a new
-collation may be created using
-the <xref linkend="sql-createcollation"> command. That command
-can also be used to create a new collation from an existing
-collation, which can be useful to be able to use
-operating-system-independent collation names in applications.
+The default set of collations provided by <literal>libc</literal> map
+directly to the locales installed in the operating system, which can be
+listed using the command <literal>locale -a</literal>. In case
+a <literal>libc</literal> collation is needed that has different values
+for <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or new
+locales are installed in the operating system after the database system
+was initialized, then a new collation may be created using
+the <xref linkend="sql-createcollation"> command.
 </para>
 
 <para>
@@ -566,8 +617,8 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
 Use of the stripped collation names is recommended, since it will
 make one less thing you need to change if you decide to change to
 another database encoding. Note however that the <literal>default</>,
-<literal>C</>, and <literal>POSIX</> collations can be used
-regardless of the database encoding.
+<literal>C</>, and <literal>POSIX</> collations, as well as all collations
+provided by ICU can be used regardless of the database encoding.
 </para>
 
 <para>
@@ -581,6 +632,104 @@ SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
 collations have identical behaviors. Mixing stripped and non-stripped
 collation names is therefore not recommended.
 </para>
+</sect4>
+
+<sect4>
+<title>ICU collations</title>
+
+<para>
+Collations provided by ICU are created with names in BCP 47 language tag
+format, with a <quote>private use</quote>
+extension <literal>-x-icu</literal> appended, to distinguish them from
+libc locales. So <literal>de-x-icu</literal> would be an example.
+</para>
+
+<para>
+With ICU, it is not sensible to enumerate all possible locale names. ICU
+uses a particular naming system for locales, but there are many more ways
+to name a locale than there are actually distinct locales. (In fact, any
+string will be accepted as a locale name.)
+See <ulink url="http://userguide.icu-project.org/locale"></ulink> for
+information on ICU locale naming. <command>initdb</command> uses the ICU
+APIs to extract a set of locales with distinct collation rules to populate
+the initial set of collations. Here are some example collations that
+might be created:
+
+<variablelist>
+<varlistentry>
+<term><literal>de-x-icu</literal></term>
+<listitem>
+<para>German collation, default variant</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+<term><literal>de-u-co-phonebk-x-icu</literal></term>
+<listitem>
+<para>German collation, phone book variant</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+<term><literal>de-AT-x-icu</literal></term>
+<listitem>
+<para>German collation for Austria, default variant</para>
+<para>
+(Note that as of this writing, there is no,
+say, <literal>de-DE-x-icu</literal> or <literal>de-CH-x-icu</literal>,
+because those are equivalent to <literal>de-x-icu</literal>.)
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+<term><literal>de-AT-u-co-phonebk-x-icu</literal></term>
+<listitem>
+<para>German collation for Austria, phone book variant</para>
+</listitem>
+</varlistentry>
+<varlistentry>
+<term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term>
+<listitem>
+<para>
+ICU <quote>root</quote> collation. Use this to get a reasonable
+language-agnostic sort order.
+</para>
+</listitem>
+</varlistentry>
+</variablelist>
+</para>
+
+<para>
+Some (less frequently used) encodings are not supported by ICU. If the
+database cluster was initialized with such an encoding, no ICU collations
+will be predefined.
+</para>
+</sect4>
+</sect3>
+
+<sect3>
+<title>Copying Collations</title>
+
+<para>
+The command <xref linkend="sql-createcollation"> can also be used to
+create a new collation from an existing collation, which can be useful to
+be able to use operating-system-independent collation names in
+applications, create compatibility names, or use an ICU-provided collation
+under a more readable name. For example:
+<programlisting>
+CREATE COLLATION german FROM "de_DE";
+CREATE COLLATION french FROM "fr-x-icu";
+CREATE COLLATION "de-DE-x-icu" FROM "de-x-icu";
+</programlisting>
+</para>
+
+<para>
+The standard and predefined collations are in the
+schema <literal>pg_catalog</literal>, like all predefined objects.
+User-defined collations should be created in user schemas. This also
+ensures that they are saved by <command>pg_dump</command>.
+</para>
 </sect2>
 </sect1>
 
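
A short sketch of how the predefined ICU collations documented in the hunk above might be used (not part of the commit). It assumes initdb created de-u-co-phonebk-x-icu and und-x-icu in this cluster; the table people, its column, and the collation name german_phonebook are hypothetical.

```sql
-- Hypothetical table; the column sorts with the German phone book variant.
CREATE TABLE people (
    name text COLLATE "de-u-co-phonebk-x-icu"
);

-- Override the column collation for one query with the ICU root collation.
SELECT name FROM people ORDER BY name COLLATE "und-x-icu";

-- Copy an ICU collation under a more readable name. Without a schema
-- qualification it is created in the current (user) schema, so pg_dump saves it.
CREATE COLLATION german_phonebook FROM "de-u-co-phonebk-x-icu";
```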
@@ -19545,6 +19545,14 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 </thead>
 
 <tbody>
+<row>
+<entry>
+<indexterm><primary>pg_collation_actual_version</primary></indexterm>
+<literal><function>pg_collation_actual_version(<type>oid</>)</function></literal>
+</entry>
+<entry><type>text</type></entry>
+<entry>Return actual version of collation from operating system</entry>
+</row>
 <row>
 <entry>
 <indexterm><primary>pg_import_system_collations</primary></indexterm>
@@ -19557,6 +19565,15 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 </tgroup>
 </table>
 
+<para>
+<function>pg_collation_actual_version</function> returns the actual
+version of the collation object as it is currently installed in the
+operating system. If this is different from the value
+in <literal>pg_collation.collversion</literal>, then objects depending on
+the collation might need to be rebuilt. See also
+<xref linkend="sql-altercollation">.
+</para>
+
 <para>
 <function>pg_import_system_collations</> populates the system
 catalog <literal>pg_collation</literal> with collations based on all the
@@ -766,6 +766,21 @@ su - postgres
 </listitem>
 </varlistentry>
 
+<varlistentry>
+<term><option>--with-icu</option></term>
+<listitem>
+<para>
+Build with support for
+the <productname>ICU</productname><indexterm><primary>ICU</></>
+library. This requires the <productname>ICU4C</productname> package
+as well
+as <productname>pkg-config</productname><indexterm><primary>pkg-config</></>
+to be installed. The minimum required version
+of <productname>ICU4C</productname> is currently 4.6.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 <term><option>--with-openssl</option>
 <indexterm>
@@ -967,7 +967,8 @@ ERROR: could not serialize access due to read/write dependencies among transact
 </para>
 
 <para>
-Acquired by <command>CREATE TRIGGER</command> and many forms of
+Acquired by <command>CREATE COLLATION</command>,
+<command>CREATE TRIGGER</command>, and many forms of
 <command>ALTER TABLE</command> (see <xref linkend="SQL-ALTERTABLE">).
 </para>
 </listitem>
@@ -21,6 +21,8 @@ PostgreSQL documentation
 
 <refsynopsisdiv>
 <synopsis>
+ALTER COLLATION <replaceable>name</replaceable> REFRESH VERSION
+
 ALTER COLLATION <replaceable>name</replaceable> RENAME TO <replaceable>new_name</replaceable>
 ALTER COLLATION <replaceable>name</replaceable> OWNER TO { <replaceable>new_owner</replaceable> | CURRENT_USER | SESSION_USER }
 ALTER COLLATION <replaceable>name</replaceable> SET SCHEMA <replaceable>new_schema</replaceable>
@@ -85,9 +87,62 @@ ALTER COLLATION <replaceable>name</replaceable> SET SCHEMA <replaceable>new_sche
 </para>
 </listitem>
 </varlistentry>
+
+<varlistentry>
+<term><literal>REFRESH VERSION</literal></term>
+<listitem>
+<para>
+Update the collation version.
+See <xref linkend="sql-altercollation-notes"> below.
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 </refsect1>
 
+<refsect1 id="sql-altercollation-notes">
+<title>Notes</title>
+
+<para>
+When using collations provided by the ICU library, the ICU-specific version
+of the collator is recorded in the system catalog when the collation object
+is created. When the collation is then used, the current version is
+checked against the recorded version, and a warning is issued when there is
+a mismatch, for example:
+<screen>
+WARNING: ICU collator version mismatch
+DETAIL: The database was created using version 1.2.3.4, the library provides version 2.3.4.5.
+HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right version of ICU.
+</screen>
+A change in collation definitions can lead to corrupt indexes and other
+problems where the database system relies on stored objects having a
+certain sort order. Generally, this should be avoided, but it can happen
+in legitimate circumstances, such as when
+using <command>pg_upgrade</command> to upgrade to server binaries linked
+with a newer version of ICU. When this happens, all objects depending on
+the collation should be rebuilt, for example,
+using <command>REINDEX</command>. When that is done, the collation version
+can be refreshed using the command <literal>ALTER COLLATION ... REFRESH
+VERSION</literal>. This will update the system catalog to record the
+current collator version and will make the warning go away. Note that this
+does not actually check whether all affected objects have been rebuilt
+correctly.
+</para>
+
+<para>
+The following query can be used to identify all collations in the current
+database that need to be refreshed and the objects that depend on them:
+<programlisting><![CDATA[
+SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
+pg_describe_object(classid, objid, objsubid) AS "Object"
+FROM pg_depend d JOIN pg_collation c
+ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
+WHERE c.collversion <> pg_collation_actual_version(c.oid)
+ORDER BY 1, 2;
+]]></programlisting>
+</para>
+</refsect1>
+
 <refsect1>
 <title>Examples</title>
 
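
The recovery procedure described in the Notes section above could look roughly like this in practice (a sketch, not part of the commit). The index name people_name_idx is hypothetical, and the collation name stands in for whichever one the warning reports.

```sql
-- 1. Find ICU collations whose recorded version differs from what the
--    installed library reports.
SELECT collname,
       collversion,
       pg_collation_actual_version(oid) AS actual_version
FROM pg_collation
WHERE collprovider = 'i'
  AND collversion <> pg_collation_actual_version(oid);

-- 2. Rebuild every object that depends on an affected collation
--    (hypothetical index shown here).
REINDEX INDEX people_name_idx;

-- 3. Only after the rebuilds, record the current version. This silences the
--    warning but does not verify that step 2 was complete.
ALTER COLLATION pg_catalog."de-x-icu" REFRESH VERSION;
```

The order matters: refreshing the version before rebuilding would hide the warning while the indexes may still be ordered according to the old collation rules.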
@@ -21,7 +21,9 @@
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
 [ LOCALE = <replaceable>locale</replaceable>, ]
 [ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
-[ LC_CTYPE = <replaceable>lc_ctype</replaceable> ]
+[ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
+[ PROVIDER = <replaceable>provider</replaceable>, ]
+[ VERSION = <replaceable>version</replaceable> ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
 </synopsis>
@@ -113,6 +115,39 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 </listitem>
 </varlistentry>
 
+<varlistentry>
+<term><replaceable>provider</replaceable></term>
+
+<listitem>
+<para>
+Specifies the provider to use for locale services associated with this
+collation. Possible values
+are: <literal>icu</literal>,<indexterm><primary>ICU</></> <literal>libc</literal>.
+The available choices depend on the operating system and build options.
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry>
+<term><replaceable>version</replaceable></term>
+
+<listitem>
+<para>
+Specifies the version string to store with the collation. Normally,
+this should be omitted, which will cause the version to be computed
+from the actual version of the collation as provided by the operating
+system. This option is intended to be used
+by <command>pg_upgrade</command> for copying the version from an
+existing installation.
+</para>
+
+<para>
+See also <xref linkend="sql-altercollation"> for how to handle
+collation version mismatches.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry>
 <term><replaceable>existing_collation</replaceable></term>
 
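
A minimal sketch of the extended CREATE COLLATION syntax shown in the synopsis hunk above (not part of the commit). The collation names are hypothetical, the libc locale names depend on what the operating system has installed, and the form of locale string that ICU accepts may depend on the ICU version in use.

```sql
-- ICU-provided collation; VERSION is omitted so that it is computed from the
-- library at creation time.
CREATE COLLATION german_custom (provider = icu, locale = 'de-u-co-phonebk');

-- libc-provided collation with different LC_COLLATE and LC_CTYPE settings,
-- assuming both locales exist on the system.
CREATE COLLATION mixed_libc (
    provider = libc,
    lc_collate = 'de_DE.utf8',
    lc_ctype = 'en_US.utf8'
);

-- Confirm what was recorded in the catalog.
SELECT collname, collprovider, collversion
FROM pg_collation
WHERE collname IN ('german_custom', 'mixed_libc');
```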