mirror of https://github.com/postgres/postgres.git synced 2025-07-30 11:03:19 +03:00

ICU support

Add a column collprovider to pg_collation that determines which library
provides the collation data.  The existing choices are default and libc,
and this adds an icu choice, which uses the ICU4C library.

The pg_locale_t type is changed to a union that contains the
provider-specific locale handles.  Users of locale information are
changed to look into that struct for the appropriate handle to use.

Also add a collversion column that records the version of the collation
when it is created, and check at run time whether it is still the same.
This detects potentially incompatible library upgrades that can corrupt
indexes and other structures.  This is currently only supported by
ICU-provided collations.

initdb initializes the default collation set as before from the `locale
-a` output but also adds all available ICU locales, with "-x-icu"
appended to their names.

Currently, ICU-provided collations can only be explicitly named
collations.  The global database locales are still always libc-provided.

ICU support is enabled by configure --with-icu.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
This commit is contained in:
Peter Eisentraut
2017-03-23 15:25:34 -04:00
parent ea42cc18c3
commit eccfef81e1
45 changed files with 3957 additions and 437 deletions

@ -2020,6 +2020,14 @@
<entry>Owner of the collation</entry>
</row>
<row>
<entry><structfield>collprovider</structfield></entry>
<entry><type>char</type></entry>
<entry></entry>
<entry>Provider of the collation: <literal>d</literal> = database
default, <literal>c</literal> = libc, <literal>i</literal> = icu</entry>
</row>
<row>
<entry><structfield>collencoding</structfield></entry>
<entry><type>int4</type></entry>
@ -2041,6 +2049,17 @@
<entry></entry>
<entry><symbol>LC_CTYPE</> for this collation object</entry>
</row>
<row>
<entry><structfield>collversion</structfield></entry>
<entry><type>text</type></entry>
<entry></entry>
<entry>
Provider-specific version of the collation. This is recorded when the
collation is created and then checked when it is used, to detect
changes in the collation definition that could lead to data corruption.
</entry>
</row>
</tbody>
</tgroup>
</table>

@ -500,20 +500,46 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
<title>Managing Collations</title>
<para>
A collation is an SQL schema object that maps an SQL name to
operating system locales. In particular, it maps to a combination
of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>. (As
A collation is an SQL schema object that maps an SQL name to locales
provided by libraries installed in the operating system. A collation
definition has a <firstterm>provider</firstterm> that specifies which
library supplies the locale data. One standard provider name
is <literal>libc</literal>, which uses the locales provided by the
operating system C library. These are the locales that most tools
provided by the operating system use. Another provider
is <literal>icu</literal>, which uses the external
ICU<indexterm><primary>ICU</></> library. Support for ICU has to be
configured when PostgreSQL is built.
</para>
<para>
A collation object provided by <literal>libc</literal> maps to a
combination of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>
settings. (As
the name would suggest, the main purpose of a collation is to set
<symbol>LC_COLLATE</symbol>, which controls the sort order. But
it is rarely necessary in practice to have an
<symbol>LC_CTYPE</symbol> setting that is different from
<symbol>LC_COLLATE</symbol>, so it is more convenient to collect
these under one concept than to create another infrastructure for
setting <symbol>LC_CTYPE</symbol> per expression.) Also, a collation
setting <symbol>LC_CTYPE</symbol> per expression.) Also,
a <literal>libc</literal> collation
is tied to a character set encoding (see <xref linkend="multibyte">).
The same collation name may exist for different encodings.
</para>
<para>
A collation provided by <literal>icu</literal> maps to a named collator
provided by the ICU library. ICU does not support
separate <quote>collate</quote> and <quote>ctype</quote> settings, so they
are always the same. Also, ICU collations are independent of the
encoding, so there is always only one ICU collation for a given name in a
database.
</para>
<sect3>
<title>Standard Collations</title>
<para>
On all platforms, the collations named <literal>default</>,
<literal>C</>, and <literal>POSIX</> are available. Additional
@ -527,13 +553,37 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
</para>
<para>
Additionally, the SQL standard collation name <literal>ucs_basic</literal>
is available for encoding <literal>UTF8</literal>. It is equivalent
to <literal>C</literal> and sorts by Unicode code point.
</para>
</sect3>
<sect3>
<title>Predefined Collations</title>
<para>
If the operating system provides support for using multiple locales
within a single program (<function>newlocale</> and related functions),
or support for ICU is configured,
then when a database cluster is initialized, <command>initdb</command>
populates the system catalog <literal>pg_collation</literal> with
collations based on all the locales it finds on the operating
system at the time. For example, the operating system might
system at the time.
</para>
<para>
To inspect the currently available locales, use the query <literal>SELECT
* FROM pg_collation</literal>, or the command <command>\dOS+</command>
in <application>psql</application>.
</para>
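<para>
As an illustrative sketch, the listing can be narrowed to the columns
that distinguish one provider from another. The choice of columns and
the ordering here are only a suggestion:
<programlisting>
-- collprovider: d = database default, c = libc, i = icu
SELECT collname, collprovider, collencoding, collversion
FROM pg_collation
ORDER BY collprovider, collname;
</programlisting>
</para>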
<sect4>
<title>libc collations</title>
<para>
For example, the operating system might
provide a locale named <literal>de_DE.utf8</literal>.
<command>initdb</command> would then create a collation named
<literal>de_DE.utf8</literal> for encoding <literal>UTF8</literal>
@ -548,13 +598,14 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
</para>
<para>
In case a collation is needed that has different values for
<symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, a new
collation may be created using
the <xref linkend="sql-createcollation"> command. That command
can also be used to create a new collation from an existing
collation, which can be useful to be able to use
operating-system-independent collation names in applications.
The default set of collations provided by <literal>libc</literal> maps
directly to the locales installed in the operating system, which can be
listed using the command <literal>locale -a</literal>. In case
a <literal>libc</literal> collation is needed that has different values
for <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or new
locales are installed in the operating system after the database system
was initialized, then a new collation may be created using
the <xref linkend="sql-createcollation"> command.
</para>
<para>
@ -566,8 +617,8 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
Use of the stripped collation names is recommended, since it will
make one less thing you need to change if you decide to change to
another database encoding. Note however that the <literal>default</>,
<literal>C</>, and <literal>POSIX</> collations can be used
regardless of the database encoding.
<literal>C</>, and <literal>POSIX</> collations, as well as all collations
provided by ICU, can be used regardless of the database encoding.
</para>
<para>
@ -581,6 +632,104 @@ SELECT a COLLATE "C" &lt; b COLLATE "POSIX" FROM test1;
collations have identical behaviors. Mixing stripped and non-stripped
collation names is therefore not recommended.
</para>
</sect4>
<sect4>
<title>ICU collations</title>
<para>
Collations provided by ICU are created with names in BCP 47 language tag
format, with a <quote>private use</quote>
extension <literal>-x-icu</literal> appended, to distinguish them from
libc locales. So <literal>de-x-icu</literal> would be an example.
</para>
<para>
With ICU, it is not sensible to enumerate all possible locale names. ICU
uses a particular naming system for locales, but there are many more ways
to name a locale than there are actually distinct locales. (In fact, any
string will be accepted as a locale name.)
See <ulink url="http://userguide.icu-project.org/locale"></ulink> for
information on ICU locale naming. <command>initdb</command> uses the ICU
APIs to extract a set of locales with distinct collation rules to populate
the initial set of collations. Here are some example collations that
might be created:
<variablelist>
<varlistentry>
<term><literal>de-x-icu</literal></term>
<listitem>
<para>German collation, default variant</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>de-u-co-phonebk-x-icu</literal></term>
<listitem>
<para>German collation, phone book variant</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>de-AT-x-icu</literal></term>
<listitem>
<para>German collation for Austria, default variant</para>
<para>
(Note that as of this writing, there is no,
say, <literal>de-DE-x-icu</literal> or <literal>de-CH-x-icu</literal>,
because those are equivalent to <literal>de-x-icu</literal>.)
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>de-AT-u-co-phonebk-x-icu</literal></term>
<listitem>
<para>German collation for Austria, phone book variant</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term>
<listitem>
<para>
ICU <quote>root</quote> collation. Use this to get a reasonable
language-agnostic sort order.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
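<para>
A collation that is not in the predefined set can presumably be created
explicitly by naming the <literal>icu</literal> provider. The following
is a minimal sketch; the collation name <literal>german_phonebook</literal>
is made up for illustration, and the table <literal>test1</literal> with a
text column <literal>a</literal> is assumed to exist:
<programlisting>
-- Hypothetical name; the locale string is passed to ICU as given.
CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');

-- Assumes a table test1 with a text column a.
SELECT a FROM test1 ORDER BY a COLLATE german_phonebook;
</programlisting>
</para>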
<para>
Some (less frequently used) encodings are not supported by ICU. If the
database cluster was initialized with such an encoding, no ICU collations
will be predefined.
</para>
</sect4>
</sect3>
<sect3>
<title>Copying Collations</title>
<para>
The command <xref linkend="sql-createcollation"> can also be used to
create a new collation from an existing collation. This can be useful for
using operating-system-independent collation names in applications,
creating compatibility names, or using an ICU-provided collation
under a more readable name. For example:
<programlisting>
CREATE COLLATION german FROM "de_DE";
CREATE COLLATION french FROM "fr-x-icu";
CREATE COLLATION "de-DE-x-icu" FROM "de-x-icu";
</programlisting>
</para>
<para>
The standard and predefined collations are in the
schema <literal>pg_catalog</literal>, like all predefined objects.
User-defined collations should be created in user schemas. This also
ensures that they are saved by <command>pg_dump</command>.
</para>
</sect2>
</sect1>

@ -19545,6 +19545,14 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
</thead>
<tbody>
<row>
<entry>
<indexterm><primary>pg_collation_actual_version</primary></indexterm>
<literal><function>pg_collation_actual_version(<type>oid</>)</function></literal>
</entry>
<entry><type>text</type></entry>
<entry>Return actual version of collation from operating system</entry>
</row>
<row>
<entry>
<indexterm><primary>pg_import_system_collations</primary></indexterm>
@ -19557,6 +19565,15 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
</tgroup>
</table>
<para>
<function>pg_collation_actual_version</function> returns the actual
version of the collation object as it is currently installed in the
operating system. If this is different from the value
in <literal>pg_collation.collversion</literal>, then objects depending on
the collation might need to be rebuilt. See also
<xref linkend="sql-altercollation">.
</para>
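<para>
As a sketch of how the comparison might be made, the following query
shows the recorded and the actual version side by side for ICU-provided
collations. The column aliases are arbitrary:
<programlisting>
SELECT collname,
       collversion AS recorded_version,
       pg_collation_actual_version(oid) AS actual_version
FROM pg_collation
WHERE collprovider = 'i';
</programlisting>
</para>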
<para>
<function>pg_import_system_collations</> populates the system
catalog <literal>pg_collation</literal> with collations based on all the

@ -766,6 +766,21 @@ su - postgres
</listitem>
</varlistentry>
<varlistentry>
<term><option>--with-icu</option></term>
<listitem>
<para>
Build with support for
the <productname>ICU</productname><indexterm><primary>ICU</></>
library. This requires the <productname>ICU4C</productname> package
as well
as <productname>pkg-config</productname><indexterm><primary>pkg-config</></>
to be installed. The minimum required version
of <productname>ICU4C</productname> is currently 4.6.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><option>--with-openssl</option>
<indexterm>

@ -967,7 +967,8 @@ ERROR: could not serialize access due to read/write dependencies among transact
</para>
<para>
Acquired by <command>CREATE TRIGGER</command> and many forms of
Acquired by <command>CREATE COLLATION</command>,
<command>CREATE TRIGGER</command>, and many forms of
<command>ALTER TABLE</command> (see <xref linkend="SQL-ALTERTABLE">).
</para>
</listitem>

@ -21,6 +21,8 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
ALTER COLLATION <replaceable>name</replaceable> REFRESH VERSION
ALTER COLLATION <replaceable>name</replaceable> RENAME TO <replaceable>new_name</replaceable>
ALTER COLLATION <replaceable>name</replaceable> OWNER TO { <replaceable>new_owner</replaceable> | CURRENT_USER | SESSION_USER }
ALTER COLLATION <replaceable>name</replaceable> SET SCHEMA <replaceable>new_schema</replaceable>
@ -85,9 +87,62 @@ ALTER COLLATION <replaceable>name</replaceable> SET SCHEMA <replaceable>new_sche
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>REFRESH VERSION</literal></term>
<listitem>
<para>
Update the collation version.
See <xref linkend="sql-altercollation-notes"> below.
</para>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="sql-altercollation-notes">
<title>Notes</title>
<para>
When using collations provided by the ICU library, the ICU-specific version
of the collator is recorded in the system catalog when the collation object
is created. When the collation is then used, the current version is
checked against the recorded version, and a warning is issued when there is
a mismatch, for example:
<screen>
WARNING: ICU collator version mismatch
DETAIL: The database was created using version 1.2.3.4, the library provides version 2.3.4.5.
HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right version of ICU.
</screen>
A change in collation definitions can lead to corrupt indexes and other
problems where the database system relies on stored objects having a
certain sort order. Generally, this should be avoided, but it can happen
in legitimate circumstances, such as when
using <command>pg_upgrade</command> to upgrade to server binaries linked
with a newer version of ICU. When this happens, all objects depending on
the collation should be rebuilt, for example,
using <command>REINDEX</command>. When that is done, the collation version
can be refreshed using the command <literal>ALTER COLLATION ... REFRESH
VERSION</literal>. This will update the system catalog to record the
current collator version and will make the warning go away. Note that this
does not actually check whether all affected objects have been rebuilt
correctly.
</para>
<para>
The following query can be used to identify all collations in the current
database that need to be refreshed and the objects that depend on them:
<programlisting><![CDATA[
SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
pg_describe_object(classid, objid, objsubid) AS "Object"
FROM pg_depend d JOIN pg_collation c
ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
WHERE c.collversion <> pg_collation_actual_version(c.oid)
ORDER BY 1, 2;
]]></programlisting>
</para>
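<para>
Once the affected objects are known, the repair might look like the
following sketch. The index name <literal>test1_a_idx</literal> is
hypothetical, and <literal>"de-x-icu"</literal> stands in for whichever
collation was reported; every dependent object has to be rebuilt before
the version is refreshed, because the refresh itself verifies nothing:
<programlisting>
-- Rebuild each object that depends on the collation (hypothetical index).
REINDEX INDEX test1_a_idx;

-- Then record the current collator version, which silences the warning.
ALTER COLLATION "de-x-icu" REFRESH VERSION;
</programlisting>
</para>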
</refsect1>
<refsect1>
<title>Examples</title>

@ -21,7 +21,9 @@
CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
[ LOCALE = <replaceable>locale</replaceable>, ]
[ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
[ LC_CTYPE = <replaceable>lc_ctype</replaceable> ]
[ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
[ PROVIDER = <replaceable>provider</replaceable>, ]
[ VERSION = <replaceable>version</replaceable> ]
)
CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
</synopsis>
@ -113,6 +115,39 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable>provider</replaceable></term>
<listitem>
<para>
Specifies the provider to use for locale services associated with this
collation. Possible values
are: <literal>icu</literal>,<indexterm><primary>ICU</></> <literal>libc</literal>.
The available choices depend on the operating system and build options.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable>version</replaceable></term>
<listitem>
<para>
Specifies the version string to store with the collation. Normally,
this should be omitted, which will cause the version to be computed
from the actual version of the collation as provided by the operating
system. This option is intended to be used
by <command>pg_upgrade</command> for copying the version from an
existing installation.
</para>
<para>
See also <xref linkend="sql-altercollation"> for how to handle
collation version mismatches.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable>existing_collation</replaceable></term>