mirror of
https://github.com/postgres/postgres.git
synced 2025-04-24 10:47:04 +03:00
Doc improvements for language tags and custom ICU collations.
Separate the documentation for language tags themselves from the available collation settings which can be included in a language tag. Include tables of the available options, more details about the effects of each option, and additional examples. Also include an explanation of the "levels" of textual features and how they relate to collation.

Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org
Reviewed-by: Jonathan S. Katz
parent 8a2523ff35
commit 1e16af8ab5
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
variants and customization options.
</para>
</sect2>
<sect2 id="icu-locales">
<title>ICU Locales</title>
<sect3 id="icu-locale-names">
<title>ICU Locale Names</title>
<para>
The ICU format for the locale name is a <link
linkend="icu-language-tag">Language Tag</link>.

<programlisting>
CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
</programlisting>
</para>
</sect3>
<sect3 id="icu-canonicalization">
<title>Locale Canonicalization and Validation</title>
<para>
When defining a new ICU collation object or database with ICU as the
provider, the given locale name is transformed ("canonicalized") into a
language tag if not already in that form. For instance,

<screen>
CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
NOTICE: using standard form "de-DE" for locale "de_DE.utf8"
</screen>

If you see this notice, check that the <symbol>PROVIDER</symbol> and
<symbol>LOCALE</symbol> are what you intended. For consistent results
when using the ICU provider, specify the canonical <link
linkend="icu-language-tag">language tag</link> instead of relying on the
transformation.
</para>
<para>
A locale with no language name, or the special language name
<literal>root</literal>, is transformed to have the language
<literal>und</literal> ("undetermined").
</para>
<para>
ICU can transform most libc locale names, as well as some other formats,
into language tags for easier transition to ICU. If a libc locale name is
used in ICU, it may not have precisely the same behavior as in libc.
</para>
<para>
If there is a problem interpreting the locale name, or if the locale name
represents a language or region that ICU does not recognize, you will see
the following warning:

<screen>
CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
WARNING: ICU locale "nonsense" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION
</screen>

<xref linkend="guc-icu-validation-level"/> controls how the message is
reported. Unless set to <literal>ERROR</literal>, the collation will
still be created, but the behavior may not be what the user intended.
</para>
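<para>
The effect of <xref linkend="guc-icu-validation-level"/> described above can
be sketched as follows. This is an illustrative session rather than captured
output: the collation name <literal>nonsense_strict</literal> is
hypothetical, and the exact error text depends on the server version, so it
is not shown.

<programlisting>
-- Report ICU locale validation problems as errors for this session.
SET icu_validation_level = 'ERROR';

-- The unrecognized language is now expected to make the command fail,
-- so no collation object is created.
CREATE COLLATION nonsense_strict (PROVIDER = icu, LOCALE = 'nonsense');

-- Return to the configured setting afterwards.
RESET icu_validation_level;
</programlisting>
</para>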
</sect3>
<sect3 id="icu-language-tag">
<title>Language Tag</title>
<para>
A language tag, defined in BCP 47, is a standardized identifier used to
identify languages, regions, and other information about a locale.
</para>
<para>
Basic language tags are simply
<replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>,
or even just <replaceable>language</replaceable>. The
<replaceable>language</replaceable> is a language code
(e.g. <literal>fr</literal> for French), and
<replaceable>region</replaceable> is a region code
(e.g. <literal>CA</literal> for Canada). Examples:
<literal>ja-JP</literal>, <literal>de</literal>, or
<literal>fr-CA</literal>.
</para>
<para>
Collation settings may be included in the language tag to customize
collation behavior. ICU allows extensive customization, such as
sensitivity (or insensitivity) to accents, case, and punctuation;
treatment of digits within text; and many other options to satisfy a
variety of uses.
</para>
<para>
To include this additional collation information in a language tag,
append <literal>-u</literal>, which indicates there are additional
collation settings, followed by one or more
<literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable>
pairs. The <replaceable>key</replaceable> is the key for a <link
linkend="icu-collation-settings">collation setting</link> and
<replaceable>value</replaceable> is a valid value for that setting. For
boolean settings, the <literal>-</literal><replaceable>key</replaceable>
may be specified without a corresponding
<literal>-</literal><replaceable>value</replaceable>, which implies a
value of <literal>true</literal>.
</para>
<para>
For example, the language tag <literal>en-US-u-kn-ks-level2</literal>
means the locale with the English language in the US region, with
collation settings <literal>kn</literal> set to <literal>true</literal>
and <literal>ks</literal> set to <literal>level2</literal>. Those
settings mean the collation will be case-insensitive and treat a sequence
of digits as a single number:

<screen>
CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
result
--------
t
(1 row)

SELECT 'N-45' &lt; 'N-123' COLLATE mycollation5 as result;
result
--------
t
(1 row)
</screen>
</para>
<para>
See <xref linkend="icu-custom-collations"/> for details and additional
examples of using language tags with custom collation information for the
locale.
</para>
</sect3>
</sect2>
<sect2 id="locale-problems">
<title>Problems</title>

@@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
</para>

<note>
<para>
The <literal>C</literal> and <literal>POSIX</literal> locales may behave
differently depending on the database encoding.
</para>
</note>

<para>
Additionally, two SQL standard collation names are available:

@@ -870,131 +1004,23 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
<title>ICU Collations</title>

<para>
ICU allows collations to be customized beyond the basic language+country
set that is preloaded by <command>initdb</command>. Users are encouraged
to define their own collation objects that make use of these facilities to
suit the sorting behavior to their requirements.
See <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
and <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink> for
information on ICU locale naming. The set of acceptable names and
attributes depends on the particular ICU version.
</para>
ICU collations can be created like:

<para>
Here are some examples:
<programlisting>
CREATE COLLATION german (provider = icu, locale = 'de-DE');
</programlisting>

<variablelist>
<varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
<listitem>
<para>German collation with phone book collation type</para>
<para>
The first example selects the ICU locale using a <quote>language
tag</quote> per BCP 47. The second example uses the traditional
ICU-specific locale syntax. The first style is preferred going
forward, and is used internally to store locales.
ICU locales are specified as a BCP 47 <link
linkend="icu-language-tag">Language Tag</link>, but can also accept most
libc-style locale names. If possible, libc-style locale names are
transformed into language tags.
</para>
<para>
Note that you can name the collation objects in the SQL environment
anything you want. In this example, we follow the naming style that
the predefined collations use, which in turn also follow BCP 47, but
that is not required for user-defined collations.
New ICU collations can customize collation behavior extensively by
including collation attributes in the language tag. See <xref
linkend="icu-custom-collations"/> for details and examples.
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
<listitem>
<para>
Root collation with Emoji collation type, per Unicode Technical Standard #51
</para>
<para>
Observe how in the traditional ICU locale naming system, the root
locale is selected by an empty string.
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
<term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
<term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');</literal></term>
<listitem>
<para>
Sort Greek letters before Latin ones. (The default is Latin before Greek.)
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kf-upper">
<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
<listitem>
<para>
Sort upper-case letters before lower-case letters. (The default is
lower-case letters first.)
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
<term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
<term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');</literal></term>
<listitem>
<para>
Combines both of the above options.
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kn-true">
<term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
<term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
<listitem>
<para>
Numeric ordering, sorts sequences of digits by their numeric value,
for example: <literal>A-21</literal> &lt; <literal>A-123</literal>
(also known as natural sort).
</para>
</listitem>
</varlistentry>
</variablelist>

See <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
Technical Standard #35</ulink>
and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
details. The list of possible collation types (<literal>co</literal>
subtag) can be found in
the <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
repository</ulink>.
</para>

<para>
Note that while this system allows creating collations that <quote>ignore
case</quote> or <quote>ignore accents</quote> or similar (using the
<literal>ks</literal> key), in order for such collations to act in a
truly case- or accent-insensitive manner, they also need to be declared as not
<firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
see <xref linkend="collation-nondeterministic"/>.
Otherwise, any strings that compare equal according to the collation but
are not byte-wise equal will be sorted according to their byte values.
</para>

<note>
<para>
By design, ICU will accept almost any string as a locale name and match
it to the closest locale it can provide, using the fallback procedure
described in its documentation. Thus, there will be no direct feedback
if a collation specification is composed using features that the given
ICU installation does not actually support. It is therefore recommended
to create application-level test cases to check that the collation
definitions satisfy one's requirements.
</para>
</note>
</sect4>

<sect4 id="collation-copy">
<title>Copying Collations</title>

@@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
</tip>
</sect3>
</sect2>
<sect2 id="icu-custom-collations">
<title>ICU Custom Collations</title>

<para>
ICU allows extensive control over collation behavior by defining new
collations with collation settings as a part of the language tag. These
settings can modify the collation order to suit a variety of needs. For
instance:

<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true

-- upper case letters sort before lower case.
CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
SELECT 'B' &lt; 'b' COLLATE upper_first; -- true

-- treat digits numerically and ignore punctuation
CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
SELECT 'id-45' &lt; 'id-123' COLLATE num_ignore_punct; -- true
SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
</programlisting>

Many of the available options are described in <xref
linkend="icu-collation-settings"/>, or see <xref
linkend="icu-external-references"/> for more details.
</para>
<sect3 id="icu-collation-comparison-levels">
<title>ICU Comparison Levels</title>
<para>
Comparison of two strings (collation) in ICU is determined by a
multi-level process, where textual features are grouped into
"levels". Treatment of each level is controlled by the <link
linkend="icu-collation-settings-table">collation settings</link>. Higher
levels correspond to finer textual features.
</para>
<para>
<table id="icu-collation-levels">
<title>ICU Collation Levels</title>
<tgroup cols="8">
<thead>
<row>
<entry>Level</entry>
<entry>Description</entry>
<entry><literal>'f' = 'f'</literal></entry>
<entry><literal>'ab' = U&amp;'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
<entry><literal>'n' = 'ñ'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
<tbody>
<row>
<entry>level1</entry>
<entry>Base Character</entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>false</literal></entry>
</row>
<row>
<entry>level2</entry>
<entry>Accents</entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
</row>
<row>
<entry>level3</entry>
<entry>Case/Variants</entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
</row>
<row>
<entry>level4</entry>
<entry>Punctuation</entry>
<entry><literal>true</literal></entry>
<entry><literal>true</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
</row>
<row>
<entry>identic</entry>
<entry>All</entry>
<entry><literal>true</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry><literal>false</literal></entry>
</row>
</tbody>
</tgroup>
</table>

The above table shows which textual feature differences are
considered significant when determining equality at the given level. The
Unicode character <literal>U+2063</literal> is an invisible separator,
and as seen in the table, is ignored at all levels of comparison less
than <literal>identic</literal>.
</para>
<para>
At every level, even with full normalization off, basic normalization is
performed. For example, <literal>'á'</literal> may be composed of the
code points <literal>U&amp;'\0061\0301'</literal> or the single code
point <literal>U&amp;'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
any difference in code point representation as distinct, use a collation
created with <symbol>DETERMINISTIC</symbol> set to
<literal>true</literal>.
</para>
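<para>
As a minimal sketch of the distinction above, assuming a UTF-8 database and
the predefined deterministic collation <literal>"und-x-icu"</literal> (the
collation name <literal>identic_nondet</literal> is hypothetical):

<programlisting>
CREATE COLLATION identic_nondet (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-identic');

-- decomposed and precomposed forms of 'á' compare equal, per the text above
SELECT U&amp;'\0061\0301' = U&amp;'\00E1' COLLATE identic_nondet;

-- a deterministic collation breaks ties by byte value, so these differ
SELECT U&amp;'\0061\0301' = U&amp;'\00E1' COLLATE "und-x-icu";
</programlisting>
</para>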
<sect4 id="icu-collation-level-examples">
<title>Collation Level Examples</title>
<para>

<programlisting>
CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');

-- invisible separator ignored at all levels except identic
SELECT 'ab' = U&amp;'a\2063b' COLLATE level4; -- true
SELECT 'ab' = U&amp;'a\2063b' COLLATE identic; -- false

-- punctuation ignored at level 3 but not at level 4
SELECT 'x-y' = 'x_y' COLLATE level3; -- true
SELECT 'x-y' = 'x_y' COLLATE level4; -- false
</programlisting>

</para>
</sect4>
</sect3>
<sect3 id="icu-collation-settings">
<title>Collation Settings for an ICU Locale</title>
<para>
<table id="icu-collation-settings-table">
<title>ICU Collation Settings</title>
<tgroup cols="4">
<thead>
<row>
<entry>Key</entry>
<entry>Values</entry>
<entry>Default</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>ks</literal></entry>
<entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry>
<entry><literal>level3</literal></entry>
<entry>
Sensitivity (or "strength") when determining equality, with
<literal>level1</literal> the least sensitive to differences and
<literal>identic</literal> the most sensitive to differences. See
<xref linkend="icu-collation-levels"/> for details.
</entry>
</row>
<row>
<entry><literal>ka</literal></entry>
<entry><literal>noignore</literal>, <literal>shifted</literal></entry>
<entry><literal>noignore</literal></entry>
<entry>
If set to <literal>shifted</literal>, causes some characters
(e.g. punctuation or space) to be ignored in comparison. Key
<literal>ks</literal> must be set to <literal>level3</literal> or
lower to take effect. Set key <literal>kv</literal> to control which
character classes are ignored.
</entry>
</row>
<row>
<entry><literal>kb</literal></entry>
<entry><literal>true</literal>, <literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
before <literal>'aé'</literal>.
</entry>
</row>
<row>
<entry><literal>kk</literal></entry>
<entry><literal>true</literal>, <literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry>
<para>
Enable full normalization; may affect performance. Basic
normalization is performed even when set to
<literal>false</literal>. Locales for languages that require full
normalization typically enable it by default.
</para>
<para>
Full normalization is important in some cases, such as when
multiple accents are applied to a single character. For instance,
<literal>'ệ'</literal> can be composed of code points
<literal>U&amp;'\0065\0323\0302'</literal> or
<literal>U&amp;'\0065\0302\0323'</literal>. With full normalization
on, these code point sequences are treated as equal; otherwise they
are unequal.
</para>
</entry>
</row>
<row>
<entry><literal>kc</literal></entry>
<entry><literal>true</literal>, <literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry>
<para>
Separates case into a "level 2.5" that falls between accents and
other level 3 features.
</para>
<para>
If set to <literal>true</literal> and <literal>ks</literal> is set
to <literal>level1</literal>, will ignore accents but take case
into account.
</para>
</entry>
</row>
<row>
<entry><literal>kf</literal></entry>
<entry>
<literal>upper</literal>, <literal>lower</literal>,
<literal>false</literal>
</entry>
<entry><literal>false</literal></entry>
<entry>
If set to <literal>upper</literal>, upper case sorts before lower
case. If set to <literal>lower</literal>, lower case sorts before
upper case. If set to <literal>false</literal>, the sort depends on
the rules of the locale.
</entry>
</row>
<row>
<entry><literal>kn</literal></entry>
<entry><literal>true</literal>, <literal>false</literal></entry>
<entry><literal>false</literal></entry>
<entry>
If set to <literal>true</literal>, numbers within a string are
treated as a single numeric value rather than a sequence of
digits. For example, <literal>'id-45'</literal> sorts before
<literal>'id-123'</literal>.
</entry>
</row>
<row>
<entry><literal>kr</literal></entry>
<entry>
<literal>space</literal>, <literal>punct</literal>,
<literal>symbol</literal>, <literal>currency</literal>,
<literal>digit</literal>, <replaceable>script-id</replaceable>
</entry>
<entry></entry>
<entry>
<para>
Set to one or more of the valid values, or any BCP 47
<replaceable>script-id</replaceable>, e.g. <literal>latn</literal>
("Latin") or <literal>grek</literal> ("Greek"). Multiple values are
separated by "<literal>-</literal>".
</para>
<para>
Redefines the ordering of classes of characters; those characters
belonging to a class earlier in the list sort before characters
belonging to a class later in the list. For instance, the value
<literal>digit-currency-space</literal> (as part of a language tag
like <literal>und-u-kr-digit-currency-space</literal>) sorts
punctuation before digits and spaces.
</para>
</entry>
</row>
<row>
<entry><literal>kv</literal></entry>
<entry>
<literal>space</literal>, <literal>punct</literal>,
<literal>symbol</literal>, <literal>currency</literal>
</entry>
<entry><literal>punct</literal></entry>
<entry>
Classes of characters ignored during comparison at level 3. Setting
to a later value includes earlier values;
e.g. <literal>symbol</literal> also includes
<literal>punct</literal> and <literal>space</literal> in the
characters to be ignored. Key <literal>ka</literal> must be set to
<literal>shifted</literal> and key <literal>ks</literal> must be set
to <literal>level3</literal> or lower to take effect.
</entry>
</row>
<row>
<entry><literal>co</literal></entry>
<entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry>
<entry><literal>standard</literal></entry>
<entry>
Collation type. See <xref linkend="icu-external-references"/> for additional options and details.
</entry>
</row>
</tbody>
</tgroup>
</table>
Defaults may depend on locale. The above table is not meant to be
complete. See <xref linkend="icu-external-references"/> for additional
options and details.
</para>
<note>
<para>
For many collation settings, you must create the collation with
<option>DETERMINISTIC</option> set to <literal>false</literal> for the
setting to have the desired effect (see <xref
linkend="collation-nondeterministic"/>). Additionally, some settings
only take effect when the key <literal>ka</literal> is set to
<literal>shifted</literal> (see <xref
linkend="icu-collation-settings-table"/>).
</para>
</note>
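<para>
To illustrate how <literal>ka</literal> and <literal>kv</literal> interact,
the following sketch limits the ignored characters to spaces. The collation
name <literal>ignore_space</literal> is hypothetical, an ICU version that
supports the <literal>kv</literal> key is assumed, and the commented results
are what the settings table implies rather than captured output.

<programlisting>
CREATE COLLATION ignore_space (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kv-space');

-- spaces belong to the ignored class, so this is expected to be true
SELECT 'a b' = 'ab' COLLATE ignore_space;

-- punctuation is not ignored when kv is space, so this is expected to be false
SELECT 'a-b' = 'ab' COLLATE ignore_space;
</programlisting>
</para>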
</sect3>
<sect3 id="icu-locale-examples">
<title>Examples</title>
<para>
<variablelist>
<varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
<term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
<listitem>
<para>German collation with phone book collation type</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
<term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
<listitem>
<para>
Root collation with Emoji collation type, per Unicode Technical Standard #51
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
<term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
<listitem>
<para>
Sort Greek letters before Latin ones. (The default is Latin before Greek.)
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kf-upper">
<term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
<listitem>
<para>
Sort upper-case letters before lower-case letters. (The default is
lower-case letters first.)
</para>
</listitem>
</varlistentry>

<varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
<term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
<listitem>
<para>
Combines both of the above options.
</para>
</listitem>
</varlistentry>
</variablelist>
</para>
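<para>
A custom collation can be applied to a single expression with the
<literal>COLLATE</literal> clause. A brief sketch, using a hypothetical
collation name <literal>natural_sort</literal> and inline sample values:

<programlisting>
CREATE COLLATION natural_sort (PROVIDER = icu, LOCALE = 'en-u-kn');

-- with numeric ordering, 'A-21' is expected to sort before 'A-123'
SELECT v FROM (VALUES ('A-123'), ('A-21')) AS t(v) ORDER BY v COLLATE natural_sort;
</programlisting>
</para>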
</sect3>
<sect3 id="icu-external-references">
<title>External References for ICU</title>
<para>
This section (<xref linkend="icu-custom-collations"/>) is only a brief
overview of ICU behavior and language tags. Refer to the following
documents for technical details, additional options, and new behavior:
</para>
<itemizedlist>
<listitem>
<para>
<ulink
url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
Technical Standard #35</ulink>
</para>
</listitem>
<listitem>
<para>
<ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink>
</para>
</listitem>
<listitem>
<para>
<ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
repository</ulink>
</para>
</listitem>
<listitem>
<para>
<ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
</para>
</listitem>
<listitem>
<para>
<ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
</para>
</listitem>
</itemizedlist>
</sect3>
</sect2>
</sect1>

<sect1 id="multibyte">