mirror of https://github.com/postgres/postgres.git synced 2025-07-28 23:42:10 +03:00

Per-column collation support

This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.

Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
Peter Eisentraut
2011-02-08 23:04:18 +02:00
parent 1703f0e8da
commit 414c5a2ea6
156 changed files with 4519 additions and 582 deletions


@ -103,6 +103,11 @@
<entry>check constraints, unique constraints, primary key constraints, foreign key constraints</entry>
</row>
<row>
<entry><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link></entry>
<entry>collations (locale information)</entry>
</row>
<row>
<entry><link linkend="catalog-pg-conversion"><structname>pg_conversion</structname></link></entry>
<entry>encoding conversion information</entry>
@ -1113,6 +1118,16 @@
</entry>
</row>
<row>
<entry><structfield>attcollation</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
<entry>
The defined collation of the column, zero if the column does
not have a collatable type.
</entry>
</row>
<row>
<entry><structfield>attacl</structfield></entry>
<entry><type>aclitem[]</type></entry>
@ -2050,6 +2065,76 @@
</sect1>
<sect1 id="catalog-pg-collation">
<title><structname>pg_collation</structname></title>
<indexterm zone="catalog-pg-collation">
<primary>pg_collation</primary>
</indexterm>
<para>
The catalog <structname>pg_collation</structname> describes the
available collations, which are essentially mappings from an SQL
name to operating system locale categories.
See <xref linkend="locale"> for more information.
</para>
<table>
<title><structname>pg_collation</> Columns</title>
<tgroup cols="4">
<thead>
<row>
<entry>Name</entry>
<entry>Type</entry>
<entry>References</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry><structfield>collname</structfield></entry>
<entry><type>name</type></entry>
<entry></entry>
<entry>Collation name (unique per namespace and encoding)</entry>
</row>
<row>
<entry><structfield>collnamespace</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-namespace"><structname>pg_namespace</structname></link>.oid</literal></entry>
<entry>
The OID of the namespace that contains this collation
</entry>
</row>
<row>
<entry><structfield>collencoding</structfield></entry>
<entry><type>int4</type></entry>
<entry></entry>
<entry>Encoding to which the collation is applicable</entry>
</row>
<row>
<entry><structfield>collcollate</structfield></entry>
<entry><type>name</type></entry>
<entry></entry>
<entry>LC_COLLATE for this collation object</entry>
</row>
<row>
<entry><structfield>collctype</structfield></entry>
<entry><type>name</type></entry>
<entry></entry>
<entry>LC_CTYPE for this collation object</entry>
</row>
</tbody>
</tgroup>
</table>
</sect1>
<sect1 id="catalog-pg-conversion">
<title><structname>pg_conversion</structname></title>
@ -3125,6 +3210,16 @@
</entry>
</row>
<row>
<entry><structfield>indcollation</structfield></entry>
<entry><type>oidvector</type></entry>
<entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
<entry>
For each column in the index key, this contains the OID of the
collation to use for the index.
</entry>
</row>
<row>
<entry><structfield>indclass</structfield></entry>
<entry><type>oidvector</type></entry>
@ -5866,6 +5961,21 @@
</para></entry>
</row>
<row>
<entry><structfield>typcollation</structfield></entry>
<entry><type>oid</type></entry>
<entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
<entry><para>
<structfield>typcollation</structfield> specifies the collation
of the type. If a type does not support collations, this will
be zero, collation analysis at parse time is skipped, and
the use of <literal>COLLATE</literal> clauses with the type is
invalid. A base type that supports collations will have
<symbol>DEFAULT_COLLATION_OID</symbol> here. A domain can have
another collation OID, if one was defined for the domain.
</para></entry>
</row>
<row>
<entry><structfield>typdefaultbin</structfield></entry>
<entry><type>pg_node_tree</type></entry>


@ -304,6 +304,170 @@ initdb --locale=sv_SE
</sect1>
<sect1 id="collation">
<title>Collation Support</title>
<para>
The collation support allows specifying the sort order and certain
other locale aspects of data per column or per operation at run
time. This alleviates the problem that the
<symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol> settings
of a database cannot be changed after its creation.
</para>
<note>
<para>
The collation support feature is currently known to work only on
Linux/glibc and Mac OS X platforms.
</para>
</note>
<sect2>
<title>Concepts</title>
<para>
Conceptually, every datum of a collatable data type has a
collation. (Collatable data types in the base system are
<type>text</type>, <type>varchar</type>, and <type>char</type>.
User-defined base types can also be marked collatable.) If the
datum is a column reference, the collation of the datum is the
defined collation of the column. If the datum is a constant, the
collation is the default collation of the data type of the
constant. The collation of more complex expressions is derived
from the input collations as described below.
</para>
<para>
The collation of a datum can also be the <quote>default</quote>
collation, which reverts to the locale settings defined for the
database. In some cases, a datum can also have no known
collation. In such cases, ordering operations and other
operations that need to know the collation will fail.
</para>
<para>
When the database system has to perform an ordering or a
comparison, it considers the collation of the input data. This
happens in two situations: an <literal>ORDER BY</literal> clause
and a function or operator call such as <literal>&lt;</literal>.
The collation to apply for the performance of the <literal>ORDER
BY</literal> clause is simply the collation of the sort key. The
collation to apply for a function or operator call is derived from
the arguments, as described below. Additionally, collations are
taken into account by functions that convert between lower and
upper case letters, that is, <function>lower</function>,
<function>upper</function>, and <function>initcap</function>.
</para>
<para>
For a function call, the collation that is derived from combining
the argument collations is used both for performing any
comparisons or ordering and as the collation of the function
result, if the result type is collatable.
</para>
<para>
The <firstterm>collation derivation</firstterm> of a datum can be
implicit or explicit. This distinction affects how collations are
combined when multiple different collations appear in an
expression. An explicit collation derivation arises when a
<literal>COLLATE</literal> clause is used; all other collation
derivations are implicit. When multiple collations need to be
combined, for example in a function call, the following rules are
used:
<orderedlist>
<listitem>
<para>
If any input item has an explicit collation derivation, then
all explicitly derived collations among the input items must be
the same, otherwise an error is raised. If an explicitly
derived collation is present, that is the result of the
collation combination.
</para>
</listitem>
<listitem>
<para>
Otherwise, all input items must have the same implicit
collation derivation or the default collation. If an
implicitly derived collation is present, that is the result of
the collation combination. Otherwise, the result is the
default collation.
</para>
</listitem>
</orderedlist>
For example, take this table definition:
<programlisting>
CREATE TABLE test1 (
a text COLLATE "x",
...
);
</programlisting>
Then in
<programlisting>
SELECT a || 'foo' FROM test1;
</programlisting>
the result collation of the <literal>||</literal> operator is
<literal>"x"</literal> because it combines an implicitly derived
collation with the default collation. But in
<programlisting>
SELECT a || ('foo' COLLATE "y") FROM test1;
</programlisting>
the result collation is <literal>"y"</literal> because the explicit
collation derivation overrides the implicit one.
</para>
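<para>
As a sketch of the first rule, reusing the placeholder collations
<literal>"x"</literal> and <literal>"y"</literal> and the table
<literal>test1</literal> from the example above, attaching two
different explicit collations to the arguments of one operator is
expected to be rejected:
<programlisting>
-- Both arguments have an explicit collation derivation and the two
-- collations differ, so the combination rule calls for an error.
SELECT (a COLLATE "x") || ('foo' COLLATE "y") FROM test1;
</programlisting>
Repeating the same explicit collation on both arguments, by
contrast, satisfies the rule, and the result collation should be
<literal>"y"</literal>:
<programlisting>
SELECT (a COLLATE "y") || ('foo' COLLATE "y") FROM test1;
</programlisting>
</para>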
</sect2>
<sect2>
<title>Managing Collations</title>
<para>
A collation is an SQL schema object that maps an SQL name to
operating system locales. In particular, it maps to a combination
of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>. (As
the name would indicate, the main purpose of a collation is to set
<symbol>LC_COLLATE</symbol>, which controls the sort order. But
it is rarely necessary in practice to have an
<symbol>LC_CTYPE</symbol> setting that is different from
<symbol>LC_COLLATE</symbol>, so it is more convenient to collect
these under one concept than to create another infrastructure for
setting <symbol>LC_CTYPE</symbol> per datum.) Also, a collation
is tied to a character encoding. The same collation name may
exist for different encodings.
</para>
<para>
When a database system is initialized, <command>initdb</command>
populates the system catalog <literal>pg_collation</literal> with
collations based on all the locales it finds on the operating
system at the time. For example, the operating system might
provide a locale named <literal>de_DE.utf8</literal>.
<command>initdb</command> would then create a collation named
<literal>de_DE.utf8</literal> for encoding <literal>UTF8</literal>
that has both <symbol>LC_COLLATE</symbol> and
<symbol>LC_CTYPE</symbol> set to <literal>de_DE.utf8</literal>.
It will also create a collation with the <literal>.utf8</literal>
tag stripped off the name. So you could also use the collation
under the name <literal>de_DE</literal>, which is less cumbersome
to write and makes the name less encoding-dependent. Note that,
nevertheless, the initial set of collation names is
platform-dependent.
</para>
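<para>
The collations that <command>initdb</command> created can be
inspected by querying the catalog directly. The following is a
minimal sketch that uses only the
<structname>pg_collation</structname> columns and the existing
<function>pg_encoding_to_char</function> function; the
<literal>de_DE</literal> pattern is just the example locale from
above and might match nothing on a given platform:
<programlisting>
-- List the collations whose names start with de_DE, together with the
-- encoding each applies to. In LIKE, the underscore matches any single
-- character, which is acceptable for this illustration.
SELECT collname,
       pg_encoding_to_char(collencoding) AS encoding,
       collcollate,
       collctype
FROM pg_collation
WHERE collname LIKE 'de_DE%'
ORDER BY collname;
</programlisting>
</para>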
<para>
In case a collation is needed that has different values for
<symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or a
different name is needed for a collation (for example, for
compatibility with existing applications), a new collation may be
created. But there is currently no SQL-level support for creating
or changing collations.
</para>
</sect2>
</sect1>
<sect1 id="multibyte">
<title>Character Set Support</title>


@ -13059,6 +13059,12 @@ SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);
</thead>
<tbody>
<row>
<entry><literal><function>pg_collation_is_visible(<parameter>collation_oid</parameter>)</function></literal>
</entry>
<entry><type>boolean</type></entry>
<entry>is collation visible in search path</entry>
</row>
<row>
<entry><literal><function>pg_conversion_is_visible(<parameter>conversion_oid</parameter>)</function></literal>
</entry>
@ -13123,6 +13129,9 @@ SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);
</tgroup>
</table>
<indexterm>
<primary>pg_collation_is_visible</primary>
</indexterm>
<indexterm>
<primary>pg_conversion_is_visible</primary>
</indexterm>
@ -13256,7 +13265,7 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
<tbody>
<row>
<entry><literal><function>format_type(<parameter>type_oid</parameter>, <parameter>typemod</>)</function></literal></entry>
<entry><literal><function>format_type(<parameter>type_oid</parameter> [, <parameter>typemod</> [, <parameter>collation_oid</> ]])</function></literal></entry>
<entry><type>text</type></entry>
<entry>get SQL name of a data type</entry>
</row>
@ -13392,7 +13401,9 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
<para>
<function>format_type</function> returns the SQL name of a data type that
is identified by its type OID and possibly a type modifier. Pass NULL
for the type modifier if no specific modifier is known.
for the type modifier or omit the argument if no specific modifier is known.
If a collation is given as third argument, a <literal>COLLATE</> clause
followed by a formatted collation name is appended.
</para>
<para>


@ -921,6 +921,7 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
defining two operator classes for the data type and then selecting
the proper class when making an index. The operator class determines
the basic sort ordering (which can then be modified by adding sort options
<literal>COLLATE</literal>,
<literal>ASC</>/<literal>DESC</> and/or
<literal>NULLS FIRST</>/<literal>NULLS LAST</>).
</para>
@ -1002,6 +1003,47 @@ SELECT am.amname AS index_method,
</sect1>
<sect1 id="indexes-collations">
<title>Collations and Indexes</title>
<para>
An index can only support one collation for one column or
expression. If multiple collations are of interest, multiple
indexes may be created.
</para>
<para>
Consider these statements:
<programlisting>
CREATE TABLE test1c (
id integer,
content varchar COLLATE "x"
);
CREATE INDEX test1c_content_index ON test1c (content);
</programlisting>
The created index automatically follows the collation of the
underlying column, and so a query of the form
<programlisting>
SELECT * FROM test1c WHERE content = <replaceable>constant</replaceable>;
</programlisting>
could use the index.
</para>
<para>
If, in addition, a query such as
<programlisting>
SELECT * FROM test1c WHERE content &gt; <replaceable>constant</replaceable> COLLATE "y";
</programlisting>
is of interest, an additional index that supports
the <literal>"y"</literal> collation could be created under a different name, like so:
<programlisting>
CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");
</programlisting>
</para>
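<para>
To check which collation each index on the example table uses, one
possible approach is to read the
<structfield>indcollation</structfield> column of
<structname>pg_index</structname>. This is only a sketch and
assumes that the table <literal>test1c</literal> from above exists:
<programlisting>
-- One row per index on test1c. Each element of indcollation is the
-- OID of a pg_collation row, one element per index key column.
SELECT c.relname AS index_name,
       i.indcollation
FROM pg_index i
     JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'test1c'::regclass;
</programlisting>
</para>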
</sect1>
<sect1 id="indexes-examine">
<title>Examining Index Usage</title>


@ -22,6 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
CREATE DOMAIN <replaceable class="parameter">name</replaceable> [ AS ] <replaceable class="parameter">data_type</replaceable>
[ COLLATE <replaceable>collation</replaceable> ]
[ DEFAULT <replaceable>expression</replaceable> ]
[ <replaceable class="PARAMETER">constraint</replaceable> [ ... ] ]
@ -83,6 +84,17 @@ CREATE DOMAIN <replaceable class="parameter">name</replaceable> [ AS ] <replacea
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable>collation</replaceable></term>
<listitem>
<para>
An optional collation for the domain. If no collation is
specified, the database default collation is used (which can
be overridden when the domain is used to define a column).
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>DEFAULT <replaceable>expression</replaceable></literal></term>


@ -22,7 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</replaceable> ] ON <replaceable class="parameter">table</replaceable> [ USING <replaceable class="parameter">method</replaceable> ]
( { <replaceable class="parameter">column</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ <replaceable class="parameter">opclass</replaceable> ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
( { <replaceable class="parameter">column</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ COLLATE <replaceable class="parameter">collation</replaceable> ] [ <replaceable class="parameter">opclass</replaceable> ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WITH ( <replaceable class="PARAMETER">storage_parameter</replaceable> = <replaceable class="PARAMETER">value</replaceable> [, ... ] ) ]
[ TABLESPACE <replaceable class="parameter">tablespace</replaceable> ]
[ WHERE <replaceable class="parameter">predicate</replaceable> ]
@ -181,6 +181,20 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable class="parameter">collation</replaceable></term>
<listitem>
<para>
The name of the collation to use for the index. By default,
the index uses the collation declared for the column to be
indexed or the result collation of the expression to be
indexed. An index with a nondefault collation can be used by
queries that involve expressions using that same collation.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><replaceable class="parameter">opclass</replaceable></term>
<listitem>


@ -22,7 +22,7 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXISTS ] <replaceable class="PARAMETER">table_name</replaceable> ( [
{ <replaceable class="PARAMETER">column_name</replaceable> <replaceable class="PARAMETER">data_type</replaceable> [ <replaceable class="PARAMETER">column_constraint</replaceable> [ ... ] ]
{ <replaceable class="PARAMETER">column_name</replaceable> <replaceable class="PARAMETER">data_type</replaceable> [ COLLATE <replaceable>collation</replaceable> ] [ <replaceable class="PARAMETER">column_constraint</replaceable> [ ... ] ]
| <replaceable>table_constraint</replaceable>
| LIKE <replaceable>parent_table</replaceable> [ <replaceable>like_option</replaceable> ... ] }
[, ... ]
@ -244,6 +244,17 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
</listitem>
</varlistentry>
<varlistentry>
<term><literal>COLLATE <replaceable>collation</replaceable></literal></term>
<listitem>
<para>
The <literal>COLLATE</> clause assigns a nondefault collation to
the column. By default, the locale settings of the database are
used.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><literal>INHERITS ( <replaceable>parent_table</replaceable> [, ... ] )</literal></term>
<listitem>


@ -45,6 +45,7 @@ CREATE TYPE <replaceable class="parameter">name</replaceable> (
[ , DEFAULT = <replaceable class="parameter">default</replaceable> ]
[ , ELEMENT = <replaceable class="parameter">element</replaceable> ]
[ , DELIMITER = <replaceable class="parameter">delimiter</replaceable> ]
[ , COLLATABLE = <replaceable class="parameter">collatable</replaceable> ]
)
CREATE TYPE <replaceable class="parameter">name</replaceable>
@ -352,6 +353,16 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
with the array element type, not the array type itself.
</para>
<para>
If the optional
parameter <replaceable class="parameter">collatable</replaceable>
is true, column definitions and expressions of the type may carry
collation information and allow the use of
the <literal>COLLATE</literal> clause. It is up to the
implementations of the functions operating on the type to actually
make use of the collation information; this does not happen
automatically merely by marking the type collatable.
</para>
</refsect2>
<refsect2>


@ -236,6 +236,27 @@ gmake check LANG=C MULTIBYTE=EUC_JP
existing installation.
</para>
</sect2>
<sect2>
<title>Extra tests</title>
<para>
The regression test suite contains a few test files that are not
run by default, because they might be platform-dependent or take a
very long time to run. You can run these or other extra test
files by setting the variable <envar>EXTRA_TESTS</envar>. For
example, to run the <literal>numeric_big</literal> test:
<screen>
gmake check EXTRA_TESTS=numeric_big
</screen>
To run the collation tests:
<screen>
gmake check EXTRA_TESTS=collate.linux.utf8 LANG=en_US.utf8
</screen>
This test works only on Linux/glibc platforms and when run in a
UTF-8 locale.
</para>
</sect2>
</sect1>
<sect1 id="regress-evaluation">


@ -1899,6 +1899,54 @@ CAST ( <replaceable>expression</replaceable> AS <replaceable>type</replaceable>
</note>
</sect2>
<sect2 id="sql-syntax-collate-clause">
<title>COLLATE Clause</title>
<indexterm>
<primary>COLLATE</primary>
</indexterm>
<para>
The <literal>COLLATE</literal> clause overrides the collation of
an expression. It is appended to the expression it applies to:
<synopsis>
<replaceable>expr</replaceable> COLLATE <replaceable>collation</replaceable>
</synopsis>
where <replaceable>collation</replaceable> is a possibly
schema-qualified identifier. The <literal>COLLATE</literal>
clause binds tighter than operators; parentheses can be used when
necessary.
</para>
<para>
If no collation is explicitly specified, the database system
either derives a collation from the columns involved in the
expression, or it defaults to the default collation of the
database if no column is involved in the expression.
</para>
<para>
The two typical uses of the <literal>COLLATE</literal> clause are
overriding the sort order in an <literal>ORDER BY</> clause, for
example:
<programlisting>
SELECT a, b, c FROM tbl WHERE ... ORDER BY a COLLATE "C";
</programlisting>
and overriding the collation of a function or operator call that
has locale-sensitive results, for example:
<programlisting>
SELECT * FROM tbl WHERE a > 'foo' COLLATE "C";
</programlisting>
In the latter case it doesn't matter which argument of the
operator or function call the <literal>COLLATE</> clause is
attached to, because the collation that is applied by the operator
or function is derived from all arguments, and
the <literal>COLLATE</> clause will override the collations of all
other arguments. Attaching nonmatching <literal>COLLATE</>
clauses to more than one argument, however, is an error.
</para>
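<para>
The following sketch illustrates both points, reusing the table
<literal>tbl</literal> and column <literal>a</literal> from the
examples above. The collation name <literal>"en_US"</literal> is
an assumption made for illustration and might not exist on a given
platform, in which case the last statement would fail for that
reason instead:
<programlisting>
-- These two queries are expected to behave identically, because the
-- explicit collation overrides the implicit collation of the other
-- argument.
SELECT * FROM tbl WHERE a &gt; 'foo' COLLATE "C";
SELECT * FROM tbl WHERE a COLLATE "C" &gt; 'foo';

-- Nonmatching explicit collations on two arguments: an error.
SELECT * FROM tbl WHERE a COLLATE "C" &gt; 'foo' COLLATE "en_US";
</programlisting>
</para>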
</sect2>
<sect2 id="sql-syntax-scalar-subqueries">
<title>Scalar Subqueries</title>