Per-column collation support

This adds collation support for columns and domains, a COLLATE clause to override it per expression, and B-tree index support.

Peter Eisentraut, reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
2025-07-28 23:42:10 +03:00 · 2011-02-08 23:04:18 +02:00
parent 1703f0e8da
commit 414c5a2ea6
156 changed files with 4519 additions and 582 deletions
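
As a rough illustration of the three pieces named in the commit message (per-column collations, the per-expression COLLATE override, and B-tree index support), the sketch below exercises each one. The table and index names are hypothetical, and it assumes that collations named "C" and "POSIX" are present in pg_collation, which depends on the locales the platform provides.

```sql
-- Hypothetical table: column "a" carries an implicit collation of "C".
CREATE TABLE collation_demo (
    a text COLLATE "C"
);

-- Implicit "C" (from the column) combined with the default collation
-- of the constant: the operator result keeps collation "C".
SELECT a || 'foo' FROM collation_demo;

-- An explicit COLLATE clause overrides the column's implicit collation,
-- so this comparison is carried out under "POSIX".
SELECT * FROM collation_demo WHERE a > 'foo' COLLATE "POSIX";

-- A B-tree index built for that collation, intended to serve the
-- query above; the column's own collation would need a separate index.
CREATE INDEX collation_demo_a_posix_idx
    ON collation_demo (a COLLATE "POSIX");

-- Two different explicit collations in one operator call: under the
-- documented combination rules this should be rejected with an error.
SELECT (a COLLATE "C") < ('foo' COLLATE "POSIX") FROM collation_demo;
```

The final statement is included only to show the failure case; it is not expected to return rows.
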
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@ -103,6 +103,11 @@
      <entry>check constraints, unique constraints, primary key constraints, foreign key constraints</entry>
     </row>

+     <row>
+      <entry><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link></entry>
+      <entry>collations (locale information)</entry>
+     </row>
+
     <row>
      <entry><link linkend="catalog-pg-conversion"><structname>pg_conversion</structname></link></entry>
      <entry>encoding conversion information</entry>
@ -1113,6 +1118,16 @@
      </entry>
     </row>

+     <row>
+      <entry><structfield>attcollation</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
+      <entry>
+       The defined collation of the column, or zero if the column does
+       not have a collatable type.
+      </entry>
+     </row>
+
     <row>
      <entry><structfield>attacl</structfield></entry>
      <entry><type>aclitem[]</type></entry>
@ -2050,6 +2065,76 @@

 </sect1>

+ <sect1 id="catalog-pg-collation">
+  <title><structname>pg_collation</structname></title>
+
+  <indexterm zone="catalog-pg-collation">
+   <primary>pg_collation</primary>
+  </indexterm>
+
+  <para>
+   The catalog <structname>pg_collation</structname> describes the
+   available collations, which are essentially mappings from an SQL
+   name to operating system locale categories.
+   See <xref linkend="locale"> for more information.
+  </para>
+
+  <table>
+   <title><structname>pg_collation</> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>collname</structfield></entry>
+      <entry><type>name</type></entry>
+      <entry></entry>
+      <entry>Collation name (unique per namespace and encoding)</entry>
+     </row>
+
+     <row>
+      <entry><structfield>collnamespace</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-namespace"><structname>pg_namespace</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the namespace that contains this collation
+      </entry>
+     </row>
+
+     <row>
+      <entry><structfield>collencoding</structfield></entry>
+      <entry><type>int4</type></entry>
+      <entry></entry>
+      <entry>Encoding to which the collation is applicable</entry>
+     </row>
+
+     <row>
+      <entry><structfield>collcollate</structfield></entry>
+      <entry><type>name</type></entry>
+      <entry></entry>
+      <entry>LC_COLLATE for this collation object</entry>
+     </row>
+
+     <row>
+      <entry><structfield>collctype</structfield></entry>
+      <entry><type>name</type></entry>
+      <entry></entry>
+      <entry>LC_CTYPE for this collation object</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+ </sect1>
+
 <sect1 id="catalog-pg-conversion">
  <title><structname>pg_conversion</structname></title>

@ -3125,6 +3210,16 @@
      </entry>
     </row>

+     <row>
+      <entry><structfield>indcollation</structfield></entry>
+      <entry><type>oidvector</type></entry>
+      <entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
+      <entry>
+       For each column in the index key, this contains the OID of the
+       collation to use for the index.
+      </entry>
+     </row>
+
     <row>
      <entry><structfield>indclass</structfield></entry>
      <entry><type>oidvector</type></entry>
@ -5866,6 +5961,21 @@
       </para></entry>
     </row>

+     <row>
+      <entry><structfield>typcollation</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-collation"><structname>pg_collation</structname></link>.oid</literal></entry>
+      <entry><para>
+       <structfield>typcollation</structfield> specifies the collation
+       of the type.  If a type does not support collations, this will
+       be zero, collation analysis at parse time is skipped, and
+       the use of <literal>COLLATE</literal> clauses with the type is
+       invalid.  A base type that supports collations will have
+       <symbol>DEFAULT_COLLATION_OID</symbol> here.  A domain can have
+       another collation OID, if one was defined for the domain.
+      </para></entry>
+     </row>
+
     <row>
      <entry><structfield>typdefaultbin</structfield></entry>
      <entry><type>pg_node_tree</type></entry>
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@ -304,6 +304,170 @@ initdb --locale=sv_SE
 </sect1>


+ <sect1 id="collation">
+  <title>Collation Support</title>
+
+  <para>
+   The collation support allows specifying the sort order and certain
+   other locale aspects of data per column or per operation at run
+   time.  This alleviates the problem that the
+   <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol> settings
+   of a database cannot be changed after its creation.
+  </para>
+
+  <note>
+   <para>
+    The collation support feature is currently known to work only on
+    Linux/glibc and Mac OS X platforms.
+   </para>
+  </note>
+
+  <sect2>
+   <title>Concepts</title>
+
+   <para>
+    Conceptually, every datum of a collatable data type has a
+    collation.  (Collatable data types in the base system are
+    <type>text</type>, <type>varchar</type>, and <type>char</type>.
+    User-defined base types can also be marked collatable.)  If the
+    datum is a column reference, the collation of the datum is the
+    defined collation of the column.  If the datum is a constant, the
+    collation is the default collation of the data type of the
+    constant.  The collation of more complex expressions is derived
+    from the input collations as described below.
+   </para>
+
+   <para>
+    The collation of a datum can also be the <quote>default</quote>
+    collation, which reverts to the locale settings defined for the
+    database.  In some cases, a datum can also have no known
+    collation.  In such cases, ordering operations and other
+    operations that need to know the collation will fail.
+   </para>
+
+   <para>
+    When the database system has to perform an ordering or a
+    comparison, it considers the collation of the input data.  This
+    happens in two situations: an <literal>ORDER BY</literal> clause
+    and a function or operator call such as <literal>&lt;</literal>.
+    The collation used to perform the <literal>ORDER
+    BY</literal> clause is simply the collation of the sort key.  The
+    collation to apply for a function or operator call is derived from
+    the arguments, as described below.  Additionally, collations are
+    taken into account by functions that convert between lower and
+    upper case letters, that is, <function>lower</function>,
+    <function>upper</function>, and <function>initcap</function>.
+   </para>
+
+   <para>
+    For a function call, the collation that is derived from combining
+    the argument collations is used both for performing any
+    comparisons or ordering and as the collation of the function
+    result, if the result type is collatable.
+   </para>
+
+   <para>
+    The <firstterm>collation derivation</firstterm> of a datum can be
+    implicit or explicit.  This distinction affects how collations are
+    combined when multiple different collations appear in an
+    expression.  An explicit collation derivation arises when a
+    <literal>COLLATE</literal> clause is used; all other collation
+    derivations are implicit.  When multiple collations need to be
+    combined, for example in a function call, the following rules are
+    used:
+
+    <orderedlist>
+     <listitem>
+      <para>
+       If any input item has an explicit collation derivation, then
+       all explicitly derived collations among the input items must be
+       the same; otherwise an error is raised.  If an explicitly
+       derived collation is present, that is the result of the
+       collation combination.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       Otherwise, all input items must have the same implicit
+       collation derivation or the default collation.  If an
+       implicitly derived collation is present, that is the result of
+       the collation combination.  Otherwise, the result is the
+       default collation.
+      </para>
+     </listitem>
+    </orderedlist>
+
+    For example, take this table definition:
+<programlisting>
+CREATE TABLE test1 (
+    a text COLLATE "x",
+    ...
+);
+</programlisting>
+
+    Then in
+<programlisting>
+SELECT a || 'foo' FROM test1;
+</programlisting>
+    the result collation of the <literal>||</literal> operator is
+    <literal>"x"</literal> because it combines an implicitly derived
+    collation with the default collation.  But in
+<programlisting>
+SELECT a || ('foo' COLLATE "y") FROM test1;
+</programlisting>
+    the result collation is <literal>"y"</literal> because the explicit
+    collation derivation overrides the implicit one.
+   </para>
+  </sect2>
+
+  <sect2>
+   <title>Managing Collations</title>
+
+   <para>
+    A collation is an SQL schema object that maps an SQL name to
+    operating system locales.  In particular, it maps to a combination
+    of <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>.  (As
+    the name would indicate, the main purpose of a collation is to set
+    <symbol>LC_COLLATE</symbol>, which controls the sort order.  But
+    it is rarely necessary in practice to have an
+    <symbol>LC_CTYPE</symbol> setting that is different from
+    <symbol>LC_COLLATE</symbol>, so it is more convenient to collect
+    these under one concept than to create another infrastructure for
+    setting <symbol>LC_CTYPE</symbol> per datum.)  Also, a collation
+    is tied to a character encoding.  The same collation name may
+    exist for different encodings.
+   </para>
+
+   <para>
+    When a database system is initialized, <command>initdb</command>
+    populates the system catalog <literal>pg_collation</literal> with
+    collations based on all the locales it finds on the operating
+    system at the time.  For example, the operating system might
+    provide a locale named <literal>de_DE.utf8</literal>.
+    <command>initdb</command> would then create a collation named
+    <literal>de_DE.utf8</literal> for encoding <literal>UTF8</literal>
+    that has both <symbol>LC_COLLATE</symbol> and
+    <symbol>LC_CTYPE</symbol> set to <literal>de_DE.utf8</literal>.
+    It will also create a collation with the <literal>.utf8</literal>
+    tag stripped off the name.  So you could also use the collation
+    under the name <literal>de_DE</literal>, which is less cumbersome
+    to write and makes the name less encoding-dependent.  Note that,
+    nevertheless, the initial set of collation names is
+    platform-dependent.
+   </para>
+
+   <para>
+    In case a collation is needed that has different values for
+    <symbol>LC_COLLATE</symbol> and <symbol>LC_CTYPE</symbol>, or a
+    different name is needed for a collation (for example, for
+    compatibility with existing applications), a new collation may be
+    created.  But there is currently no SQL-level support for creating
+    or changing collations.
+   </para>
+  </sect2>
+ </sect1>
+
 <sect1 id="multibyte">
  <title>Character Set Support</title>

--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@ -13059,6 +13059,12 @@ SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);
     </thead>

     <tbody>
+      <row>
+       <entry><literal><function>pg_collation_is_visible(<parameter>collation_oid</parameter>)</function></literal>
+       </entry>
+       <entry><type>boolean</type></entry>
+       <entry>is collation visible in search path</entry>
+      </row>
      <row>
       <entry><literal><function>pg_conversion_is_visible(<parameter>conversion_oid</parameter>)</function></literal>
       </entry>
@ -13123,6 +13129,9 @@ SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);
    </tgroup>
   </table>

+   <indexterm>
+    <primary>pg_collation_is_visible</primary>
+   </indexterm>
   <indexterm>
    <primary>pg_conversion_is_visible</primary>
   </indexterm>
@ -13256,7 +13265,7 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

     <tbody>
      <row>
-       <entry><literal><function>format_type(<parameter>type_oid</parameter>, <parameter>typemod</>)</function></literal></entry>
+       <entry><literal><function>format_type(<parameter>type_oid</parameter> [, <parameter>typemod</> [, <parameter>collation_oid</> ]])</function></literal></entry>
       <entry><type>text</type></entry>
       <entry>get SQL name of a data type</entry>
      </row>
@ -13392,7 +13401,9 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
  <para>
   <function>format_type</function> returns the SQL name of a data type that
   is identified by its type OID and possibly a type modifier.  Pass NULL
-   for the type modifier if no specific modifier is known.
+   for the type modifier or omit the argument if no specific modifier is known.
+   If a collation is given as the third argument, a <literal>COLLATE</> clause
+   followed by a formatted collation name is appended.
  </para>

  <para>
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@ -921,6 +921,7 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
   defining two operator classes for the data type and then selecting
   the proper class when making an index.  The operator class determines
   the basic sort ordering (which can then be modified by adding sort options
+   <literal>COLLATE</literal>,
   <literal>ASC</>/<literal>DESC</> and/or
   <literal>NULLS FIRST</>/<literal>NULLS LAST</>).
  </para>
@ -1002,6 +1003,47 @@ SELECT am.amname AS index_method,
 </sect1>


+ <sect1 id="indexes-collations">
+  <title>Collations and Indexes</title>
+
+  <para>
+   An index can support only one collation per indexed column or
+   expression.  If multiple collations are of interest, multiple
+   indexes may be created.
+  </para>
+
+  <para>
+   Consider these statements:
+<programlisting>
+CREATE TABLE test1c (
+    id integer,
+    content varchar COLLATE "x"
+);
+
+CREATE INDEX test1c_content_index ON test1c (content);
+</programlisting>
+   The created index automatically follows the collation of the
+   underlying column, and so a query of the form
+<programlisting>
+SELECT * FROM test1c WHERE content = <replaceable>constant</replaceable>;
+</programlisting>
+   could use the index.
+  </para>
+
+  <para>
+   If, in addition, a query of the form
+<programlisting>
+SELECT * FROM test1c WHERE content &gt; <replaceable>constant</replaceable> COLLATE "y";
+</programlisting>
+   is of interest, an additional index could be created that supports
+   the <literal>"y"</literal> collation, like so:
+<programlisting>
+CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");
+</programlisting>
+  </para>
+ </sect1>
+
+
 <sect1 id="indexes-examine">
  <title>Examining Index Usage</title>

--- a/doc/src/sgml/ref/create_domain.sgml
+++ b/doc/src/sgml/ref/create_domain.sgml
@ -22,6 +22,7 @@ PostgreSQL documentation
 <refsynopsisdiv>
 <synopsis>
 CREATE DOMAIN <replaceable class="parameter">name</replaceable> [ AS ] <replaceable class="parameter">data_type</replaceable>
+    [ COLLATE <replaceable>collation</replaceable> ]
    [ DEFAULT <replaceable>expression</replaceable> ]
    [ <replaceable class="PARAMETER">constraint</replaceable> [ ... ] ]

@ -83,6 +84,17 @@ CREATE DOMAIN <replaceable class="parameter">name</replaceable> [ AS ] <replacea
      </listitem>
     </varlistentry>

+     <varlistentry>
+      <term><replaceable>collation</replaceable></term>
+      <listitem>
+       <para>
+        An optional collation for the domain.  If no collation is
+        specified, the database default collation is used (which can
+        be overridden when the domain is used to define a column).
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry>
      <term><literal>DEFAULT <replaceable>expression</replaceable></literal></term>

--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@ -22,7 +22,7 @@ PostgreSQL documentation
 <refsynopsisdiv>
 <synopsis>
 CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</replaceable> ] ON <replaceable class="parameter">table</replaceable> [ USING <replaceable class="parameter">method</replaceable> ]
-    ( { <replaceable class="parameter">column</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ <replaceable class="parameter">opclass</replaceable> ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
+    ( { <replaceable class="parameter">column</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ COLLATE <replaceable class="parameter">collation</replaceable> ] [ <replaceable class="parameter">opclass</replaceable> ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( <replaceable class="PARAMETER">storage_parameter</replaceable> = <replaceable class="PARAMETER">value</replaceable> [, ... ] ) ]
    [ TABLESPACE <replaceable class="parameter">tablespace</replaceable> ]
    [ WHERE <replaceable class="parameter">predicate</replaceable> ]
@ -181,6 +181,20 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable class="parameter">name</
      </listitem>
     </varlistentry>

+     <varlistentry>
+      <term><replaceable class="parameter">collation</replaceable></term>
+      <listitem>
+       <para>
+        The name of the collation to use for the index.  By default,
+        the index uses the collation declared for the column to be
+        indexed or the result collation of the expression to be
+        indexed.  Indexes with nondefault collations can be
+        useful for queries that involve expressions using
+        nondefault collations.
+       </para>
+      </listitem>
+     </varlistentry>
+
     <varlistentry>
      <term><replaceable class="parameter">opclass</replaceable></term>
      <listitem>
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@ -22,7 +22,7 @@ PostgreSQL documentation
 <refsynopsisdiv>
 <synopsis>
 CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXISTS ] <replaceable class="PARAMETER">table_name</replaceable> ( [
-  { <replaceable class="PARAMETER">column_name</replaceable> <replaceable class="PARAMETER">data_type</replaceable> [ <replaceable class="PARAMETER">column_constraint</replaceable> [ ... ] ]
+  { <replaceable class="PARAMETER">column_name</replaceable> <replaceable class="PARAMETER">data_type</replaceable> [ COLLATE <replaceable>collation</replaceable> ] [ <replaceable class="PARAMETER">column_constraint</replaceable> [ ... ] ]
    | <replaceable>table_constraint</replaceable>
    | LIKE <replaceable>parent_table</replaceable> [ <replaceable>like_option</replaceable> ... ] }
    [, ... ]
@ -244,6 +244,17 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
    </listitem>
   </varlistentry>

+   <varlistentry>
+    <term><literal>COLLATE <replaceable>collation</replaceable></literal></term>
+    <listitem>
+     <para>
+      The <literal>COLLATE</> clause assigns a nondefault collation to
+      the column.  By default, the locale settings of the database are
+      used.
+     </para>
+    </listitem>
+   </varlistentry>
+
   <varlistentry>
    <term><literal>INHERITS ( <replaceable>parent_table</replaceable> [, ... ] )</literal></term>
    <listitem>
--- a/doc/src/sgml/ref/create_type.sgml
+++ b/doc/src/sgml/ref/create_type.sgml
@ -45,6 +45,7 @@ CREATE TYPE <replaceable class="parameter">name</replaceable> (
    [ , DEFAULT = <replaceable class="parameter">default</replaceable> ]
    [ , ELEMENT = <replaceable class="parameter">element</replaceable> ]
    [ , DELIMITER = <replaceable class="parameter">delimiter</replaceable> ]
+    [ , COLLATABLE = <replaceable class="parameter">collatable</replaceable> ]
 )

 CREATE TYPE <replaceable class="parameter">name</replaceable>
@ -352,6 +353,16 @@ CREATE TYPE <replaceable class="parameter">name</replaceable>
   with the array element type, not the array type itself.
  </para>

+  <para>
+   If the optional
+   parameter <replaceable class="parameter">collatable</replaceable>
+   is true, column definitions and expressions of the type may carry
+   collation information and allow the use of
+   the <literal>COLLATE</literal> clause.  It is up to the
+   implementations of the functions operating on the type to actually
+   make use of the collation information; this does not happen
+   automatically merely by marking the type collatable.
+  </para>
  </refsect2>

  <refsect2>
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@ -236,6 +236,27 @@ gmake check LANG=C MULTIBYTE=EUC_JP
    existing installation.
   </para>
  </sect2>
+
+  <sect2>
+   <title>Extra tests</title>
+
+   <para>
+    The regression test suite contains a few test files that are not
+    run by default, because they might be platform-dependent or take a
+    very long time to run.  You can run these or other extra test
+    files by setting the variable <envar>EXTRA_TESTS</envar>.  For
+    example, to run the <literal>numeric_big</literal> test:
+<screen>
+gmake check EXTRA_TESTS=numeric_big
+</screen>
+    To run the collation tests:
+<screen>
+gmake check EXTRA_TESTS=collate.linux.utf8 LANG=en_US.utf8
+</screen>
+    This test works only on Linux/glibc platforms and when run in a
+    UTF-8 locale.
+   </para>
+  </sect2>
  </sect1>

  <sect1 id="regress-evaluation">
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@ -1899,6 +1899,54 @@ CAST ( <replaceable>expression</replaceable> AS <replaceable>type</replaceable>
   </note>
  </sect2>

+  <sect2 id="sql-syntax-collate-clause">
+   <title>COLLATE Clause</title>
+
+   <indexterm>
+    <primary>COLLATE</primary>
+   </indexterm>
+
+   <para>
+    The <literal>COLLATE</literal> clause overrides the collation of
+    an expression.  It is appended to the expression it applies to:
+<synopsis>
+<replaceable>expr</replaceable> COLLATE <replaceable>collation</replaceable>
+</synopsis>
+    where <replaceable>collation</replaceable> is a possibly
+    schema-qualified identifier.  The <literal>COLLATE</literal>
+    clause binds tighter than operators; parentheses can be used when
+    necessary.
+   </para>
+
+   <para>
+    If no collation is explicitly specified, the database system
+    either derives a collation from the columns involved in the
+    expression, or it defaults to the default collation of the
+    database if no column is involved in the expression.
+   </para>
+
+   <para>
+    The two typical uses of the <literal>COLLATE</literal> clause are
+    overriding the sort order in an <literal>ORDER BY</> clause, for
+    example:
+<programlisting>
+SELECT a, b, c FROM tbl WHERE ... ORDER BY a COLLATE "C";
+</programlisting>
+    and overriding the collation of a function or operator call that
+    has locale-sensitive results, for example:
+<programlisting>
+SELECT * FROM tbl WHERE a &gt; 'foo' COLLATE "C";
+</programlisting>
+    In the latter case it doesn't matter which argument of the
+    operator or function call the <literal>COLLATE</> clause is
+    attached to, because the collation that is applied by the operator
+    or function is derived from all arguments, and
+    the <literal>COLLATE</> clause will override the collations of all
+    other arguments.  Attaching nonmatching <literal>COLLATE</>
+    clauses to more than one argument, however, is an error.
+   </para>
+  </sect2>
+
  <sect2 id="sql-syntax-scalar-subqueries">
   <title>Scalar Subqueries</title>