1
0
mirror of https://github.com/postgres/postgres.git synced 2025-08-28 18:48:04 +03:00
Commit Graph

25 Commits

Author SHA1 Message Date
Jeff Davis
d3d0983169 Support PG_UNICODE_FAST locale in the builtin collation provider.
The PG_UNICODE_FAST locale uses code point sort order (fast,
memcmp-based) combined with Unicode character semantics. The character
semantics are based on Unicode full case mapping.

Full case mapping can map a single codepoint to multiple codepoints,
such as "ß" uppercasing to "SS". Additionally, it handles
context-sensitive mappings like the "final sigma", and it uses
titlecase mappings such as "Dž" when titlecasing (rather than plain
uppercase mappings).

Importantly, the uppercasing of "ß" as "SS" is specifically mentioned
by the SQL standard. In Postgres, UCS_BASIC uses plain ASCII semantics
for case mapping and pattern matching, so if we changed it to use the
PG_UNICODE_FAST locale, it would offer better compliance with the
standard. For now, though, do not change the behavior of UCS_BASIC.

Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite
2025-01-17 15:56:30 -08:00
Peter Eisentraut
4a65e1c212 doc: Fix description of deterministic flag of CREATE COLLATION
The documentation said that you need to pick a suitable LC_COLLATE
setting in addition to setting the DETERMINISTIC flag.  This would
have been correct if the libc provider supported nondeterministic
collations, but since it doesn't, you actually need to set the LOCALE
option.

Reviewed-by: Kashif Zeeshan <kashi.zeeshan@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a71023c2-0ae0-45ad-9688-cf3b93d0d65b%40eisentraut.org
2024-05-02 08:21:18 +02:00
Jeff Davis
f69319f2f1 Support C.UTF-8 locale in the new builtin collation provider.
The builtin C.UTF-8 locale has similar semantics to the libc locale of
the same name. That is, code point sort order (fast, memcmp-based)
combined with Unicode semantics for character operations such as
pattern matching, regular expressions, and
LOWER()/INITCAP()/UPPER(). The character semantics are based on
Unicode simple case mappings.

The builtin provider's C.UTF-8 offers several important advantages
over libc:

 * faster sorting -- benefits from additional optimizations such as
   abbreviated keys and varstrfastcmp_c
 * faster case conversion, e.g. LOWER(), at least compared with some
   libc implementations
 * available on all platforms with identical semantics, and the
   semantics are stable, testable, and documentable within a given
   Postgres major version

Being based on memcmp, the builtin C.UTF-8 locale does not offer
natural language sort order. But it is an improvement for most use
cases that might otherwise use libc's "C.UTF-8" locale, as well as
many use cases that use libc's "C" locale.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
2024-03-19 15:24:41 -07:00
Jeff Davis
2d819a08a1 Introduce "builtin" collation provider.
New provider for collations, like "libc" or "icu", but without any
external dependency.

Initially, the only locale supported by the builtin provider is "C",
which is identical to the libc provider's "C" locale. The libc
provider's "C" locale has always been treated as a special case that
uses an internal implementation, without using libc at all -- so the
new builtin provider uses the same implementation.

The builtin provider's locale is independent of the server environment
variables LC_COLLATE and LC_CTYPE. Using the builtin provider, the
database collation locale can be "C" while LC_COLLATE and LC_CTYPE are
set to "en_US", which is impossible with the libc provider.

By offering a new builtin provider, it clarifies that the semantics of
a collation using this provider will never depend on libc, and makes
it easier to document the behavior.

Discussion: https://postgr.es/m/ab925f69-5f9d-f85e-b87c-bd2a44798659@joeconway.com
Discussion: https://postgr.es/m/dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
2024-03-13 23:33:44 -07:00
Peter Eisentraut
17ec2c5dfa doc: Add more ICU rules examples
In particular, add an example EBCDIC collation.

Author: Daniel Verite <daniel@manitou-mail.org>
Discussion: https://www.postgresql.org/message-id/flat/35cc1684-e516-4a01-a256-351632d47066@manitou-mail.org
2023-08-23 11:23:42 +02:00
Jeff Davis
a14e75eb0b CREATE DATABASE: make LOCALE apply to all collation providers.
For CREATE DATABASE, make LOCALE parameter apply regardless of the
provider used. Also affects initdb and createdb --locale arguments.

Previously, LOCALE (and --locale) only affected the database default
collation when using the libc provider.

Discussion: https://postgr.es/m/1a63084d-221e-4075-619e-6b3e590f673e@enterprisedb.com
Reviewed-by: Peter Eisentraut
2023-06-16 10:27:32 -07:00
Peter Eisentraut
cd42785974 doc: Better example for custom ICU rules
Use a more practical example, and also add some explanation.

Reported-by: Jeff Davis <pgsql@j-davis.com>
2023-03-10 09:25:03 +01:00
Peter Eisentraut
30a53b7929 Allow tailoring of ICU locales with custom rules
This exposes the ICU facility to add custom collation rules to a
standard collation.

New options are added to CREATE COLLATION, CREATE DATABASE, createdb,
and initdb to set the rules.

Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Discussion: https://www.postgresql.org/message-id/flat/821c71a4-6ef0-d366-9acf-bb8e367f739f@enterprisedb.com
2023-03-08 16:56:37 +01:00
Jeff Davis
a8a44828a2 Correct docs for the default locale_provider of a new database.
If the locale provider is not specified, it defaults to be the same as
the template from which it was created. Previously, the documentation
said the default was libc.

Also adjust wording of CREATE DATABASE and CREATE COLLATION docs to be
definite that there are exactly two possible collation providers.

Discussion: https://postgr.es/m/6befdaada61c046b67f3b269f7fa6f069a35803e.camel%40j-davis.com
Reviewed-by: Nathan Bossart
2023-02-13 17:16:13 -08:00
Thomas Munro
ec48314708 Revert per-index collation version tracking feature.
Design problems were discovered in the handling of composite types and
record types that would cause some relevant versions not to be recorded.
Misgivings were also expressed about the use of the pg_depend catalog
for this purpose.  We're out of time for this release so we'll revert
and try again.

Commits reverted:

1bf946bd: Doc: Document known problem with Windows collation versions.
cf002008: Remove no-longer-relevant test case.
ef387bed: Fix bogus collation-version-recording logic.
0fb0a050: Hide internal error for pg_collation_actual_version(<bad OID>).
ff942057: Suppress "warning: variable 'collcollate' set but not used".
d50e3b1f: Fix assertion in collation version lookup.
f24b1569: Rethink extraction of collation dependencies.
257836a7: Track collation versions for indexes.
cd6f479e: Add pg_depend.refobjversion.
7d1297df: Remove pg_collation.collversion.

Discussion: https://postgr.es/m/CA%2BhUKGLhj5t1fcjqAu8iD9B3ixJtsTNqyCCD4V0aTO9kAKAjjA%40mail.gmail.com
2021-05-07 21:10:11 +12:00
Thomas Munro
7d1297df08 Remove pg_collation.collversion.
This model couldn't be extended to cover the default collation, and
didn't have any information about the affected database objects when the
version changed.  Remove, in preparation for a follow-up commit that
will add a new mechanism.

Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
2020-11-03 00:44:59 +13:00
Bruce Momjian
8d4b23fcae doc: make ref/*.sgml file header comment layout consistent 2020-05-15 08:52:24 -04:00
Peter Eisentraut
5e1963fb76 Collations with nondeterministic comparison
This adds a flag "deterministic" to collations.  If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal.  That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.

This functionality is only supported with the ICU provider.  At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.

The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).

This patch makes changes in three areas:

- CREATE COLLATION DDL changes and system catalog changes to support
  this new flag.

- Many executor nodes and auxiliary code are extended to track
  collations.  Previously, this code would just throw away collation
  information, because the eventually-called user-defined functions
  didn't use it since they only cared about equality, which didn't
  need collation information.

- String data type functions that do equality comparisons and hashing
  are changed to take the (non-)deterministic flag into account.  For
  comparison, this just means skipping various shortcuts and tie
  breakers that use byte-wise comparison.  For hashing, we first need
  to convert the input string to a canonical "sort key" using the ICU
  analogue of strxfrm().

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
2019-03-22 12:12:43 +01:00
Peter Eisentraut
70de0abdb7 doc: Improve CREATE COLLATION locking documentation
Move out of the concurrency control chapter, where mostly only user
table locks are discussed, and move to CREATE COLLATION reference page.

Author: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
2018-07-30 22:09:48 +02:00
Peter Eisentraut
3c49c6facb Convert documentation to DocBook XML
Since some preparation work had already been done, the only source
changes left were changing empty-element tags like <xref linkend="foo">
to <xref linkend="foo"/>, and changing the DOCTYPE.

The source files are still named *.sgml, but they are actually XML files
now.  Renaming could be considered later.

In the build system, the intermediate step to convert from SGML to XML
is removed.  Everything is build straight from the source files again.
The OpenSP (or the old SP) package is no longer needed.

The documentation toolchain instructions are updated and are much
simpler now.

Peter Eisentraut, Alexander Lakhin, Jürgen Purtz
2017-11-23 09:44:28 -05:00
Peter Eisentraut
1ff01b3902 Convert SGML IDs to lower case
IDs in SGML are case insensitive, and we have accumulated a mix of upper
and lower case IDs, including different variants of the same ID.  In
XML, these will be case sensitive, so we need to fix up those
differences.  Going to all lower case seems most straightforward, and
the current build process already makes all anchors and lower case
anyway during the SGML->XML conversion, so this doesn't create any
difference in the output right now.  A future XML-only build process
would, however, maintain any mixed case ID spellings in the output, so
that is another reason to clean this up beforehand.

Author: Alexander Lakhin <exclusion@gmail.com>
2017-10-20 19:26:10 -04:00
Peter Eisentraut
c29c578908 Don't use SGML empty tags
For DocBook XML compatibility, don't use SGML empty tags (</>) anymore,
replace by the full tag name.  Add a warning option to catch future
occurrences.

Alexander Lakhin, Jürgen Purtz
2017-10-17 15:10:33 -04:00
Peter Eisentraut
f41bd4cb90 Expand collation documentation
Document better how to create custom collations and what locale strings
ICU accepts.  Explain the ICU examples in more detail.  Also update the
text on the CREATE COLLATION reference page a bit to take ICU more into
account.
2017-10-02 11:51:45 -04:00
Tom Lane
de0c60b7d3 Doc: minor improvements for collation-related man pages. 2017-06-25 12:27:04 -04:00
Peter Eisentraut
eccfef81e1 ICU support
Add a column collprovider to pg_collation that determines which library
provides the collation data.  The existing choices are default and libc,
and this adds an icu choice, which uses the ICU4C library.

The pg_locale_t type is changed to a union that contains the
provider-specific locale handles.  Users of locale information are
changed to look into that struct for the appropriate handle to use.

Also add a collversion column that records the version of the collation
when it is created, and check at run time whether it is still the same.
This detects potentially incompatible library upgrades that can corrupt
indexes and other structures.  This is currently only supported by
ICU-provided collations.

initdb initializes the default collation set as before from the `locale
-a` output but also adds all available ICU locales with a "-x-icu"
appended.

Currently, ICU-provided collations can only be explicitly named
collations.  The global database locales are still always libc-provided.

ICU support is enabled by configure --with-icu.

Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
2017-03-23 15:28:48 -04:00
Peter Eisentraut
6d16ecc646 Add CREATE COLLATION IF NOT EXISTS clause
The core of the functionality was already implemented when
pg_import_system_collations was added.  This just exposes it as an
option in the SQL command.
2017-02-15 10:01:28 -05:00
Peter Eisentraut
bb4eefe7bf doc: Improve DocBook XML validity
DocBook XML is superficially compatible with DocBook SGML but has a
slightly stricter DTD that we have been violating in a few cases.
Although XSLT doesn't care whether the document is valid, the style
sheets don't necessarily process invalid documents correctly, so we need
to work toward fixing this.

This first commit moves the indexterms in refentry elements to an
allowed position.  It has no impact on the output.
2014-02-23 21:31:08 -05:00
Tom Lane
9e9b9ac7d1 Make a code-cleanup pass over the collations patch.
This patch is almost entirely cosmetic --- mostly cleaning up a lot of
neglected comments, and fixing code layout problems in places where the
patch made lines too long and then pgindent did weird things with that.
I did find a bug-of-omission in equalTupleDescs().
2011-04-22 17:43:18 -04:00
Tom Lane
a612b17120 Assorted editing for collation documentation.
I made a pass over this to familiarize myself with the feature, and found
some things that could be improved.
2011-03-08 17:10:59 -05:00
Peter Eisentraut
b313bca0af DDL support for collations
- collowner field
- CREATE COLLATION
- ALTER COLLATION
- DROP COLLATION
- COMMENT ON COLLATION
- integration with extensions
- pg_dump support for the above
- dependency management
- psql tab completion
- psql \dO command
2011-02-12 15:55:18 +02:00