Add to locale TODO.detail.

2025-10-29 22:49:41 +03:00 · 2006-06-14 23:28:22 +00:00
parent a9d2cc4d76
commit e1e133f264
1 changed files with 377 additions and 0 deletions
--- a/doc/TODO.detail/locale
+++ b/doc/TODO.detail/locale
@@ -2662,3 +2662,380 @@ TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
 From pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org Tue Dec 27 06:30:14 2005
 Return-path: <pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org>
 Received: from ams.hub.org (ams.hub.org [200.46.204.13])
 	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBRDUET16504
 	for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 08:30:14 -0500 (EST)
 Received: from postgresql.org (postgresql.org [200.46.204.71])
 	by ams.hub.org (Postfix) with ESMTP id 3BCBA67A57F
 	for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 09:30:14 -0400 (AST)
 X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 Received: from localhost (av.hub.org [200.46.204.144])
 	by postgresql.org (Postfix) with ESMTP id 1DB3B9DC859
 	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Dec 2005 09:29:47 -0400 (AST)
 Received: from postgresql.org ([200.46.204.71])
 	by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
 	with ESMTP id 89101-04
 	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 	Tue, 27 Dec 2005 09:29:50 -0400 (AST)
 X-Greylist: from auto-whitelisted by SQLgrey-
 Received: from svana.org (svana.org [203.20.62.76])
 	by postgresql.org (Postfix) with ESMTP id 319839DC8A5
 	for <pgsql-hackers@postgresql.org>; Tue, 27 Dec 2005 09:29:44 -0400 (AST)
 Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian))
 	id 1ErEte-0001lm-00; Wed, 28 Dec 2005 00:29:46 +1100
 Date: Tue, 27 Dec 2005 14:29:46 +0100
 From: Martijn van Oosterhout <kleptog@svana.org>
 To: pgsql-hackers@postgresql.org
 Subject: [HACKERS] Proposed COLLATE implementation
 Message-ID: <20051227132941.GA32404@svana.org>
 Reply-To: Martijn van Oosterhout <kleptog@svana.org>
 MIME-Version: 1.0
 Content-Type: multipart/signed; micalg=pgp-sha1;
 	protocol="application/pgp-signature"; boundary="W/nzBZO5zC0uMSeA"
 Content-Disposition: inline
 User-Agent: Mutt/1.3.28i
 X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6
 X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522  48A7 6394 F08A 0DC6 7BE6
 X-PGP-Key-URL: <http://svana.org/kleptog/0DC67BE6.pgp.asc>
 X-Virus-Scanned: by amavisd-new at hub.org
 X-Spam-Status: No, score=0.08 required=5 tests=[AWL=0.080]
 X-Spam-Score: 0.08
 X-Mailing-List: pgsql-hackers
 List-Archive: <http://archives.postgresql.org/pgsql-hackers>
 List-Help: <mailto:majordomo@postgresql.org?body=help>
 List-Id: <pgsql-hackers.postgresql.org>
 List-Owner: <mailto:pgsql-hackers-owner@postgresql.org>
 List-Post: <mailto:pgsql-hackers@postgresql.org>
 List-Subscribe: <mailto:majordomo@postgresql.org?body=sub%20pgsql-hackers>
 List-Unsubscribe: <mailto:majordomo@postgresql.org?body=unsub%20pgsql-hackers>
 Precedence: bulk
 Sender: pgsql-hackers-owner@postgresql.org
 Status: OR
 --W/nzBZO5zC0uMSeA
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: inline
 Content-Transfer-Encoding: quoted-printable
 Greetings all,
 If you're not interested in COLLATE, operator classes or related
 things, stop now, this is quite a long email.
 Firstly, status. PostgreSQL doesn't really support collations at all.
 The order of strings is defined at initdb time by the locale then and
 cannot be changed later. We allow lists to be sorted in either
 ascending or decending order but that's about it. Whatever order there
 is is deduced from b-tree operator classes.
 The purpose of this patch is to raise collations to (reasonably) first
 class object. The idea is that you can define a collation across any
 type and that you will then be able to ORDER BY, GROUP BY and INDEX
 using that collation. A collation defines both order and equality.
 The SQL standard does define COLLATE although they only apply that to
 character strings. There are no predefined collations in the standard.
 There are rules about how collations and collation states propegate
 from the leaves of the parse tree all the way to the root. In its
 simplest form, columns and constants have defined collations which
 modify the behaviour of functions using these values. At any point in
 the parse tree the user can override the collation with an explicit
 <collate clause>. If there is ambiguity about what collation applies at
 any point for a function that needs to know, this is a error.
 All this is parse-time analysis.
 Proposed Implementation:
 NODES
 To implement the above, two new node types are created: CollateClause
 which represents the <collate clause> in SQL syntax, and CollateState
 which represents the actual state at any node. Currently the only nodes
 expected to require these are OpExpr, FuncExpr, Var and Const. Although
 I guess it may apply to any node that can be used in an expression.
 CATALOG CHANGES
 To track collations requires a new table in the catalog, which I have
 named pg_collations. It contains the following fields:
 Oid oid;           -- OID for this collation
 Name collname;     -- Name of the collation (for collate clause)
 bool collasc;      -- Ascending or descending
 Oid  collopclass;  -- Implementing Operator Class
 int4 colltype;     -- Currently, 0=3Dsimple, 1=3Duses locale
 Oid colllocale;    -- Locale in pg_locales
 (Should we be identifying the type here? or is it ok to lookup the type
 via the operator class).
 The first few fields name the collation so it can be referred to by the
 user. Then the collasc field determines how to use the operator class
 as given in the collopclass field. If it indicates descending order, it
 will invert the sense of the operator class. For example, asking for
 the GT op for a reverse collation will actually return the LT operator
 for the operator class.
 The purpose of the colltype and colllocale fields are described further
 down.
 The important thing at this point is that by specifying a collation you
 are also specifying an operator class. At the moment the ascending and
 descending collations for each type are hard-coded for initdb. At the
 moment they have been allocated OIDs starting at 2800, which is the
 first large available block.
 Each column of a table has a default collation, which defaults to the
 default collation of the type but can be specified in the table
 declaration. To store this requires an additional column in
 pg_attribute (attcollate) which contains the OID of the collation for
 that column. When it is referenced in a query, this collation is copied
 to the CollateState node of the Var node, from whence it can affect the
 query.
 Finally, to allow the parser to complain about ambiguous CollateStates,
 we need to indicate which functions actually need a senseble collate
 state to function. To this end a single boolean field has been added to
 pg_proc (proneedcollate). If this is true, the parser should error out
 when the collation state is COLLATE_NONE.
 INDEXES
 Another place you will be allowed to use the collate clause is while
 creating indexes. If you declare an index using a particular collation,
 it can be used in queries that order by the same collation. Note, that
 the collate clause indicates the operator class, so you can either
 specify one or the other, but not both.
 So each column of an index will also have a collation. However,
 pg_attribute has already got an extra field to store the collation for
 columns so it makes sense to store the collation here. In the process
 the pg_index.indclass field becomes redundant as it can be inferred
 from the pg_attribute rows associated with the index.
 To make this work there also needs to be a notion of compatability
 between collations. For example, two collations which are the reverse
 of eachother are compatable in the sense that an index defined with one
 collation would be usable for the other simply by scanning in reverse.
 FUNCTIONS
 In particular for string comparison but also possibly for user-defined
 types, a function will need to know what collation it is operating
 under. For this purpose an extra field (fn_collate) is added to
 FmgrInfo which is filled in with the collation from the expression tree
 (if any) or wherever relevent (eg. from the pg_attribute column when
 doing statistics or creating indexes).
 A PG_GETCOLLATE() macro is added to facilitate user-functions
 retreiving this data. This function throws an error when no collation
 has been defined. This shouldn't happen in practice as issues should
 have been weeded out at parse-time.
 This macro returns the OID of the collation but in many cases it will
 not be necessary. In particular, functions should NOT invert their
 result if the collation is inverted. It is considered the
 responsibility of the caller to invert the result if necessary. The
 reasons for this are:
 1. In most cases that matter (order comparison) the issue can be dealt
 with at parse time by the NEGATOR or COMMUTATOR options.
 2. For index scans, we would just do a reverse scan instead (or forward
 if the index is inverted)
 3. Requiring every function to check the collation for inversion is
 wasteful, since in many cases the case can be dealt with statically.
 DEFAULT COLLATIONS
 At this point I'm inclined to define a few collations to be built in or
 specially handled:
 COLLATE ASC    - default collation for type, ascending (ie, what we do now)
 COLLATE DESC   - default collation for type, inverted
 COLLATE POSIX  - For strings, define a simple bytewise string comparison.
 Indeed, it is expected that by default, all columns involving strings
 in the system catalog will always use COLLATE POSIX. Additionaly, type
 "name" will always use that collation, even if the user changes the
 default (by a method to be specified). This is straightforwardly done
 at initdb time.
 The purpose of COLLATE DESC is to simplify index declarations. Saying
 CREATE INDEX foo ON bar( a COLLATE ASC, b COLLATE DESC );
 would allow it to be used in a query using ORDER BY a, b DESC, without
 the user having to lookup the name of the collation.
 When it comes to naming collations, the question arises whether
 ascending/descending collations need to have different names. Or
 should there be two collations with the same name with ASC/DESC as a
 modifier? Do collations have to be unique across different types; for
 example, can varchar and text both have a collation "ignorecase"?
 Another issue is that a column could be declared with a descending
 collation by default. Say it was an integer column, then (a < 5) would
 return FALSE for a =3D 1. While technically correct, I'm thinking of
 ruling it out on the basis of being too confusing, and only allow
 descending collations at query time or in index specifications.
 Another strange point at the moment is how to determine the default
 collation of a type. At the moment it is done by finding the default
 operator class and looking up the ascending version of that. However,
 we may want to provide the user a way of specifying it directly,
 perhaps by:
 ALTER TYPE text SET DEFAULT COLLATION ignorecase;
 PATHKEYS
 Currently during planning, pathkeys are indicated by an operator of the
 operator class. Here we would simply replace that with the oid of the
 collation, which can be matched directly with the collation defined by
 the index.
 USER DEFINED TYPES
 None of this is interesting unless it can be applied to user-defined
 types also. Fortunatly this is easy, when the user declares a b-tree
 operator class, we can generate the collations automatically. We may
 even allow the user to specify the name of the collation. However, if
 the user wanted to create multiple collations based on the same
 operator class (by using the PG_GETCOLLATE() macro above, we may want
 to provide them a way of creating them directly.
 COLLATIONS USING LOCALES
 For strings, collation can be done in many different ways defined by
 what is referred to as a locale. As indicated above in the definition
 of pg_collations, there is a colltype field. For most collations this
 will be 0 (simple collation). However, for strings the intention is to
 use a type 1 (using locales). In this case the last column refers to
 the OID of the locale, so you can many collations using the same
 operator class, but different locale oids. On a system level it changes
 nothing, but inside the functions implementing it, they should use
 PG_GETLOCALE(). This will return an opaque pg_locale_t (see below)
 handle which can then be used to implement the specifics.
 In principle, user-defined types need to be able to use this also,
 perhaps by using the clause COLLATE USING LOCALE in the operator class.
 In theory there should a collation for each combination of
 locale-dependant datatype, locale and order ascending/descending.
 How/when these are created has not yet be determined.
 MORE TYPES OF COLLATION
 Another collation type I've speculated about but not thought about
 implementing is a "mapping collation", in which you map the values
 through a function and then collate that. The obvious example would be
 a case-insensetive mapping where lower is applied before collation.
 Implementation could be pretty much done by simply substituting the
 functions into the parse tree. For example, if you defined something
 like:
 CREATE COLLATION ignorecase ON text USING lower($1) COLLATE defaulttext;
 Then anytime you did a comparison with that collation, you would simply
 insert those function calls into the parse-tree and then collate with
 "defaulttext". When declaring an index you would just make it a
 functional index. The rules for functional indexes should make it work
 out-of-the-box.
 OTHER TECHNICAL ISSUES
 - Applying a COLLATE clause to an unknown literal causes it to be
 coerced to the type that collation is based on. But what about if we
 have something like COLLATE DESC?
 - This requires some changes in the bootstrap procedures given that we
 need to be able to do lookups on the operator class for each type
 fairly early on. Some are predefined but it does require moving the
 opclass setup further up the list. However, if we store a default
 collation in pg_type, we wouldn't need to do that.
 - Sorting arrays. Should they get their own collations, or should
 they use the collations of their base types.
 LOCALES
 I've left this to the end because I don't want people distracted by
 what is essentially a side-issue. While this would be needed to
 implement COLLATE the way the SQL spec intended, it can actually be
 implemented and dealt with as a seperate patch. The main reason a basic
 implementation exists is that it provides a great way of finding places
 that didn't define a collation, since any comparison involving "text"
 requires one.
 To deal with locales I created another table in the catalog,
 pg_locales. This provides an OID which can be referenced from
 elsewhere, such as the pg_collations table.
 The design is intended to provide some pluggability, so locale
 information can come from multiple sources. Also, each locale will be
 referenced by an identifier which is unrelated to any external
 identifier, so we're not bound by them.
 The columns defined currently are:
 Name	locname       - Identifier used by postgresql
 Name	locsysname    - String identifying the locale for the locale provider
 int4	locsource     - System providing this locale
 int4	locencoding   - Encoding expected by provider
 It is expected that the list of sources for locale data will be short,
 probably hard-coded into the backend (currenty internal/system/icu).
 The only locale defined at startup is POSIX, which is implemented
 internally. The intention is for any other locales to be defined at the
 end of initdb. The expected syntax is something like:
 CREATE LOCALE hungarian AS 'hu_HU' USING glibc;
 This should use the provider to check the locale exists and has a
 conpatible encoding. If so it is entered into the table ready for use.
 In the backend, there will be implementations of functions like
 pg_strcoll_l, pg_localeconv_l, which work like the C system library
 versions only they take an extra pg_locale_t argument. This is used to
 dispatch the call to the right place. There will be a function to
 quickly determine if a locale is C to shortcircuit complexity where it
 is not needed.
 STATUS
 Implementation so far is available here:
 http://svana.org/kleptog/temp/collate-current.patch.gz
 This patch isn't "clean" and changes a few things that are not strictly
 necessary. It won't finish initdb right now because it gets an error in
 ANALYSE (the array issue above).
 Feedback, help, comments: please reply.
 Have a nice day,
 --=20
 Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
 > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 > tool for doing 5% of the work and then sitting around waiting for someone
 > else to do the other 95% so you can sue them.
 --W/nzBZO5zC0uMSeA
 Content-Type: application/pgp-signature
 Content-Disposition: inline
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.0.6 (GNU/Linux)
 Comment: For info see http://www.gnupg.org
 iD8DBQFDsUHDIB7bNG8LQkwRAnh0AJ0YUNLkVaSY3u0jWBPdlaq+9dujZACfTmis
 JR1mF60lKx14Ih850p3lpVk=
 =/Ghs
 -----END PGP SIGNATURE-----
 --W/nzBZO5zC0uMSeA--