PostgreSQL 6.5.1 multi-byte (MB) support README	  July 11 1999

						Tatsuo Ishii
						t-ishii@sra.co.jp
		  http://www.sra.co.jp/people/t-ishii/PostgreSQL/

0. Introduction

The MB support is intended to allow PostgreSQL to handle
multi-byte character sets such as EUC (Extended Unix Code), Unicode and
Mule internal code. With MB enabled you can use multi-byte
character sets in regexp, LIKE and some functions. The default
encoding system is determined when you initialize your
PostgreSQL installation using initdb(1). Note that this can be
overridden when you create a database using createdb(1) or the CREATE
DATABASE SQL command, so you can have multiple databases with
different encoding systems.

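To see why string functions have to be multi-byte aware, here is a minimal
sketch in Python (not PostgreSQL code). It compares the number of characters
with the number of bytes for one string in several encodings, which is the
distinction between character_length() and octet_length(). The Python codec
names are assumed to correspond to EUC_JP, UNICODE (UTF-8) and SJIS, and the
helper name "lengths" is made up for this illustration.

```python
# Illustration only: Python codecs stand in for the PostgreSQL encodings.
text = "日本語"  # three Japanese characters


def lengths(s, codec):
    """Return (number of characters, number of bytes) of s in the given codec."""
    return len(s), len(s.encode(codec))


for codec in ("euc_jp", "utf-8", "shift_jis"):
    chars, octets = lengths(text, codec)
    print(f"{codec}: {chars} characters, {octets} bytes")
```

A function that counts bytes instead of characters would report a different
length for the same string depending on the database encoding.
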
MB also fixes some problems concerning 8-bit single byte
character sets including ISO8859. (I would not say all of the problems
have been fixed. I just confirmed that the regression test ran fine
and a few French characters could be used with the patch. Please let
me know if you find any problem while using 8-bit characters.)

1. How to use

Run configure with the mb option:

	% configure --with-mb=encoding_system

where encoding_system is one of:

	SQL_ASCII		ASCII
	EUC_JP			Japanese EUC
	EUC_CN			Chinese EUC
	EUC_KR			Korean EUC
	EUC_TW			Taiwan EUC
	UNICODE			Unicode(UTF-8)
	MULE_INTERNAL		Mule internal
	LATIN1			ISO 8859-1 English and some European languages
	LATIN2			ISO 8859-2 English and some European languages
	LATIN3			ISO 8859-3 English and some European languages
	LATIN4			ISO 8859-4 English and some European languages
	LATIN5			ISO 8859-5 English and some European languages
	KOI8			KOI8-R
	WIN			Windows CP1251
	ALT			Windows CP866

Example:

	% configure --with-mb=EUC_JP

If MB is disabled, nothing is changed except for better support of
8-bit single byte character sets.

2. How to set encoding

The initdb command defines the default encoding for a PostgreSQL
installation. For example:

	% initdb -e EUC_JP

sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
Note that you can use "-pgencoding" instead of "-e" if you like a longer
option string :-) If no -e or -pgencoding option is given, the encoding
specified at compile time is used.

You can create a database with a different encoding.

	% createdb -E EUC_KR korean

will create a database named "korean" with EUC_KR encoding. Another
way to accomplish this is to use an SQL command:

	CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The encoding for a database is stored in the "encoding" column of the
pg_database system catalog.

	datname      |datdba|encoding|datpath
	-------------+------+--------+-------------
	template1    |  1739|       1|template1
	postgres     |  1739|       0|postgres
	euc_jp       |  1739|       1|euc_jp
	euc_kr       |  1739|       3|euc_kr
	euc_cn       |  1739|       2|euc_cn
	unicode      |  1739|       5|unicode
	mule_internal|  1739|       6|mule_internal

The number in the encoding column is an "encoding id" and can be translated
to the encoding name using the pg_encoding command.

	$ pg_encoding 1
	EUC_JP

If the argument to pg_encoding is not a number, it is regarded as
an encoding name and pg_encoding will return the encoding id.

	$ pg_encoding EUC_JP
	1

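The two-way lookup done by pg_encoding can be sketched as follows. This is a
Python illustration, not the actual implementation. The table holds only the
ids that can be read off the pg_database listing, under the assumption that
each example database there is named after its own encoding; it is not a
complete list, and the function name pg_encoding_lookup is hypothetical.

```python
# Ids read from the pg_database listing, assuming each example database
# is named after its encoding. Deliberately incomplete.
ENCODING_NAMES = {
    1: "EUC_JP",
    2: "EUC_CN",
    3: "EUC_KR",
    5: "UNICODE",
    6: "MULE_INTERNAL",
}
ENCODING_IDS = {name: enc_id for enc_id, name in ENCODING_NAMES.items()}


def pg_encoding_lookup(arg):
    """Hypothetical helper: a numeric argument maps to a name,
    any other argument is treated as a name and maps to an id."""
    if arg.isdigit():
        return ENCODING_NAMES[int(arg)]
    return str(ENCODING_IDS[arg])


print(pg_encoding_lookup("1"))
print(pg_encoding_lookup("EUC_JP"))
```
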
3. PGCLIENTENCODING

If the environment variable PGCLIENTENCODING is defined on the
frontend, automatic encoding translation is done by the backend. For
example, if the backend has been compiled with MB=EUC_JP and
PGCLIENTENCODING=SJIS (Shift JIS: yet another Japanese encoding
system), then any SJIS strings coming from the frontend are
translated to EUC_JP before going into the parser. Output from the
backend is translated back to SJIS, of course.

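The round trip described above can be imitated outside the server. The
following is a minimal sketch in Python, assuming the codecs "shift_jis" and
"euc_jp" match SJIS and EUC_JP; the function name "translate" is made up.
Python converts through Unicode internally, which need not be how the backend
does it.

```python
def translate(data, from_codec, to_codec):
    """Re-encode bytes from one character encoding to another."""
    return data.decode(from_codec).encode(to_codec)


# The frontend sends its query in SJIS.
query_from_client = "SELECT '日本語'".encode("shift_jis")

# The backend sees EUC_JP before the parser runs.
query_for_parser = translate(query_from_client, "shift_jis", "euc_jp")

# Results travel back to the frontend in SJIS.
result_for_client = translate(query_for_parser, "euc_jp", "shift_jis")

print(query_from_client.hex())
print(query_for_parser.hex())
```

The ASCII part of the query is identical in both encodings; only the bytes of
the Japanese characters differ.
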
Supported encodings for PGCLIENTENCODING are:

	SQL_ASCII		ASCII
	EUC_JP			Japanese EUC
	SJIS			Yet another Japanese encoding
	EUC_CN			Chinese EUC
	EUC_KR			Korean EUC
	EUC_TW			Taiwan EUC
	BIG5			Traditional Chinese
	MULE_INTERNAL		Mule internal
	LATIN1			ISO 8859-1 English and some European languages
	LATIN2			ISO 8859-2 English and some European languages
	LATIN3			ISO 8859-3 English and some European languages
	LATIN4			ISO 8859-4 English and some European languages
	LATIN5			ISO 8859-5 English and some European languages
	KOI8			KOI8-R
	WIN			Windows CP1251
	ALT			Windows CP866
	WIN1250			Windows CP1250 (Czech)

Note that UNICODE is not supported (yet). Also note that the
translation is not always possible. Suppose you choose EUC_JP for the
backend and LATIN1 for the frontend; then some Japanese characters cannot
be translated into Latin. In this case, a letter that cannot be represented
in the Latin character set is transformed as:

	(HEXA DECIMAL)

3. SET CLIENT_ENCODING TO command

The frontend side encoding is actually set by a
new command:

	SET CLIENT_ENCODING TO 'encoding';

where encoding is one of the encodings that can be set in
PGCLIENTENCODING. You can also use the SQL92 syntax "SET NAMES" for this
purpose:

	SET NAMES 'encoding';

To query the current frontend encoding:

	SHOW CLIENT_ENCODING;

To return to the default encoding:

	RESET CLIENT_ENCODING;

This resets the frontend encoding to the same as the backend
encoding, so no encoding translation is performed.

4. References

These are good sources to start learning about various kinds of encoding
systems.

ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
	Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
	appear in section 3.2.

Unicode: http://www.unicode.org/
	The homepage of UNICODE.

	RFC 2044
	UTF-8 is defined here.

5. History

July 11, 1999
	* Add support for WIN1250 (Windows Czech) as a client encoding
	  (contributed by Pavel Behal)
	* fix some compiler warnings (contributed by Tomoaki Nishiyama)

Mar 23, 1999
	* Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
	  (thanks Oleg Broytmann for testing)
	* Fix problem with MB and locale

Jan 26, 1999
	* Add support for Big5 for frontend encoding
	  (you need to create a database with EUC_TW to use Big5)
	* Add regression test case for EUC_TW
	  (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)

Dec 15, 1998
	* Bugs related to SQL_ASCII support fixed

Nov 5, 1998
	* 6.4 release. In this version, pg_database has "encoding"
	  column that represents the database encoding

Jul 22, 1998
	* determine encoding at initdb/createdb rather than compile time
	* support for PGCLIENTENCODING when issuing COPY command
	* support for SQL92 syntax "SET NAMES"
	* support for LATIN2-5
	* add UNICODE regression test case
	* new test suite for MB
	* clean up source files

Jun 5, 1998
	* add support for the encoding translation between the backend
	  and the frontend
	* new command SET CLIENT_ENCODING etc. added
	* add support for LATIN1 character set
	* enhance 8 bit cleanness

April 21, 1998 some enhancements/fixes
	* character_length(), position(), substring() are now aware of
	  multi-byte characters
	* add octet_length()
	* add --with-mb option to configure
	* new regression tests for EUC_KR
	  (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
	* add some test cases to the EUC_JP regression test
	* fix problem in regress/regress.sh in case of System V
	* fix toupper(), tolower() to handle 8bit chars

Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1

Mar 10, 1998 PL2 released
	* add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
	* add an English document (this file)
	* fix problems concerning 8-bit single byte characters

Mar 1, 1998 PL1 released

Appendix:

[Here is a good document from Pavel Behal explaining how to use WIN1250 on
Windows/ODBC. Please note that Installation step 1)
is not necessary in 6.5.1 -- Tatsuo]

Version: 0.91 for PgSQL 6.5
Author: Pavel Behal
Revised by: Tatsuo Ishii
Email: behal@opf.slu.cz
Licence: The Same as PostgreSQL

Sorry for my English and C code, I'm not native :-)

!!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Installation:
-------------
1) Change three affected files in source directories
    (I don't have time to create proper patch diffs, I don't know how)
2) Compile with locale enabled and multibyte set to LATIN2
3) Set up your installation properly; do not forget to create locale
   variables in your profile (environment). Ex. (may not be exactly true):
	LC_ALL=cs_CZ.ISO8859-2
	LC_COLLATE=cs_CZ.ISO8859-2
	LC_CTYPE=cs_CZ.ISO8859-2
	LC_MONETARY=cs_CZ.ISO8859-2
	LC_NUMERIC=cs_CZ.ISO8859-2
	LC_TIME=cs_CZ.ISO8859-2
4) You have to start the postmaster with locales set!
5) Try it with Czech language, it has to sort correctly
6) Install ODBC driver for PgSQL into your M$ Windows
7) Set up your data source properly. Include this line in your ODBC
   configuration dialog in field "Connect Settings:" :
	SET CLIENT_ENCODING = 'WIN1250';
8) Now try it again, but in Windows with ODBC.

Description:
------------
- Depends on proper system locales, tested with RH6.0 and Slackware 3.6,
  with cs_CZ.iso8859-2 locale
- Never try to set up server multibyte database encoding to WIN1250,
  always use LATIN2 instead. There is no WIN1250 locale in Unix
- WIN1250 encoding is usable only for M$W ODBC clients. The characters are
  re-coded on the fly, to be displayed and stored back properly

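The on-the-fly re-coding between a WIN1250 client and a LATIN2 database can
be imitated with a short Python sketch. It assumes the Python codecs "cp1250"
and "iso8859_2" correspond to WIN1250 and LATIN2; the function name "recode"
is made up, and this is not the server's conversion code.

```python
def recode(data, from_codec, to_codec):
    """Re-encode bytes from one single-byte encoding to another."""
    return data.decode(from_codec).encode(to_codec)


word = "žluťoučký"                    # Czech sample text
from_windows = word.encode("cp1250")  # what an ODBC client would send

# Stored in the database as LATIN2 ...
stored = recode(from_windows, "cp1250", "iso8859_2")

# ... and re-coded back to WIN1250 for display on the client.
back = recode(stored, "iso8859_2", "cp1250")

print(from_windows.hex())
print(stored.hex())
```

Some Czech letters have different byte values in the two encodings, which is
why data written by a Windows client looks wrong if no translation is done.
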
Important:
----------
- it reorders your sort order depending on your LC_... setting, so don't be
  confused by regression tests, they don't use locale
- "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
- you have to insert money as '162,50' (with comma in apostrophes!)
- not tested properly