| PostgreSQL 7.0 multi-byte (MB) support README	  Mar 22 2000
 | |
| 
 | |
| 						Tatsuo Ishii
 | |
| 						ishii@postgresql.org
 | |
| 		  http://www.sra.co.jp/people/t-ishii/PostgreSQL/
 | |
| 
 | |
| 0. Introduction
 | |
| 
 | |
| MB support allows PostgreSQL to handle multi-byte character sets such
 | |
| as EUC (Extended Unix Code), Unicode and Mule internal code. With MB
 | |
| enabled you can use multi-byte character sets in regular expressions,
 | |
| LIKE and some other functions. The default encoding is determined
 | |
| when you initialize your PostgreSQL installation using initdb(1). Note
 | |
| that this can be overridden when you create a database using
 | |
| createdb(1) or the CREATE DATABASE SQL command, so you can have
 | |
| multiple databases, each with a different encoding.
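
To illustrate what multi-byte awareness means for functions that count or split characters, here is a minimal C sketch that counts characters rather than bytes in an EUC_JP string. It is not taken from the PostgreSQL source: the names euc_jp_mblen and euc_jp_char_length are hypothetical, and the sketch assumes a NUL-terminated string that follows the usual EUC_JP lead-byte rules.

```c
#include <stddef.h>

/* Byte length of the EUC_JP character that starts at s (hypothetical helper). */
int euc_jp_mblen(const unsigned char *s)
{
    if (*s == 0x8e)
        return 2;               /* SS2: half-width katakana */
    if (*s == 0x8f)
        return 3;               /* SS3: JIS X 0212 */
    if (*s & 0x80)
        return 2;               /* JIS X 0208 */
    return 1;                   /* ASCII */
}

/* Number of characters (not bytes) in a NUL-terminated EUC_JP string. */
size_t euc_jp_char_length(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    size_t n = 0;

    while (*s)
    {
        int len = euc_jp_mblen(s);

        /* Advance over the character; a truncated trailing character
         * still counts as one and never reads past the NUL. */
        while (len-- > 0 && *s)
            s++;
        n++;
    }
    return n;
}
```

For the four bytes C6 FC CB DC (two kanji) followed by "abc", strlen() reports 7 bytes while euc_jp_char_length() reports 5 characters.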
 | |
| 
 | |
| MB also fixes some problems with 8-bit single-byte character sets,
 | |
| including ISO 8859. (I would not say that all problems have been
 | |
| fixed. I have only confirmed that the regression tests run fine and
 | |
| that a few French characters can be used with the patch. Please let me
 | |
| know if you find any problems while using 8-bit characters.)
 | |
| 
 | |
| 1. How to use
 | |
| 
 | |
| Run configure with the multibyte option:
 | |
| 
 | |
| 	% ./configure --enable-multibyte[=encoding_system]
 | |
| 
 | |
| where the encoding_system is one of:
 | |
| 
 | |
| 	SQL_ASCII		ASCII
 | |
| 	EUC_JP			Japanese EUC
 | |
| 	EUC_CN			Chinese EUC
 | |
| 	EUC_KR			Korean EUC
 | |
| 	EUC_TW			Taiwan EUC
 | |
| 	UNICODE			Unicode(UTF-8)
 | |
| 	MULE_INTERNAL		Mule internal
 | |
| 	LATIN1			ISO 8859-1 English and some European languages
 | |
| 	LATIN2			ISO 8859-2 English and some European languages
 | |
| 	LATIN3			ISO 8859-3 English and some European languages
 | |
| 	LATIN4			ISO 8859-4 English and some European languages
 | |
| 	LATIN5			ISO 8859-5 English and some European languages
 | |
| 	KOI8			KOI8-R
 | |
| 	WIN			Windows CP1251
 | |
| 	ALT			Windows CP866
 | |
| 
 | |
| Example:
 | |
| 
 | |
| 	% ./configure --enable-multibyte=EUC_JP
 | |
| 
 | |
| If the encoding system is omitted (./configure --enable-multibyte),
 | |
| SQL_ASCII is assumed.
 | |
| 
 | |
| 2. How to set the encoding
 | |
| 
 | |
| The initdb command defines the default encoding for a PostgreSQL
 | |
| installation. For example:
 | |
| 
 | |
| 	% initdb -E EUC_JP
 | |
| 
 | |
| sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
 | |
| Note that you can use "--encoding" instead of "-E" if you prefer the
 | |
| longer option string :-) If no -E or --encoding option is given, the
 | |
| encoding specified at compile time is used.
 | |
| 
 | |
| You can create a database with a different encoding. For example:
 | |
| 
 | |
| 	% createdb -E EUC_KR korean
 | |
| 
 | |
| creates a database named "korean" with the EUC_KR encoding. Another
 | |
| way to accomplish this is to use a SQL command:
 | |
| 
 | |
| 	CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
 | |
| 
 | |
| The encoding of a database is stored in the "encoding" column of the
 | |
| pg_database system catalog. You can see it by using psql -l or the \l
 | |
| command in psql.
 | |
| 
 | |
| $ psql -l
 | |
|             List of databases
 | |
|    Database    |  Owner  |   Encoding    
 | |
| ---------------+---------+---------------
 | |
|  euc_cn        | t-ishii | EUC_CN
 | |
|  euc_jp        | t-ishii | EUC_JP
 | |
|  euc_kr        | t-ishii | EUC_KR
 | |
|  euc_tw        | t-ishii | EUC_TW
 | |
|  mule_internal | t-ishii | MULE_INTERNAL
 | |
|  regression    | t-ishii | SQL_ASCII
 | |
|  template1     | t-ishii | EUC_JP
 | |
|  test          | t-ishii | EUC_JP
 | |
|  unicode       | t-ishii | UNICODE
 | |
| (9 rows)
 | |
| 
 | |
| 3. Automatic encoding translation between backend and frontend
 | |
| 
 | |
| PostgreSQL supports an automatic encoding translation between backend
 | |
| and frontend for some encodings.
 | |
| 
 | |
|   encoding of backend			available encoding of frontend
 | |
|   --------------------------------------------------------------------
 | |
| 	EUC_JP				EUC_JP, SJIS
 | |
|   
 | |
| 	EUC_TW				EUC_TW, BIG5
 | |
|   
 | |
|   	LATIN2				LATIN2, WIN1250
 | |
|   
 | |
| 	LATIN5				LATIN5, WIN, ALT
 | |
|   
 | |
| 	MULE_INTERNAL			EUC_JP, SJIS, EUC_KR, EUC_CN, 
 | |
| 					EUC_TW, BIG5, LATIN1 to LATIN5, 
 | |
| 					WIN, ALT, WIN1250
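
The table above can be expressed as data. The following C sketch checks whether a backend/frontend pair appears in it. It only illustrates the table: the names translation_row, translation_table and translation_supported are hypothetical and are not part of PostgreSQL, and the backend's own lookup is not shown in this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row of the table: a backend encoding and its NULL-terminated
 * list of frontend encodings (hypothetical structure). */
struct translation_row
{
    const char *backend;
    const char *frontends[16];
};

static const struct translation_row translation_table[] = {
    {"EUC_JP", {"EUC_JP", "SJIS", NULL}},
    {"EUC_TW", {"EUC_TW", "BIG5", NULL}},
    {"LATIN2", {"LATIN2", "WIN1250", NULL}},
    {"LATIN5", {"LATIN5", "WIN", "ALT", NULL}},
    {"MULE_INTERNAL", {"EUC_JP", "SJIS", "EUC_KR", "EUC_CN",
                       "EUC_TW", "BIG5", "LATIN1", "LATIN2",
                       "LATIN3", "LATIN4", "LATIN5",
                       "WIN", "ALT", "WIN1250", NULL}},
};

/* Returns true if the pair is listed in the table above. */
bool translation_supported(const char *backend, const char *frontend)
{
    size_t nrows = sizeof translation_table / sizeof translation_table[0];

    for (size_t i = 0; i < nrows; i++)
    {
        if (strcmp(translation_table[i].backend, backend) != 0)
            continue;
        for (size_t j = 0; translation_table[i].frontends[j] != NULL; j++)
        {
            if (strcmp(translation_table[i].frontends[j], frontend) == 0)
                return true;
        }
    }
    return false;
}
```

Note that the sketch answers only for the rows listed in the table; backends that are not listed there (for example UNICODE) always yield false.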
 | |
| 
 | |
| To enable the automatic encoding translation, you have to tell
 | |
| PostgreSQL the encoding you would like to use in the frontend. There
 | |
| are several ways to accomplish this.
 | |
| 
 | |
| o using the \encoding command in psql
 | |
| 
 | |
| \encoding allows you to change the frontend encoding on the fly. For
 | |
| example, to change the encoding to SJIS, type:
 | |
| 
 | |
| 	\encoding SJIS
 | |
| 
 | |
| o using libpq functions
 | |
| 
 | |
| \encoding actually calls PQsetClientEncoding() to do its work.
 | |
| 
 | |
|   int PQsetClientEncoding(PGconn *conn, const char *encoding)
 | |
| 
 | |
| conn is a connection to the backend, and encoding is the encoding you
 | |
| want to use. If the function successfully sets the encoding, it returns
 | |
| 0, otherwise -1. The current encoding for this connection can be
 | |
| obtained by using:
 | |
| 
 | |
|   int PQclientEncoding(const PGconn *conn)
 | |
| 
 | |
| Note that it returns the "encoding id," not the encoding symbol string
 | |
| such as "EUC_JP." To convert an encoding id to an encoding symbol, you
 | |
| can use:
 | |
| 
 | |
|   char *pg_encoding_to_char(int encoding_id)
 | |
| 
 | |
| o using PGCLIENTENCODING
 | |
| 
 | |
| If the environment variable PGCLIENTENCODING is defined in the
 | |
| frontend's environment, automatic encoding translation is done by the backend.
 | |
| 
 | |
| o using the SET CLIENT_ENCODING TO command
 | |
| 
 | |
| The frontend encoding can also be set with a SQL command:
 | |
| 
 | |
| 	SET CLIENT_ENCODING TO 'encoding';
 | |
| 
 | |
| You can also use the SQL92 syntax "SET NAMES" for this purpose:
 | |
| 
 | |
| 	SET NAMES 'encoding';
 | |
| 
 | |
| To query the current frontend encoding:
 | |
| 
 | |
| 	SHOW CLIENT_ENCODING;
 | |
| 
 | |
| To return to the default encoding:
 | |
| 
 | |
| 	RESET CLIENT_ENCODING;
 | |
| 
 | |
| 4. About Unicode
 | |
| 
 | |
| Automatic encoding translation between Unicode and any other
 | |
| encoding is not supported (yet).
 | |
| 
 | |
| 5. What happens if the translation is not possible?
 | |
| 
 | |
| Suppose you choose EUC_JP for the backend and LATIN1 for the frontend.
 | |
| Then some Japanese characters cannot be translated into LATIN1. In
 | |
| this case, a character that cannot be represented in the LATIN1
 | |
| character set is transformed as:
 | |
| 
 | |
| 	(HEXA DECIMAL)
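
As an illustration of this fallback form, the following C sketch writes the bytes of one untranslatable character as hexadecimal digits in parentheses. The function name format_untranslatable is hypothetical, and the use of lower-case digits with no separators is an assumption; the README states only that the character appears as hexadecimal in parentheses.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the len bytes of one character as "(hex digits)" into out.
 * Returns the number of characters written, not counting the
 * terminating NUL, or 0 if out is too small. */
size_t format_untranslatable(const unsigned char *ch, size_t len,
                             char *out, size_t outsize)
{
    size_t needed = 2 * len + 2;    /* '(' + two digits per byte + ')' */
    char *p = out;

    if (outsize < needed + 1)
        return 0;

    *p++ = '(';
    for (size_t i = 0; i < len; i++)
    {
        /* Two hex digits plus a NUL that the next write overwrites. */
        snprintf(p, 3, "%02x", (unsigned int) ch[i]);
        p += 2;
    }
    *p++ = ')';
    *p = '\0';
    return needed;
}
```

Under these assumptions, the two-byte EUC_JP character B4 C1 would be rendered as the six characters (b4c1).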
 | |
| 
 | |
| 6. References
 | |
| 
 | |
| These are good sources to start learning about the various kinds of
 | |
| encoding systems.
 | |
| 
 | |
| ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
 | |
| 	Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
 | |
| 	appear in section 3.2.
 | |
| 
 | |
| Unicode: http://www.unicode.org/
 | |
| 	The homepage of UNICODE.
 | |
| 
 | |
| RFC 2044
 | |
| 	UTF-8 is defined here.
 | |
| 
 | |
| 7. History
 | |
| 
 | |
| May 20, 2000
 | |
| 	* SJIS UDC (NEC selection IBM kanji) support contributed
 | |
| 	  by Eiji Tokuya
 | |
| 	* Changes above will appear in 7.0.1
 | |
| 
 | |
| Mar 22, 2000
 | |
| 	* Add new libpq functions PQsetClientEncoding, PQclientEncoding
 | |
| 	* ./configure --with-mb=EUC_JP
 | |
| 	  now deprecated. use 
 | |
| 	  ./configure --enable-multibyte=EUC_JP
 | |
| 	  instead
 | |
|   	* Add SQL_ASCII regression test case
 | |
| 	* Add SJIS User Defined Character (UDC) support
 | |
| 	* All of above will appear in 7.0
 | |
| 
 | |
| July 11, 1999
 | |
| 	* Add support for WIN1250 (Windows Czech) as a client encoding
 | |
| 	  (contributed by Pavel Behal)
 | |
| 	* fix some compiler warnings (contributed by Tomoaki Nishiyama)
 | |
| 
 | |
| Mar 23, 1999
 | |
| 	* Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
 | |
| 	  (thanks Oleg Broytmann for testing)
 | |
| 	* Fix problem with MB and locale
 | |
| 
 | |
| Jan 26, 1999
 | |
| 	* Add support for Big5 for frontend encoding
 | |
| 	  (you need to create a database with EUC_TW to use Big5)
 | |
| 	* Add regression test case for EUC_TW
 | |
| 	  (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)
 | |
| 
 | |
| Dec 15, 1998
 | |
| 	* Bugs related to SQL_ASCII support fixed
 | |
| 
 | |
| Nov 5, 1998
 | |
| 	* 6.4 release. In this version, pg_database has "encoding"
 | |
| 	  column that represents the database encoding
 | |
| 
 | |
| Jul 22, 1998
 | |
| 	* determine encoding at initdb/createdb rather than compile time
 | |
| 	* support for PGCLIENTENCODING when issuing COPY command
 | |
| 	* support for SQL92 syntax "SET NAMES"
 | |
| 	* support for LATIN2-5
 | |
| 	* add UNICODE regression test case
 | |
| 	* new test suite for MB
 | |
| 	* clean up source files
 | |
| 
 | |
| Jun 5, 1998
 | |
| 	* add support for the encoding translation between the backend
 | |
| 	  and the frontend
 | |
| 	* new command SET CLIENT_ENCODING etc. added
 | |
| 	* add support for LATIN1 character set
 | |
| 	* enhance 8-bit cleanness
 | |
| 
 | |
| April 21, 1998 some enhancements/fixes
 | |
| 	* character_length(), position(), substring() are now aware of 
 | |
| 	  multi-byte characters
 | |
| 	* add octet_length()
 | |
| 	* add --with-mb option to configure
 | |
| 	* new regression tests for EUC_KR
 | |
|   	  (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
 | |
| 	* add some test cases to the EUC_JP regression test
 | |
| 	* fix problem in regress/regress.sh in case of System V
 | |
| 	* fix toupper(), tolower() to handle 8bit chars
 | |
| 
 | |
| Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
 | |
| 
 | |
| Mar 10, 1998 PL2 released
 | |
| 	* add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
 | |
| 	* add an English document (this file)
 | |
| 	* fix problems concerning 8-bit single byte characters
 | |
| 
 | |
| Mar 1, 1998 PL1 released
 | |
| 
 | |
| Appendix:
 | |
| 
 | |
| [Here is a good document from Pavel Behal explaining how to use
 | |
| WIN1250 on Windows/ODBC. Please note that Installation step 1) is not
 | |
| necessary in 6.5.1 -- Tatsuo]
 | |
| 
 | |
| Version: 0.91 for PgSQL 6.5
 | |
| Author: Pavel Behal
 | |
| Revised by: Tatsuo Ishii
 | |
| Email: behal@opf.slu.cz
 | |
| Licence: The Same as PostgreSQL
 | |
| 
 | |
| Sorry for my English and C code, I'm not native :-)
 | |
| 
 | |
| !!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 | |
| 
 | |
| Installation:
 | |
| -------------
 | |
| 1) Change three affected files in source directories 
 | |
|     (I don't have time to create proper patch diffs, I don't know how)
 | |
| 2) Compile with enabled locale and multibyte set to LATIN2
 | |
| 3) Set up your installation properly; do not forget to create locale
 | |
|    variables in your profile (environment). Ex. (may not be exactly true):
 | |
| 	LC_ALL=cs_CZ.ISO8859-2
 | |
| 	LC_COLLATE=cs_CZ.ISO8859-2
 | |
| 	LC_CTYPE=cs_CZ.ISO8859-2
 | |
| 	LC_MONETARY=cs_CZ.ISO8859-2
 | |
| 	LC_NUMERIC=cs_CZ.ISO8859-2
 | |
| 	LC_TIME=cs_CZ.ISO8859-2
 | |
| 4) You have to start the postmaster with locales set!
 | |
| 5) Try it with the Czech language; it has to sort correctly
 | |
| 6) Install the ODBC driver for PgSQL into your M$ Windows
 | |
| 7) Set up your data source properly. Include this line in your ODBC
 | |
|    configuration dialog in the field "Connect Settings:" :
 | |
| 	SET CLIENT_ENCODING = 'WIN1250';
 | |
| 8) Now try it again, but in Windows with ODBC.
 | |
| 
 | |
| Description:
 | |
| ------------
 | |
| - Depends on proper system locales; tested with RH6.0 and Slackware 3.6,
 | |
|   with the cs_CZ.iso8859-2 locale
 | |
| - Never try to set the server multibyte database encoding to WIN1250;
 | |
|   always use LATIN2 instead. There is no WIN1250 locale in Unix
 | |
| - WIN1250 encoding is usable only for M$W ODBC clients. The characters are
 | |
|   recoded on the fly, to be displayed and stored back properly
 | |
|  
 | |
| Important:
 | |
| ----------
 | |
| - it reorders your sort order depending on your LC_... setting, so don't be
 | |
|   confused by the regression tests, they don't use locale
 | |
| - "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
 | |
| - you have to insert money as '162,50' (with comma in apostrophes!)
 | |
| - not tested properly
 |