mirror of
				https://sourceware.org/git/glibc.git
				synced 2025-10-24 13:33:08 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			2896 lines
		
	
	
		
			122 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| @node Character Set Handling, Locales, String and Array Utilities, Top
 | |
| @c %MENU% Support for extended character sets
 | |
| @chapter Character Set Handling
 | |
| 
 | |
| @ifnottex
 | |
| @macro cal{text}
 | |
| \text\
 | |
| @end macro
 | |
| @end ifnottex
 | |
| 
 | |
| Character sets used in the early days of computing had only six, seven,
 | |
| or eight bits for each character: there was never a case where more than
 | |
| eight bits (one byte) were used to represent a single character.  The
 | |
| limitations of this approach became more apparent as more people
 | |
| grappled with non-Roman character sets, where not all the characters
 | |
| that make up a language's character set can be represented by @math{2^8}
 | |
| choices.  This chapter shows the functionality that was added to the C
 | |
| library to support multiple character sets.
 | |
| 
 | |
| @menu
 | |
| * Extended Char Intro::              Introduction to Extended Characters.
 | |
| * Charset Function Overview::        Overview about Character Handling
 | |
|                                       Functions.
 | |
| * Restartable multibyte conversion:: Restartable multibyte conversion
 | |
|                                       Functions.
 | |
| * Non-reentrant Conversion::         Non-reentrant Conversion Function.
 | |
| * Generic Charset Conversion::       Generic Charset Conversion.
 | |
| @end menu
 | |
| 
 | |
| 
 | |
| @node Extended Char Intro
 | |
| @section Introduction to Extended Characters
 | |
| 
 | |
| A variety of solutions is available to overcome the differences between
 | |
| character sets with a 1:1 relation between bytes and characters and
 | |
| character sets with ratios of 2:1 or 4:1.  The remainder of this
 | |
| section gives a few examples to help understand the design decisions
 | |
| made while developing the functionality of the @w{C library}.
 | |
| 
 | |
| @cindex internal representation
 | |
| A distinction we have to make right away is between internal and
 | |
| external representation.  @dfn{Internal representation} means the
 | |
| representation used by a program while keeping the text in memory.
 | |
| External representations are used when text is stored or transmitted
 | |
| through some communication channel.  Examples of external
 | |
| representations include files waiting in a directory to be
 | |
| read and parsed.
 | |
| 
 | |
| Traditionally there has been no difference between the two representations.
 | |
| It was equally comfortable and useful to use the same single-byte
 | |
| representation internally and externally.  This comfort level decreases
 | |
| with more and larger character sets.
 | |
| 
 | |
| One of the problems to overcome with the internal representation is
 | |
| handling text that is externally encoded using different character
 | |
| sets.  Assume a program that reads two texts and compares them using
 | |
| some metric.  The comparison can be usefully done only if the texts are
 | |
| internally kept in a common format.
 | |
| 
 | |
| @cindex wide character
 | |
| For such a common format (@math{=} character set) eight bits are certainly
 | |
| no longer enough.  So the smallest entity will have to grow: @dfn{wide
 | |
| characters} will now be used.  Instead of one byte per character, two or
 | |
| four will be used.  (Three are awkward to address in memory, and more
 | |
| than four bytes do not seem to be necessary.)
 | |
| 
 | |
| @cindex Unicode
 | |
| @cindex ISO 10646
 | |
| As shown in some other part of this manual,
 | |
| @c !!! Ahem, wide char string functions are not yet covered -- drepper
 | |
| a completely new family of functions has been created that can handle wide
 | |
| character texts in memory.  The most commonly used character sets for such
 | |
| internal wide character representations are Unicode and @w{ISO 10646}
 | |
| (also known as UCS for Universal Character Set).  Unicode was originally
 | |
| planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
 | |
| have a 31-bit code space.  The two standards are practically identical.
 | |
| They have the same character repertoire and code table, but Unicode specifies
 | |
| added semantics.  At the moment, only characters in the first @code{0x10000}
 | |
| code positions (the so-called Basic Multilingual Plane, BMP) have been
 | |
| assigned, but the assignment of more specialized characters outside this
 | |
| 16-bit space is already in progress.  A number of encodings have been
 | |
| defined for Unicode and @w{ISO 10646} characters:
 | |
| @cindex UCS-2
 | |
| @cindex UCS-4
 | |
| @cindex UTF-8
 | |
| @cindex UTF-16
 | |
| UCS-2 is a 16-bit word that can only represent characters
 | |
| from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
 | |
| and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
 | |
| ASCII characters are represented by ASCII bytes and non-ASCII characters
 | |
| by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
 | |
| of UCS-2 in which pairs of certain UCS-2 words can be used to encode
 | |
| non-BMP characters up to @code{0x10ffff}.
 | |
| 
 | |
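The surrogate mechanism of UTF-16 described above can be sketched in a few lines of C.  This is an illustration only: the function name @code{utf16_encode} is hypothetical and is not part of the C library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the code point CP as UTF-16 into OUT.  Return the number of
   16-bit units written (1 or 2), or 0 if CP cannot be encoded.
   Hypothetical helper for illustration; not a C library function.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp >= 0xd800 && cp <= 0xdfff)
    return 0;                   /* Reserved for the surrogates themselves.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* BMP character: same as UCS-2.  */
      return 1;
    }
  if (cp > 0x10ffff)
    return 0;                   /* Beyond the reach of UTF-16.  */

  cp -= 0x10000;                /* 20 significant bits remain.  */
  out[0] = (uint16_t) (0xd800 + (cp >> 10));    /* High surrogate.  */
  out[1] = (uint16_t) (0xdc00 + (cp & 0x3ff));  /* Low surrogate.  */
  return 2;
}
```

A BMP character occupies one unit, while a character outside the BMP is split into two units of ten payload bits each.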
| To represent wide characters the @code{char} type is not suitable.  For
 | |
| this reason the @w{ISO C} standard introduces a new type that is
 | |
| designed to keep one character of a wide character string.  To maintain
 | |
| the similarity there is also a type corresponding to @code{int} for
 | |
| those functions that take a single wide character.
 | |
| 
 | |
| @comment stddef.h
 | |
| @comment ISO
 | |
| @deftp {Data type} wchar_t
 | |
| This data type is used as the base type for wide character strings.
 | |
| In other words, arrays of objects of this type are the equivalent of
 | |
| @code{char[]} for multibyte character strings.  The type is defined in
 | |
| @file{stddef.h}.
 | |
| 
 | |
| The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
 | |
| say anything specific about the representation.  It only requires that
 | |
| this type is capable of storing all elements of the basic character set.
 | |
| Therefore it would be legitimate to define @code{wchar_t} as @code{char},
 | |
| which might make sense for embedded systems.
 | |
| 
 | |
| But for GNU systems @code{wchar_t} is always 32 bits wide and, therefore,
 | |
| capable of representing all UCS-4 values and, therefore, covering all of
 | |
| @w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type
 | |
| and thereby follow Unicode very strictly.  This definition is perfectly
 | |
| fine with the standard, but it also means that to represent all
 | |
| characters from Unicode and @w{ISO 10646} one has to use UTF-16 surrogate
 | |
| characters, which is in fact a multi-wide-character encoding.  But
 | |
| resorting to multi-wide-character encoding contradicts the purpose of the
 | |
| @code{wchar_t} type.
 | |
| @end deftp
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftp {Data type} wint_t
 | |
| @code{wint_t} is a data type used for parameters and variables that
 | |
| contain a single wide character.  As the name suggests this type is the
 | |
| equivalent of @code{int} when using the normal @code{char} strings.  The
 | |
| types @code{wchar_t} and @code{wint_t} often have the same
 | |
| representation if their size is 32 bits wide but if @code{wchar_t} is
 | |
| defined as @code{char} the type @code{wint_t} must be defined as
 | |
| @code{int} due to the parameter promotion.
 | |
| 
 | |
| @pindex wchar.h
 | |
| This type is defined in @file{wchar.h} and was introduced in
 | |
| @w{Amendment 1} to @w{ISO C90}.
 | |
| @end deftp
 | |
| 
 | |
| As with the @code{char} data type, macros are available for
 | |
| specifying the minimum and maximum value representable in an object of
 | |
| type @code{wchar_t}.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypevr Macro wint_t WCHAR_MIN
 | |
| The macro @code{WCHAR_MIN} evaluates to the minimum value representable
 | |
| by an object of type @code{wchar_t}.
 | |
| 
 | |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
 | |
| @end deftypevr
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypevr Macro wint_t WCHAR_MAX
 | |
| The macro @code{WCHAR_MAX} evaluates to the maximum value representable
 | |
| by an object of type @code{wchar_t}.
 | |
| 
 | |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90}.
 | |
| @end deftypevr
 | |
| 
 | |
| Another special wide character value is the equivalent to @code{EOF}.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypevr Macro wint_t WEOF
 | |
| The macro @code{WEOF} evaluates to a constant expression of type
 | |
| @code{wint_t} whose value is different from any member of the extended
 | |
| character set.
 | |
| 
 | |
| @code{WEOF} need not be the same value as @code{EOF} and unlike
 | |
| @code{EOF} it also need @emph{not} be negative.  In other words, sloppy
 | |
| code like
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   int c;
 | |
|   @dots{}
 | |
|   while ((c = getc (fp)) >= 0)
 | |
|     @dots{}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| has to be rewritten to use @code{WEOF} explicitly when wide characters
 | |
| are used:
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   wint_t c;
 | |
|   @dots{}
 | |
|   while ((c = getwc (fp)) != WEOF)
 | |
|     @dots{}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| @pindex wchar.h
 | |
| This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
 | |
| defined in @file{wchar.h}.
 | |
| @end deftypevr
 | |
| 
 | |
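A complete, minimal version of such a loop might look as follows.  The function name @code{count_wide_chars} is hypothetical.  The sketch does not distinguish between end of file and a read error; a real program would consult @code{ferror} after the loop.

```c
#include <stdio.h>
#include <wchar.h>

/* Count the wide characters that can be read from FP.
   Hypothetical helper for illustration.  */
long
count_wide_chars (FILE *fp)
{
  wint_t c;                     /* Not wchar_t: it must also hold WEOF.  */
  long count = 0;

  while ((c = getwc (fp)) != WEOF)
    ++count;
  return count;
}
```

The variable is declared as @code{wint_t} for the same reason that @code{int} is used with @code{getc}: it must be able to hold every character value plus the end-of-file marker.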
| 
 | |
| These internal representations present problems when it comes to storage
 | |
| and transmittal.  Because each wide character consists of more than one
 | |
| byte, it is affected by byte ordering.  Thus, machines with different
 | |
| endianness would see different values when accessing the same data.
 | |
| This byte-ordering concern also applies to communication protocols that
 | |
| are byte-based and therefore require the sender to decide how to split
 | |
| the wide character into bytes.  A last (but not least
 | |
| important) point is that wide characters often require more storage space
 | |
| than a customized byte-oriented character set.
 | |
| 
 | |
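One way to sidestep the byte-ordering problem is to fix the order explicitly when a wide character leaves the program.  The following sketch stores a 32-bit value most significant byte first, independent of the endianness of the host; the function names are assumptions made for this example.

```c
#include <stdint.h>

/* Store CP in OUT as four bytes, most significant byte first.
   Hypothetical helper for illustration.  */
void
ucs4_store_be (uint32_t cp, unsigned char out[4])
{
  out[0] = (unsigned char) (cp >> 24);
  out[1] = (unsigned char) (cp >> 16);
  out[2] = (unsigned char) (cp >> 8);
  out[3] = (unsigned char) cp;
}

/* Read back a value written by ucs4_store_be.  */
uint32_t
ucs4_load_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Because the bytes are produced by shifting rather than by copying the object representation, sender and receiver agree on the value even if their native byte orders differ.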
| @cindex multibyte character
 | |
| @cindex EBCDIC
 | |
| For all the above reasons, an external encoding that is different from
 | |
| the internal encoding is often used if the latter is UCS-2 or UCS-4.
 | |
| The external encoding is byte-based and can be chosen appropriately for
 | |
| the environment and for the texts to be handled.  A variety of different
 | |
| character sets can be used for this external encoding (information that
 | |
| will not be exhaustively presented here--instead, a description of the
 | |
| major groups will suffice).  All of the ASCII-based character sets
 | |
| fulfill one requirement: they are "filesystem safe."  This means that
 | |
| the character @code{'/'} is used in the encoding @emph{only} to
 | |
| represent itself.  Things are a bit different for character sets like
 | |
| EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
 | |
| family used by IBM), but if the operating system does not understand
 | |
| EBCDIC directly the parameters to system calls have to be converted
 | |
| first anyhow.
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| The simplest character sets are single-byte character sets.  There can
 | |
| be only up to 256 characters (for @w{8 bit} character sets), which is
 | |
| not sufficient to cover all languages but might be sufficient to handle
 | |
| a specific text.  Handling of @w{8 bit} character sets is simple.  This
 | |
| is not true for the other kinds presented later, and therefore the
 | |
| application one uses might require the use of @w{8 bit} character sets.
 | |
| 
 | |
| @cindex ISO 2022
 | |
| @item
 | |
| The @w{ISO 2022} standard defines a mechanism for extended character
 | |
| sets where one character @emph{can} be represented by more than one
 | |
| byte.  This is achieved by associating a state with the text.
 | |
| Characters that can be used to change the state can be embedded in the
 | |
| text.  Each byte in the text might have a different interpretation in each
 | |
| state.  The state might even influence whether a given byte stands for a
 | |
| character on its own or whether it has to be combined with some more
 | |
| bytes.
 | |
| 
 | |
| @cindex EUC
 | |
| @cindex Shift_JIS
 | |
| @cindex SJIS
 | |
| In most uses of @w{ISO 2022} the defined character sets do not allow
 | |
| state changes that cover more than the next character.  This has the
 | |
| big advantage that whenever one can identify the beginning of the byte
 | |
| sequence of a character one can interpret a text correctly.  Examples of
 | |
| character sets using this policy are the various EUC character sets
 | |
| (used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
 | |
| or Shift_JIS (SJIS, a Japanese encoding).
 | |
| 
 | |
| But there are also character sets using a state that is valid for more
 | |
| than one character and has to be changed by another byte sequence.
 | |
| Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
 | |
| 
 | |
| @item
 | |
| @cindex ISO 6937
 | |
| Early attempts to fix 8 bit character sets for other languages using the
 | |
| Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
 | |
| representing characters like the acute accent do not produce output
 | |
| themselves: one has to combine them with other characters to get the
 | |
| desired result.  For example, one writes the byte sequence @code{0xc2 0x61}
 | |
| (non-spacing acute accent, followed by lower-case `a') to get the ``small
 | |
| a with acute'' character.  To get the acute accent character on its own,
 | |
| one has to write @code{0xc2 0x20} (the non-spacing acute followed by a
 | |
| space).
 | |
| 
 | |
| Character sets like @w{ISO 6937} are used in some embedded systems such
 | |
| as teletex.
 | |
| 
 | |
| @item
 | |
| @cindex UTF-8
 | |
| Instead of converting the Unicode or @w{ISO 10646} text used internally,
 | |
| it is often also sufficient to simply use an encoding different than
 | |
| UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
 | |
| encoding: UTF-8.  This encoding is able to represent all of @w{ISO
 | |
| 10646} 31 bits in a byte string of length one to six.
 | |
| 
 | |
| @cindex UTF-7
 | |
| There were a few other attempts to encode @w{ISO 10646} such as UTF-7,
 | |
| but UTF-8 is today the only encoding that should be used.  In fact, with
 | |
| any luck UTF-8 will soon be the only external encoding that has to be
 | |
| supported.  It proves to be universally usable and its only disadvantage
 | |
| is that it favors Roman languages by making the byte string
 | |
| representation of other scripts (Cyrillic, Greek, Asian scripts) longer
 | |
| than necessary if using a specific character set for these scripts.
 | |
| Methods like the Unicode compression scheme can alleviate these
 | |
| problems.
 | |
| @end itemize
 | |
| 
 | |
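To make the idea of a stateful encoding concrete, here is a sketch of a decoder for a deliberately simplified toy encoding.  It is not an implementation of @w{ISO 2022} or of any real character set: the assumption is merely that byte @code{0x0e} switches to a mode where every character takes two bytes and byte @code{0x0f} switches back to single-byte mode.  The function name is hypothetical.

```c
#include <stddef.h>

/* Count the characters in the N bytes at S for a toy stateful
   encoding (an assumption for illustration, not a real character
   set): 0x0e selects two-byte mode, 0x0f selects single-byte mode.
   Return -1 if the input ends in the middle of a character.  */
long
toy_count_chars (const unsigned char *s, size_t n)
{
  int two_byte = 0;             /* The shift state; starts as initial.  */
  long count = 0;
  size_t i = 0;

  while (i < n)
    {
      if (s[i] == 0x0e)
        {
          two_byte = 1;         /* State change, no character produced.  */
          ++i;
        }
      else if (s[i] == 0x0f)
        {
          two_byte = 0;
          ++i;
        }
      else if (two_byte)
        {
          if (n - i < 2)
            return -1;          /* Truncated two-byte character.  */
          i += 2;
          ++count;
        }
      else
        {
          ++i;
          ++count;
        }
    }
  return count;
}
```

The same byte values are counted differently depending on the state, which is why a decoder for such an encoding cannot start in the middle of a text without knowing the state that applies there.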
| The question remaining is: how to select the character set or encoding
 | |
| to use.  The answer: you cannot decide about it yourself; it is decided
 | |
| by the developers of the system or the majority of the users.  Since the
 | |
| goal is interoperability one has to use whatever the other people one
 | |
| works with use.  If there are no constraints, the selection is based on
 | |
| the requirements the expected circle of users will have.  In other words,
 | |
| if a project is expected to be used in only, say, Russia it is fine to use
 | |
| KOI8-R or a similar character set.  But if at the same time people from,
 | |
| say, Greece are participating one should use a character set that allows
 | |
| all people to collaborate.
 | |
| 
 | |
| The most widely useful solution seems to be: go with the most general
 | |
| character set, namely @w{ISO 10646}.  Use UTF-8 as the external encoding
 | |
| and problems about users not being able to use their own language
 | |
| adequately are a thing of the past.
 | |
| 
 | |
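For readers who have not seen UTF-8 at the byte level, the following sketch encodes a single code point.  It is limited to the range up to @code{0x10ffff}, which needs at most four bytes; the five- and six-byte forms required for the full 31-bit code space are omitted.  The name @code{utf8_encode} is an assumption made for this example, and programs would normally rely on the conversion functions of the C library instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the code point CP as UTF-8 into OUT.  Return the number of
   bytes written (1 to 4), or 0 if CP is a surrogate or lies beyond
   0x10ffff.  Hypothetical helper for illustration.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;      /* ASCII stands for itself.  */
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;                       /* Surrogates are not characters.  */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}
```

Every byte of a multi-byte sequence has its high bit set, so no such byte can be mistaken for an ASCII character such as @code{'/'}.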
| One final comment about the choice of the wide character representation
 | |
| is necessary at this point.  We have said above that the natural choice
 | |
| is using Unicode or @w{ISO 10646}.  This is not required, but at least
 | |
| encouraged, by the @w{ISO C} standard.  The standard defines at least a
 | |
| macro @code{__STDC_ISO_10646__} that is only defined on systems where
 | |
| the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this
 | |
| symbol is not defined one should avoid making assumptions about the wide
 | |
| character representation.  If the programmer uses only the functions
 | |
| provided by the C library to handle wide character strings there should
 | |
| be no compatibility problems with other systems.
 | |
| 
 | |
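A program that does need the numeric value of a wide character can guard that assumption with the macro.  The following is a minimal sketch; the function name @code{wchar_code_point} is hypothetical.

```c
#include <wchar.h>

/* Return the ISO 10646 code point of WC, or -1 if the implementation
   does not promise that wchar_t values are ISO 10646 code points.
   Hypothetical helper for illustration.  */
long
wchar_code_point (wchar_t wc)
{
#ifdef __STDC_ISO_10646__
  return (long) wc;
#else
  (void) wc;                    /* Representation unknown.  */
  return -1;
#endif
}
```

Code that only passes wide characters to the library functions does not need such a check at all.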
| @node Charset Function Overview
 | |
| @section Overview about Character Handling Functions
 | |
| 
 | |
| A Unix @w{C library} contains three different sets of functions in two
 | |
| families to handle character set conversion.  One of the function families
 | |
| (the most commonly used) is specified in the @w{ISO C90} standard and,
 | |
| therefore, is portable even beyond the Unix world.  Unfortunately this
 | |
| family is the least useful one.  These functions should be avoided
 | |
| whenever possible, especially when developing libraries (as opposed to
 | |
| applications).
 | |
| 
 | |
| The second family of functions was introduced in the early Unix standards
 | |
| (XPG2) and is still part of the latest and greatest Unix standard:
 | |
| @w{Unix 98}.  It is also the most powerful and useful set of functions.
 | |
| But we will start with the functions defined in @w{Amendment 1} to
 | |
| @w{ISO C90}.
 | |
| 
 | |
| @node Restartable multibyte conversion
 | |
| @section Restartable Multibyte Conversion Functions
 | |
| 
 | |
| The @w{ISO C} standard defines functions to convert strings from a
 | |
| multibyte representation to wide character strings.  There are a number
 | |
| of peculiarities:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| The character set assumed for the multibyte encoding is not specified
 | |
| as an argument to the functions.  Instead the character set specified by
 | |
| the @code{LC_CTYPE} category of the current locale is used; see
 | |
| @ref{Locale Categories}.
 | |
| 
 | |
| @item
 | |
| The functions handling more than one character at a time require NUL
 | |
| terminated strings as the argument (i.e., converting blocks of text
 | |
| does not work unless one can add a NUL byte at an appropriate place).
 | |
| The GNU C library contains some extensions to the standard that allow
 | |
| specifying a size, but basically they also expect terminated strings.
 | |
| @end itemize
 | |
| 
 | |
| Despite these limitations the @w{ISO C} functions can be used in many
 | |
| contexts.  In graphical user interfaces, for instance, it is not
 | |
| uncommon to have functions that require text to be displayed in a wide
 | |
| character string if the text is not simple ASCII.  The text itself might
 | |
| come from a file with translations and the user should decide about the
 | |
| current locale, which determines the translation and therefore also the
 | |
| external encoding used.  In such a situation (and many others) the
 | |
| functions described here are perfect.  If more freedom while performing
 | |
| the conversion is necessary take a look at the @code{iconv} functions
 | |
| (@pxref{Generic Charset Conversion}).
 | |
| 
 | |
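The graphical user interface scenario above could be served by a small helper like the following sketch, which converts a NUL-terminated multibyte string into a newly allocated wide character string.  The name @code{mb_to_wide} is an assumption made for this example.  The encoding of the input is whatever the current @code{LC_CTYPE} locale prescribes, so a program would typically call @code{setlocale (LC_CTYPE, "")} at startup before using it.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert the NUL-terminated multibyte string SRC, encoded according
   to the current LC_CTYPE locale, into a newly allocated wide
   character string.  Return NULL on invalid input or if memory is
   exhausted.  The caller frees the result.  Hypothetical helper.  */
wchar_t *
mb_to_wide (const char *src)
{
  mbstate_t state;
  const char *p = src;
  size_t len;
  wchar_t *dst;

  /* First pass: with a null destination only the length is computed.  */
  memset (&state, '\0', sizeof (state));
  len = mbsrtowcs (NULL, &p, 0, &state);
  if (len == (size_t) -1)
    return NULL;

  dst = malloc ((len + 1) * sizeof (wchar_t));
  if (dst == NULL)
    return NULL;

  /* Second pass: convert, including the terminating null character.  */
  memset (&state, '\0', sizeof (state));
  p = src;
  mbsrtowcs (dst, &p, len + 1, &state);
  return dst;
}
```

Note that the character set never appears as a parameter and that the input must be NUL-terminated, which are exactly the two peculiarities listed above.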
| @menu
 | |
| * Selecting the Conversion::     Selecting the conversion and its properties.
 | |
| * Keeping the state::            Representing the state of the conversion.
 | |
| * Converting a Character::       Converting Single Characters.
 | |
| * Converting Strings::           Converting Multibyte and Wide Character
 | |
|                                   Strings.
 | |
| * Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
 | |
| @end menu
 | |
| 
 | |
| @node Selecting the Conversion
 | |
| @subsection Selecting the conversion and its properties
 | |
| 
 | |
| We already said above that the currently selected locale for the
 | |
| @code{LC_CTYPE} category decides about the conversion that is performed
 | |
| by the functions we are about to describe.  Each locale uses its own
 | |
| character set (given as an argument to @code{localedef}) and this is the
 | |
| one assumed as the external multibyte encoding.  The wide character
 | |
| set is always UCS-4, at least on GNU systems.
 | |
| 
 | |
| A characteristic of each multibyte character set is the maximum number
 | |
| of bytes that can be necessary to represent one character.  This
 | |
| information is quite important when writing code that uses the
 | |
| conversion functions (as shown in the examples below).
 | |
| The @w{ISO C} standard defines two macros that provide this information.
 | |
| 
 | |
| 
 | |
| @comment limits.h
 | |
| @comment ISO
 | |
| @deftypevr Macro int MB_LEN_MAX
 | |
| @code{MB_LEN_MAX} specifies the maximum number of bytes in the multibyte
 | |
| sequence for a single character in any of the supported locales.  It is
 | |
| a compile-time constant and is defined in @file{limits.h}.
 | |
| @pindex limits.h
 | |
| @end deftypevr
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypevr Macro int MB_CUR_MAX
 | |
| @code{MB_CUR_MAX} expands into a positive integer expression that is the
 | |
| maximum number of bytes in a multibyte character in the current locale.
 | |
| The value is never greater than @code{MB_LEN_MAX}.  Unlike
 | |
| @code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
 | |
| the GNU C library it is not.
 | |
| 
 | |
| @pindex stdlib.h
 | |
| @code{MB_CUR_MAX} is defined in @file{stdlib.h}.
 | |
| @end deftypevr
 | |
| 
 | |
| Two different macros are necessary since strictly @w{ISO C90} compilers
 | |
| do not allow variable length array definitions, but still it is desirable
 | |
| to avoid dynamic allocation.  This incomplete piece of code shows the
 | |
| problem:
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   char buf[MB_LEN_MAX];
 | |
|   ssize_t len = 0;
 | |
| 
 | |
|   while (! feof (fp))
 | |
|     @{
 | |
|       fread (&buf[len], 1, MB_CUR_MAX - len, fp);
 | |
|       /* @r{@dots{} process} buf */
 | |
|       len -= used;
 | |
|     @}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| The code in the inner loop is expected to always have enough bytes in
 | |
| the array @var{buf} to convert one multibyte character.  The array
 | |
| @var{buf} has to be sized statically since many compilers do not allow a
 | |
| variable size.  The @code{fread} call makes sure that @code{MB_CUR_MAX}
 | |
| bytes are always available in @var{buf}.  Note that it isn't
 | |
| a problem if @code{MB_CUR_MAX} is not a compile-time constant.
 | |
| 
 | |
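A fuller version of this buffering pattern might look like the following sketch, which counts the multibyte characters in a stream.  The function name @code{count_mb_chars} is hypothetical, and the sketch assumes that a converted null character occupies exactly one byte, which holds for stateless encodings but is not guaranteed in general.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in FP according to the current
   locale.  Return -1 on invalid or truncated input.
   Hypothetical helper for illustration.  */
long
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];         /* Static size, enough for any locale.  */
  size_t len = 0;               /* Unprocessed bytes currently in buf.  */
  long count = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  for (;;)
    {
      size_t used;

      /* Refill so that up to MB_CUR_MAX bytes are available.  */
      if (len < MB_CUR_MAX)
        len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      if (len == 0)
        break;                  /* End of input (or read error).  */

      used = mbrtowc (NULL, buf, len, &state);
      if (used == (size_t) -1)
        return -1;              /* Invalid byte sequence.  */
      if (used == (size_t) -2)
        {
          /* All bytes were absorbed into STATE without completing
             a character; fetch more input.  */
          len = 0;
          continue;
        }
      if (used == 0)
        used = 1;               /* Assumption: NUL takes one byte.  */

      memmove (buf, buf + used, len - used);
      len -= used;
      ++count;
    }
  /* Input that stops inside a character leaves STATE non-initial.  */
  return mbsinit (&state) ? count : -1;
}
```

The array is dimensioned with the compile-time constant @code{MB_LEN_MAX}, while the amount of data requested from the stream is governed by the run-time value @code{MB_CUR_MAX}.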
| 
 | |
| @node Keeping the state
 | |
| @subsection Representing the state of the conversion
 | |
| 
 | |
| @cindex stateful
 | |
| In the introduction of this chapter it was said that certain character
 | |
| sets use a @dfn{stateful} encoding.  That is, the encoded values depend
 | |
| in some way on the previous bytes in the text.
 | |
| 
 | |
| Since the conversion functions allow converting a text in more than one
 | |
| step we must have a way to pass this information from one call of the
 | |
| functions to another.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftp {Data type} mbstate_t
 | |
| @cindex shift state
 | |
| A variable of type @code{mbstate_t} can contain all the information
 | |
| about the @dfn{shift state} needed from one call to a conversion
 | |
| function to another.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{mbstate_t} is defined in @file{wchar.h}.  It was introduced in
 | |
| @w{Amendment 1} to @w{ISO C90}.
 | |
| @end deftp
 | |
| 
 | |
| To use objects of type @code{mbstate_t} the programmer has to define such
 | |
| objects (normally as local variables on the stack) and pass a pointer to
 | |
| the object to the conversion functions.  This way the conversion function
 | |
| can update the object if the current multibyte character set is stateful.
 | |
| 
 | |
| There is no specific function or initializer to put the state object in
 | |
| any specific state.  The rules are that the object should always
 | |
| represent the initial state before the first use, and this is achieved by
 | |
| clearing the whole variable with code such as follows:
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   mbstate_t state;
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   /* @r{from now on @var{state} can be used.}  */
 | |
|   @dots{}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| When using the conversion functions to generate output it is often
 | |
| necessary to test whether the current state corresponds to the initial
 | |
| state.  This is necessary, for example, to decide whether to emit
 | |
| escape sequences to set the state to the initial state at certain
 | |
| sequence points.  Communication protocols often require this.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun int mbsinit (const mbstate_t *@var{ps})
 | |
| The @code{mbsinit} function determines whether the state object pointed
 | |
| to by @var{ps} is in the initial state.  If @var{ps} is a null pointer or
 | |
| the object is in the initial state the return value is nonzero.  Otherwise
 | |
| it is zero.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
 | |
| declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| Code using @code{mbsinit} often looks similar to this:
 | |
| 
 | |
| @c Fix the example to explicitly say how to generate the escape sequence
 | |
| @c to restore the initial state.
 | |
| @smallexample
 | |
| @{
 | |
|   mbstate_t state;
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   /* @r{Use @var{state}.}  */
 | |
|   @dots{}
 | |
|   if (! mbsinit (&state))
 | |
|     @{
 | |
|       /* @r{Emit code to return to initial state.}  */
 | |
|       const wchar_t empty[] = L"";
 | |
|       const wchar_t *srcp = empty;
 | |
|       wcsrtombs (outbuf, &srcp, outbuflen, &state);
 | |
|     @}
 | |
|   @dots{}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| The code to emit the escape sequence to get back to the initial state is
 | |
| interesting.  The @code{wcsrtombs} function can be used to determine the
 | |
| necessary output code (@pxref{Converting Strings}).  Please note that on
 | |
| GNU systems it is not necessary to perform this extra action for the
 | |
| conversion from multibyte text to wide character text since the wide
 | |
| character encoding is not stateful.  But nothing in any standard
 | |
| prohibits @code{wchar_t} from using a stateful
 | |
| encoding.
 | |
| 
 | |
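The interplay between a state object and @code{mbsinit} can be observed with a small sketch like the following, which feeds input to @code{mbrtowc} one byte at a time and then asks whether the conversion ended in the initial state.  The function name @code{ends_in_initial_state} is an assumption made for this example.

```c
#include <string.h>
#include <wchar.h>

/* Feed the N bytes at S to mbrtowc one byte at a time.  Return 1 if
   the conversion ends in the initial state, 0 if a character or
   shift sequence is left unfinished, and -1 on invalid input.
   Hypothetical helper for illustration.  */
int
ends_in_initial_state (const char *s, size_t n)
{
  mbstate_t state;
  size_t i;

  memset (&state, '\0', sizeof (state));    /* Initial state.  */
  for (i = 0; i < n; ++i)
    if (mbrtowc (NULL, &s[i], 1, &state) == (size_t) -1)
      return -1;
  return mbsinit (&state) != 0;
}
```

For plain ASCII input the result is 1.  In a locale with a multibyte encoding, input that stops in the middle of a character would be expected to yield 0, because the bytes consumed so far are remembered in the state object.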
| @node Converting a Character
 | |
| @subsection Converting Single Characters
 | |
| 
 | |
| The most fundamental of the conversion functions are those dealing with
 | |
| single characters.  Please note that this does not always mean single
 | |
| bytes.  But since there is very often a subset of the multibyte
 | |
| character set that consists of single byte sequences, there are
 | |
| functions to help with converting bytes.  Frequently, ASCII is a subpart
 | |
| of the multibyte character set.  In such a scenario, each ASCII character
 | |
| stands for itself, and all other characters have at least a first byte
 | |
| that is beyond the range @math{0} to @math{127}.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun wint_t btowc (int @var{c})
 | |
| The @code{btowc} function (``byte to wide character'') converts a valid
 | |
| single byte character @var{c} in the initial shift state into the wide
 | |
| character equivalent using the conversion rules from the currently
 | |
| selected locale of the @code{LC_CTYPE} category.
 | |
| 
 | |
| If @code{(unsigned char) @var{c}} is not a valid single-byte multibyte
 | |
| character or if @var{c} is @code{EOF}, the function returns @code{WEOF}.
 | |
| 
 | |
| Please note the restriction of @var{c} being tested for validity only in
 | |
| the initial shift state.  No @code{mbstate_t} object is used from
 | |
| which the state information is taken, and the function also does not use
 | |
| any static state.
 | |
| 
 | |
| @pindex wchar.h
 | |
| The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
 | |
| and is declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| Despite the limitation that the single byte value is always interpreted
 | |
| in the initial state, this function is actually useful most of the time.
 | |
| Most character sets are either entirely single-byte character sets or they
 | |
| are extensions of ASCII.  It is then possible to write code like this
 | |
| (not that this specific example is very useful):
 | |
| 
 | |
| @smallexample
 | |
| wchar_t *
 | |
| itow (unsigned long int val)
 | |
| @{
 | |
|   static wchar_t buf[30];
 | |
|   wchar_t *wcp = &buf[29];
 | |
|   *wcp = L'\0';
 | |
|   while (val != 0)
 | |
|     @{
 | |
|       *--wcp = btowc ('0' + val % 10);
 | |
|       val /= 10;
 | |
|     @}
 | |
|   if (wcp == &buf[29])
 | |
|     *--wcp = L'0';
 | |
|   return wcp;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| Why is it necessary to use such a complicated implementation and not
 | |
| simply cast @code{'0' + val % 10} to a wide character?  The answer is
 | |
| that there is no guarantee that one can perform this kind of arithmetic
 | |
| on the characters of the character set used for @code{wchar_t}
 | |
| representation.  In other situations the bytes are not constant at
 | |
| compile time and so the compiler cannot do the work.  In situations like
 | |
| this, using @code{btowc} is required.
 | |
| 
 | |
| @noindent
 | |
| There is also a function for the conversion in the other direction.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun int wctob (wint_t @var{c})
 | |
| The @code{wctob} function (``wide character to byte'') takes as the
 | |
| parameter a valid wide character.  If the multibyte representation for
 | |
| this character in the initial state is exactly one byte long, the return
 | |
| value of this function is this character.  Otherwise the return value is
 | |
| @code{EOF}.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{wctob} was introduced in @w{Amendment 1} to @w{ISO C90} and
 | |
| is declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
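A typical use of @code{wctob} is to decide whether a wide character string can be handled by byte-oriented code.  The following sketch shows one way to do that; both function names are hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Return nonzero if WC is represented by exactly one byte in the
   current locale.  Hypothetical helper for illustration.  */
int
is_single_byte (wint_t wc)
{
  return wctob (wc) != EOF;
}

/* Copy the wide string SRC to DST as bytes if every character has a
   single-byte representation.  Return 0 on success and -1 otherwise.
   DST must have room for wcslen (SRC) + 1 bytes.  */
int
narrow_if_single_byte (const wchar_t *src, char *dst)
{
  for (; *src != L'\0'; ++src)
    {
      int b = wctob (*src);
      if (b == EOF)
        return -1;              /* Needs more than one byte.  */
      *dst++ = (char) b;
    }
  *dst = '\0';
  return 0;
}
```

Like @code{btowc}, this only considers the initial shift state, so it is best suited to the common case of single-byte character sets and ASCII extensions.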
| There are more general functions to convert single characters from
 | |
| multibyte representation to wide characters and vice versa.  These
 | |
| functions pose no limit on the length of the multibyte representation
 | |
| and they also do not require it to be in the initial state.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
 | |
| @cindex stateful
 | |
| The @code{mbrtowc} function (``multibyte restartable to wide
 | |
| character'') converts the next multibyte character in the string pointed
 | |
| to by @var{s} into a wide character and stores it in the wide character
 | |
| string pointed to by @var{pwc}.  The conversion is performed according
 | |
| to the locale currently selected for the @code{LC_CTYPE} category.  If
 | |
| the conversion for the character set used in the locale requires a state,
 | |
| the multibyte string is interpreted in the state represented by the
 | |
| object pointed to by @var{ps}.  If @var{ps} is a null pointer, a static,
 | |
| internal state variable used only by the @code{mbrtowc} function is
 | |
| used.
 | |
| 
 | |
| If the next multibyte character corresponds to the NUL wide character,
 | |
| the return value of the function is @math{0} and the state object is
 | |
| afterwards in the initial state.  If the next @var{n} or fewer bytes
 | |
| form a correct multibyte character, the return value is the number of
 | |
| bytes starting from @var{s} that form the multibyte character.  The
 | |
| conversion state is updated according to the bytes consumed in the
 | |
| conversion.  In both cases the wide character (either the @code{L'\0'}
 | |
| or the one found in the conversion) is stored in the string pointed to
 | |
| by @var{pwc} if @var{pwc} is not null.
 | |
| 
 | |
| If the first @var{n} bytes of the multibyte string possibly form a valid
 | |
| multibyte character but there are more than @var{n} bytes needed to
 | |
| complete it, the return value of the function is @code{(size_t) -2} and
 | |
| no value is stored.  Please note that this can happen even if @var{n}
 | |
| has a value greater than or equal to @code{MB_CUR_MAX} since the input
 | |
| might contain redundant shift sequences.
 | |
| 
 | |
| If the first @var{n} bytes of the multibyte string cannot possibly form
 | |
| a valid multibyte character, no value is stored, the global variable
 | |
| @code{errno} is set to the value @code{EILSEQ}, and the function returns
 | |
| @code{(size_t) -1}.  The conversion state is afterwards undefined.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
 | |
| is declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
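The return values described above, and the role of the state object, can be illustrated by feeding the input to @code{mbrtowc} one byte at a time.  This is only a sketch: the name @code{decode_bytewise} is hypothetical, and the caller is assumed to supply an output array with room for @var{len} wide characters.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the LEN bytes at BUF, passing a single
   byte to mbrtowc in each call.  Returns the number of wide characters
   stored in OUT, or (size_t) -1 for an invalid sequence.  */
size_t
decode_bytewise (const char *buf, size_t len, wchar_t *out)
{
  mbstate_t state;
  size_t nout = 0;
  size_t i;

  assert (buf != NULL && out != NULL);
  memset (&state, '\0', sizeof (state));
  for (i = 0; i < len; ++i)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, buf + i, 1, &state);
      if (r == (size_t) -1)
        /* Invalid sequence; the state is now undefined.  */
        return (size_t) -1;
      if (r == (size_t) -2)
        /* Incomplete character: the byte was absorbed into STATE,
           so simply continue with the next byte.  */
        continue;
      /* R is 1, or 0 if the byte completed the NUL wide character.  */
      out[nout++] = wc;
    }
  return nout;
}
```

Passing the same state object to every call is what allows a character to be split across calls; with separate or reinitialized state objects the partial bytes would be lost.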
| Use of @code{mbrtowc} is straightforward.  A function that copies a
 | |
| multibyte string into a wide character string while at the same time
 | |
| converting all lowercase characters into uppercase could look like this
 | |
| (this is not the final version, just an example; it has no error
 | |
| checking, and sometimes leaks memory):
 | |
| 
 | |
| @smallexample
 | |
| wchar_t *
 | |
| mbstouwcs (const char *s)
 | |
| @{
 | |
|   /* @r{Include the terminating NUL byte in the count so that}
 | |
|      @r{@code{mbrtowc} gets to see the end of the string.}  */
 | |
|   size_t len = strlen (s) + 1;
 | |
|   wchar_t *result = malloc (len * sizeof (wchar_t));
 | |
|   wchar_t *wcp = result;
 | |
|   wchar_t tmp[1];
 | |
|   mbstate_t state;
 | |
|   size_t nbytes;
 | |
| 
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
 | |
|     @{
 | |
|       if (nbytes >= (size_t) -2)
 | |
|         /* Invalid input string.  */
 | |
|         return NULL;
 | |
|       *wcp++ = towupper (tmp[0]);
 | |
|       len -= nbytes;
 | |
|       s += nbytes;
 | |
|     @}
 | |
|   /* @r{The loop stops at the NUL byte; terminate the result.}  */
 | |
|   *wcp = L'\0';
 | |
|   return result;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| The use of @code{mbrtowc} should be clear.  A single wide character is
 | |
| stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
 | |
| in the variable @var{nbytes}.  If the conversion is successful, the
 | |
| uppercase variant of the wide character is stored in the @var{result}
 | |
| array, and the pointer to the input string and the number of available
 | |
| bytes are adjusted.
 | |
| 
 | |
| The only non-obvious thing about @code{mbstouwcs} might be the way memory
 | |
| is allocated for the result.  The above code uses the fact that there
 | |
| can never be more wide characters in the converted results than there are
 | |
| bytes in the multibyte input string.  This method yields a pessimistic
 | |
| guess about the size of the result, and if many wide character strings
 | |
| have to be constructed this way or if the strings are long, the extra
 | |
| memory required to be allocated because the input string contains
 | |
| multibyte characters might be significant.  The allocated memory block can
 | |
| be resized to the correct size before returning it, but a better solution
 | |
| might be to allocate just the right amount of space for the result right
 | |
| away.  Unfortunately there is no function to compute the length of the wide
 | |
| character string directly from the multibyte string.  There is, however, a
 | |
| function that does part of the work.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
 | |
| The @code{mbrlen} function (``multibyte restartable length'') computes
 | |
| the number of bytes, at most @var{n}, starting at @var{s} that form the
 | |
| next valid and complete multibyte character.
 | |
| 
 | |
| If the next multibyte character corresponds to the NUL wide character,
 | |
| the return value is @math{0}.  If the next @var{n} bytes form a valid
 | |
| multibyte character, the number of bytes belonging to this multibyte
 | |
| character byte sequence is returned.
 | |
| 
 | |
| If the first @var{n} bytes possibly form a valid multibyte
 | |
| character but the character is incomplete, the return value is
 | |
| @code{(size_t) -2}.  Otherwise the multibyte character sequence is invalid
 | |
| and the return value is @code{(size_t) -1}.
 | |
| 
 | |
| The multibyte sequence is interpreted in the state represented by the
 | |
| object pointed to by @var{ps}.  If @var{ps} is a null pointer, a state
 | |
| object local to @code{mbrlen} is used.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{mbrlen} was introduced in @w{Amendment 1} to @w{ISO C90} and
 | |
| is declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| The attentive reader now will note that @code{mbrlen} can be implemented
 | |
| as
 | |
| 
 | |
| @smallexample
 | |
| mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
 | |
| @end smallexample
 | |
| 
 | |
| This is true and in fact is mentioned in the official specification.
 | |
| How can this function be used to determine the length of the wide
 | |
| character string created from a multibyte character string?  It is not
 | |
| directly usable, but we can define a function @code{mbslen} using it:
 | |
| 
 | |
| @smallexample
 | |
| size_t
 | |
| mbslen (const char *s)
 | |
| @{
 | |
|   mbstate_t state;
 | |
|   size_t result = 0;
 | |
|   size_t nbytes;
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
 | |
|     @{
 | |
|       if (nbytes >= (size_t) -2)
 | |
|         /* @r{Something is wrong.}  */
 | |
|         return (size_t) -1;
 | |
|       s += nbytes;
 | |
|       ++result;
 | |
|     @}
 | |
|   return result;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| This function simply calls @code{mbrlen} for each multibyte character
 | |
| in the string and counts the number of function calls.  Please note that
 | |
| we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
 | |
| call.  This is acceptable since a) this value is at least as large as the length of
 | |
| the longest multibyte character sequence and b) we know that the string
 | |
| @var{s} ends with a NUL byte, which cannot be part of any other multibyte
 | |
| character sequence but the one representing the NUL wide character.
 | |
| Therefore, the @code{mbrlen} function will never read invalid memory.
 | |
| 
 | |
| Now that this function is available (just to make this clear, this
 | |
| function is @emph{not} part of the GNU C library) we can compute the
 | |
| number of wide characters required to store the converted multibyte
 | |
| character string @var{s} using
 | |
| 
 | |
| @smallexample
 | |
| wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
 | |
| @end smallexample
 | |
| 
 | |
| Please note that the @code{mbslen} function is quite inefficient.  The
 | |
| implementation of @code{mbstouwcs} with @code{mbslen} would have to
 | |
| perform the conversion of the multibyte character input string twice, and
 | |
| this conversion might be quite expensive.  So it is necessary to think
 | |
| about the consequences of using the easier but imprecise method before
 | |
| doing the work twice.
 | |
| 
 | |
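The single-pass alternative mentioned earlier (allocate pessimistically, then resize the block before returning it) might look like the following sketch.  The name @code{mbstouwcs_shrink} is hypothetical, and the error handling shown is just one possible choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of mbstouwcs: one conversion pass, with error
   checking, followed by shrinking the block to the size really used.
   Returns NULL on invalid input or allocation failure.  */
wchar_t *
mbstouwcs_shrink (const char *s)
{
  size_t len;
  wchar_t *result, *wcp, *shrunk;
  mbstate_t state;

  assert (s != NULL);
  len = strlen (s);
  /* Pessimistic guess: at most one wide character per input byte.  */
  result = malloc ((len + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;
  wcp = result;
  memset (&state, '\0', sizeof (state));
  while (len > 0)
    {
      wchar_t wc;
      size_t nbytes = mbrtowc (&wc, s, len, &state);
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or truncated multibyte character.  */
          free (result);
          return NULL;
        }
      if (nbytes == 0)
        /* Defensive: a NUL is not expected within LEN bytes.  */
        break;
      *wcp++ = towupper (wc);
      s += nbytes;
      len -= nbytes;
    }
  *wcp++ = L'\0';
  /* Give back the unused tail; keep the old block if that fails.  */
  shrunk = realloc (result, (size_t) (wcp - result) * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```

This converts the input only once, at the price of a temporarily oversized allocation; whether that is preferable to counting first depends on how expensive the conversion is for the character set in use.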
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
 | |
| The @code{wcrtomb} function (``wide character restartable to
 | |
| multibyte'') converts a single wide character into a multibyte string
 | |
| corresponding to that wide character.
 | |
| 
 | |
| If @var{s} is a null pointer, the function resets the state stored in
 | |
| the object pointed to by @var{ps} (or the internal @code{mbstate_t}
 | |
| object) to the initial state.  This can also be achieved by a call like
 | |
| this:
 | |
| 
 | |
| @smallexample
 | |
| wcrtomb (temp_buf, L'\0', ps)
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| since, if @var{s} is a null pointer, @code{wcrtomb} performs as if it
 | |
| writes into an internal buffer, which is guaranteed to be large enough.
 | |
| 
 | |
| If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
 | |
| necessary, a shift sequence to get the state @var{ps} into the initial
 | |
| state followed by a single NUL byte, which is stored in the string
 | |
| @var{s}.
 | |
| 
 | |
| Otherwise a byte sequence (possibly including shift sequences) is written
 | |
| into the string @var{s}.  This only happens if @var{wc} is a valid wide
 | |
| character (i.e., it has a multibyte representation in the character set
 | |
| selected by the locale of the @code{LC_CTYPE} category).  If @var{wc} is not
 | |
| a valid wide character, nothing is stored in the string @var{s},
 | |
| @code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
 | |
| is undefined and the return value is @code{(size_t) -1}.
 | |
| 
 | |
| If no error occurred the function returns the number of bytes stored in
 | |
| the string @var{s}.  This includes all bytes representing shift
 | |
| sequences.
 | |
| 
 | |
| One word about the interface of the function: there is no parameter
 | |
| specifying the length of the array @var{s}.  Instead the function
 | |
| assumes that there are at least @code{MB_CUR_MAX} bytes available since
 | |
| this is the maximum length of any byte sequence representing a single
 | |
| character.  So the caller has to make sure that there is enough space
 | |
| available, otherwise buffer overruns can occur.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
 | |
| declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| Using @code{wcrtomb} is as easy as using @code{mbrtowc}.  The following
 | |
| example appends a wide character string to a multibyte character string.
 | |
| Again, the code is not really useful (or correct); it is simply here to
 | |
| demonstrate the use and some problems.
 | |
| 
 | |
| @smallexample
 | |
| char *
 | |
| mbscatwcs (char *s, size_t len, const wchar_t *ws)
 | |
| @{
 | |
|   mbstate_t state;
 | |
|   /* @r{Find the end of the existing string.}  */
 | |
|   char *wp = strchr (s, '\0');
 | |
|   len -= wp - s;
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   do
 | |
|     @{
 | |
|       size_t nbytes;
 | |
|       if (len < MB_CUR_MAX)
 | |
|         @{
 | |
|           /* @r{We cannot guarantee that the next}
 | |
|              @r{character fits into the buffer, so}
 | |
|              @r{return an error.}  */
 | |
|           errno = E2BIG;
 | |
|           return NULL;
 | |
|         @}
 | |
|       nbytes = wcrtomb (wp, *ws, &state);
 | |
|       if (nbytes == (size_t) -1)
 | |
|         /* @r{Error in the conversion.}  */
 | |
|         return NULL;
 | |
|       len -= nbytes;
 | |
|       wp += nbytes;
 | |
|     @}
 | |
|   while (*ws++ != L'\0');
 | |
|   return s;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| First the function has to find the end of the string currently in the
 | |
| array @var{s}.  The @code{strchr} call does this very efficiently since a
 | |
| requirement for multibyte character representations is that the NUL byte
 | |
| is never used except to represent itself (and in this context, the end
 | |
| of the string).
 | |
| 
 | |
| After initializing the state object the loop is entered where the first
 | |
| task is to make sure there is enough room in the array @var{s}.  We
 | |
| abort if there are not at least @code{MB_CUR_MAX} bytes available.  This
 | |
| is not always optimal but we have no other choice.  We might have fewer
 | |
| than @code{MB_CUR_MAX} bytes available but the next multibyte character
 | |
| might also be only one byte long.  At the time the @code{wcrtomb} call
 | |
| returns it is too late to decide whether the buffer was large enough.  If
 | |
| this solution is unsuitable, there is a very slow but more accurate
 | |
| solution.
 | |
| 
 | |
| @smallexample
 | |
|   @dots{}
 | |
|   if (len < MB_CUR_MAX)
 | |
|     @{
 | |
|       mbstate_t temp_state;
 | |
|       char temp_buf[MB_LEN_MAX];
 | |
|       memcpy (&temp_state, &state, sizeof (state));
 | |
|       if (wcrtomb (temp_buf, *ws, &temp_state) > len)
 | |
|         @{
 | |
|           /* @r{The next character does not fit}
 | |
|              @r{into the buffer, so}
 | |
|              @r{return an error.}  */
 | |
|           errno = E2BIG;
 | |
|           return NULL;
 | |
|         @}
 | |
|     @}
 | |
|   @dots{}
 | |
| @end smallexample
 | |
| 
 | |
| Here we perform the conversion that might overflow the buffer into a
 | |
| scratch buffer first, so that we are afterwards in the position to make an
 | |
| exact decision about the buffer size.  Please note that the destination of
 | |
| this trial conversion cannot be a null pointer: as described above,
 | |
| @code{wcrtomb} then ignores @var{wc} and acts as if the NUL wide character
 | |
| were being converted.  The
 | |
| most unusual thing about this piece of code certainly is the duplication
 | |
| of the conversion state object, but if a change of the state is necessary
 | |
| to emit the next multibyte character, we want to have the same shift state
 | |
| change performed in the real conversion.  Therefore, we have to preserve
 | |
| the initial shift state information.
 | |
| 
 | |
| There are certainly many more and even better solutions to this problem.
 | |
| This example is only provided for educational purposes.
 | |
| 
 | |
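One such alternative, sketched here under stated assumptions, converts every character into a scratch buffer of @code{MB_LEN_MAX} bytes and copies it to the destination only if it fits.  The name @code{mbscatwcs_tmp} is hypothetical; @var{len} is taken to be the total size of the array @var{s}, and on failure the destination may be left without a terminating NUL byte.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical variant of mbscatwcs: append WS to the multibyte
   string in S, an array of LEN bytes in total.  Returns S, or NULL
   with errno set (E2BIG if the result does not fit).  */
char *
mbscatwcs_tmp (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  char *wp;

  assert (s != NULL && ws != NULL);
  /* Find the end of the existing string; LEN becomes the room left.  */
  wp = strchr (s, '\0');
  len -= (size_t) (wp - s);
  memset (&state, '\0', sizeof (state));
  do
    {
      /* MB_LEN_MAX bytes are enough for any single character.  */
      char tmp[MB_LEN_MAX];
      size_t nbytes = wcrtomb (tmp, *ws, &state);
      if (nbytes == (size_t) -1)
        /* Error in the conversion; errno is EILSEQ.  */
        return NULL;
      if (nbytes > len)
        {
          /* Now we know exactly that this character does not fit.  */
          errno = E2BIG;
          return NULL;
        }
      memcpy (wp, tmp, nbytes);
      wp += nbytes;
      len -= nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

Since the real conversion is performed only once per character, no copy of the state object is needed; the cost is one extra @code{memcpy} of a few bytes for each character.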
| @node Converting Strings
 | |
| @subsection Converting Multibyte and Wide Character Strings
 | |
| 
 | |
| The functions described in the previous section only convert a single
 | |
| character at a time.  Most operations to be performed in real-world
 | |
| programs include strings and therefore the @w{ISO C} standard also
 | |
| defines conversions on entire strings.  However, the defined set of
 | |
| functions is quite limited; therefore, the GNU C library contains a few
 | |
| extensions that can help in some important situations.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
 | |
| The @code{mbsrtowcs} function (``multibyte string restartable to wide
 | |
| character string'') converts a NUL-terminated multibyte character
 | |
| string at @code{*@var{src}} into an equivalent wide character string,
 | |
| including the NUL wide character at the end.  The conversion is started
 | |
| using the state information from the object pointed to by @var{ps} or
 | |
| from an internal object of @code{mbsrtowcs} if @var{ps} is a null
 | |
| pointer.  Before returning, the state object is updated to match the state
 | |
| after the last converted character.  The state is the initial state if the
 | |
| terminating NUL byte is reached and converted.
 | |
| 
 | |
| If @var{dst} is not a null pointer, the result is stored in the array
 | |
| pointed to by @var{dst}; otherwise, the conversion result is not
 | |
| available since it is stored in an internal buffer.
 | |
| 
 | |
| If @var{len} wide characters are stored in the array @var{dst} before
 | |
| reaching the end of the input string, the conversion stops and @var{len}
 | |
| is returned.  If @var{dst} is a null pointer, @var{len} is never checked.
 | |
| 
 | |
| Another reason for a premature return from the function call is if the
 | |
| input string contains an invalid multibyte sequence.  In this case the
 | |
| global variable @code{errno} is set to @code{EILSEQ} and the function
 | |
| returns @code{(size_t) -1}.
 | |
| 
 | |
| @c XXX The ISO C9x draft seems to have a problem here.  It says that PS
 | |
| @c is not updated if DST is NULL.  This is not said straightforward and
 | |
| @c none of the other functions is described like this.  It would make sense
 | |
| @c to define the function this way but I don't think it is meant like this.
 | |
| 
 | |
| In all other cases the function returns the number of wide characters
 | |
| converted during this call.  If @var{dst} is not null, @code{mbsrtowcs}
 | |
| stores in the pointer pointed to by @var{src} either a null pointer (if
 | |
| the NUL byte in the input string was reached) or the address of the byte
 | |
| following the last converted multibyte character.
 | |
| 
 | |
| @pindex wchar.h
 | |
| @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
 | |
| declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
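Since @var{len} is not checked when @var{dst} is a null pointer, a first call can be used just to count the wide characters, and a second call then performs the real conversion into a block of exactly the right size.  The following is a sketch under that assumption; the name @code{mbsdupwcs} is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character copy
   of the multibyte string S, or NULL on invalid input or allocation
   failure.  */
wchar_t *
mbsdupwcs (const char *s)
{
  mbstate_t state;
  const char *p = s;
  wchar_t *result;
  size_t n;

  assert (s != NULL);
  /* First pass: count only.  The count excludes the terminator.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &p, 0, &state);
  if (n == (size_t) -1)
    return NULL;
  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;
  /* Second pass: convert, starting again in the initial state.  */
  p = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &p, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

As with @code{mbslen}, the input is scanned twice, so this trades conversion time for an exact allocation.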
| The definition of the @code{mbsrtowcs} function has one important
 | |
| limitation.  The requirement that @code{*@var{src}} has to be a NUL-terminated
 | |
| string causes problems if one wants to convert buffers with text.  A
 | |
| buffer is normally not a collection of NUL-terminated strings but instead a
 | |
| continuous collection of lines, separated by newline characters.  Now
 | |
| assume that a function to convert one line from a buffer is needed.  Since
 | |
| the line is not NUL-terminated, the source pointer cannot directly point
 | |
| into the unmodified text buffer.  This means that either one inserts the NUL
 | |
| byte at the appropriate place for the time of the @code{mbsrtowcs}
 | |
| function call (which is not doable for a read-only buffer or in a
 | |
| multi-threaded application) or one copies the line in an extra buffer
 | |
| where it can be terminated by a NUL byte.  Note that it is not in general
 | |
| possible to limit the number of characters to convert by setting the
 | |
| parameter @var{len} to any specific value.  Since it is not known how
 | |
| many bytes each multibyte character sequence is in length, one can only
 | |
| guess.
 | |
| 
 | |
| @cindex stateful
 | |
| There is still a problem with the method of NUL-terminating a line right
 | |
| after the newline character, which could lead to very strange results.
 | |
| As said in the description of the @code{mbsrtowcs} function above, the
 | |
| conversion state is guaranteed to be in the initial shift state after
 | |
| processing the NUL byte at the end of the input string.  But this NUL
 | |
| byte is not really part of the text (i.e., the conversion state after
 | |
| the newline in the original text could be something different than the
 | |
| initial shift state and therefore the first character of the next line
 | |
| is encoded using this state).  But the state in question is never
 | |
| accessible to the user since the conversion stops after the NUL byte
 | |
| (which resets the state).  Most stateful character sets in use today
 | |
| require that the shift state after a newline be the initial state--but
 | |
| this is not a strict guarantee.  Therefore, simply NUL-terminating a
 | |
| piece of a running text is not always an adequate solution and
 | |
| should never be used in general-purpose code.
 | |
| 
 | |
| The generic conversion interface (@pxref{Generic Charset Conversion})
 | |
| does not have this limitation (it simply works on buffers, not
 | |
| strings), and the GNU C library contains a set of functions that take
 | |
| additional parameters specifying the maximal number of bytes that are
 | |
| consumed from the input string.  This way the problem of
 | |
| @code{mbsrtowcs}'s example above could be solved by determining the line
 | |
| length and passing this length to the function.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment ISO
 | |
| @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
 | |
| The @code{wcsrtombs} function (``wide character string restartable to
 | |
| multibyte string'') converts the NUL-terminated wide character string at
 | |
| @code{*@var{src}} into an equivalent multibyte character string and
 | |
| stores the result in the array pointed to by @var{dst}.  The NUL wide
 | |
| character is also converted.  The conversion starts in the state
 | |
| described in the object pointed to by @var{ps} or by a state object
 | |
| local to @code{wcsrtombs} in case @var{ps} is a null pointer.  If
 | |
| @var{dst} is a null pointer, the conversion is performed as usual but the
 | |
| result is not available.  If all characters of the input string were
 | |
| successfully converted and if @var{dst} is not a null pointer, the
 | |
| pointer pointed to by @var{src} gets assigned a null pointer.
 | |
| 
 | |
| If one of the wide characters in the input string has no valid multibyte
 | |
| character equivalent, the conversion stops early, sets the global
 | |
| variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
 | |
| 
 | |
| Another reason for a premature stop is if @var{dst} is not a null
 | |
| pointer and the next converted character would require more than
 | |
| @var{len} bytes in total in the array @var{dst}.  In this case (and if
 | |
| @var{dst} is not a null pointer) the pointer pointed to by @var{src} is
 | |
| assigned a value pointing to the wide character right after the last one
 | |
| successfully converted.
 | |
| 
 | |
| Except in the case of an encoding error the return value of the
 | |
| @code{wcsrtombs} function is the number of bytes in all the multibyte
 | |
| character sequences stored in @var{dst}.  Before returning, the state in
 | |
| the object pointed to by @var{ps} (or the internal object in case
 | |
| @var{ps} is a null pointer) is updated to reflect the state after the
 | |
| last conversion.  The state is the initial shift state in case the
 | |
| terminating NUL wide character was converted.
 | |
| 
 | |
| @pindex wchar.h
 | |
| The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
 | |
| @w{ISO C90} and is declared in @file{wchar.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| The restriction mentioned above for the @code{mbsrtowcs} function applies
 | |
| here also.  There is no possibility of directly controlling the number of
 | |
| input characters.  One has to place the NUL wide character at the correct
 | |
| place or control the consumed input indirectly via the available output
 | |
| array size (the @var{len} parameter).
 | |
| 
 | |
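Controlling the consumed input through the output size can be sketched as a loop that converts a long wide character string in small chunks until the pointer at @var{src} becomes null.  The name @code{wcs_put} and the chunk size of @code{MB_LEN_MAX} bytes are assumptions made for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: write the multibyte form of WS to FP chunk by
   chunk.  Returns the number of bytes written (the terminating NUL is
   not written), or (size_t) -1 on error.  */
size_t
wcs_put (const wchar_t *ws, FILE *fp)
{
  mbstate_t state;
  size_t total = 0;

  assert (ws != NULL && fp != NULL);
  memset (&state, '\0', sizeof (state));
  /* wcsrtombs sets WS to a null pointer once the NUL wide character
     has been converted.  */
  while (ws != NULL)
    {
      char chunk[MB_LEN_MAX];
      size_t n = wcsrtombs (chunk, &ws, sizeof (chunk), &state);
      if (n == (size_t) -1)
        /* A wide character without multibyte equivalent.  */
        return (size_t) -1;
      if (n == 0 && ws != NULL)
        /* No progress: the chunk is too small for the next character.  */
        return (size_t) -1;
      if (fwrite (chunk, 1, n, fp) != n)
        return (size_t) -1;
      total += n;
    }
  return total;
}
```

The same state object is passed to every call, so shift sequences emitted in one chunk remain in effect for the next one.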
| @comment wchar.h
 | |
| @comment GNU
 | |
| @deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
 | |
| The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
 | |
| function.  All the parameters are the same except for @var{nmc}, which is
 | |
| new.  The return value is the same as for @code{mbsrtowcs}.
 | |
| 
 | |
| This new parameter specifies how many bytes at most can be used from the
 | |
| multibyte character string.  In other words, the multibyte character
 | |
| string @code{*@var{src}} need not be NUL-terminated.  But if a NUL byte
 | |
| is found within the @var{nmc} first bytes of the string, the conversion
 | |
| stops here.
 | |
| 
 | |
| This function is a GNU extension.  It is meant to work around the
 | |
| problems mentioned above.  Now it is possible to convert a buffer with
 | |
| multibyte character text piece for piece without having to care about
 | |
| inserting NUL bytes and the effect of NUL bytes on the conversion state.
 | |
| @end deftypefun
 | |
| 
 | |
| A function to convert a multibyte string into a wide character string
 | |
| and display it could be written like this (this is not a really useful
 | |
| example):
 | |
| 
 | |
| @smallexample
 | |
| void
 | |
| showmbs (const char *src, FILE *fp)
 | |
| @{
 | |
|   mbstate_t state;
 | |
|   int cnt = 0;
 | |
|   memset (&state, '\0', sizeof (state));
 | |
|   while (1)
 | |
|     @{
 | |
|       wchar_t linebuf[100];
 | |
|       const char *endp = strchr (src, '\n');
 | |
|       size_t n;
 | |
| 
 | |
|       /* @r{Exit if there is no more line.}  */
 | |
|       if (endp == NULL)
 | |
|         break;
 | |
| 
 | |
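|       n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
 | |
|       if (n == (size_t) -1)
 | |
|         /* @r{Invalid multibyte sequence.}  */
 | |
|         break;
 | |
|       linebuf[n] = L'\0';
 | |
|       fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);
 | |
|       /* @r{Continue after the newline.}  */
 | |
|       src = endp + 1;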
 | |
|     @}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| There is no problem with the state after a call to @code{mbsnrtowcs}.
 | |
| Since we don't insert characters in the strings that were not in there
 | |
| right from the beginning and we use @var{state} only for the conversion
 | |
| of the given buffer, there is no problem with altering the state.
 | |
| 
 | |
| @comment wchar.h
 | |
| @comment GNU
 | |
| @deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
 | |
| The @code{wcsnrtombs} function implements the conversion from wide
 | |
| character strings to multibyte character strings.  It is similar to
 | |
| @code{wcsrtombs} but, just like @code{mbsnrtowcs}, it takes an extra
 | |
| parameter, which specifies the length of the input string.
 | |
| 
 | |
| No more than @var{nwc} wide characters from the input string
 | |
| @code{*@var{src}} are converted.  If the input string contains a NUL
 | |
| wide character in the first @var{nwc} characters, the conversion stops at
 | |
| this place.
 | |
| 
 | |
| The @code{wcsnrtombs} function is a GNU extension and just like
 | |
| @code{mbsnrtowcs} helps in situations where no NUL-terminated input
 | |
| strings are available.
 | |
| @end deftypefun
 | |
| 
 | |
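A possible use of @code{wcsnrtombs} is to convert a slice of a larger wide character buffer that is not itself NUL-terminated.  The sketch below assumes a hypothetical helper named @code{wslice_to_mbs} and treats output that does not fit as an error.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NWC wide characters at WS into the
   DSTLEN bytes at DST and NUL-terminate the result.  Returns the
   number of bytes stored (without the terminator), or (size_t) -1 on
   a conversion error or if DST is too small.  */
size_t
wslice_to_mbs (char *dst, size_t dstlen, const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;
  size_t n;

  assert (dst != NULL && ws != NULL);
  if (dstlen == 0)
    return (size_t) -1;
  memset (&state, '\0', sizeof (state));
  /* Reserve one byte of DST for the terminator added below.  */
  n = wcsnrtombs (dst, &src, nwc, dstlen - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;
  /* SRC is null if a NUL wide character was reached.  Otherwise it
     points behind the last converted character; stopping before the
     end of the slice means the output did not fit.  */
  if (src != NULL && src < ws + nwc && *src != L'\0')
    return (size_t) -1;
  dst[n] = '\0';
  return n;
}
```

For a stateful character set the result of such a slice may end in a shift state other than the initial one, which this sketch does not attempt to handle.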
| 
 | |
| @node Multibyte Conversion Example
 | |
| @subsection A Complete Multibyte Conversion Example
 | |
| 
 | |
| The example programs given in the last sections are only brief and do
 | |
| not contain all the error checking, etc.  Presented here is a complete
 | |
| and documented example.  It features the @code{mbrtowc} function but it
 | |
| should be easy to derive versions using the other functions.
 | |
| 
 | |
| @smallexample
 | |
| int
 | |
| file_mbsrtowcs (int input, int output)
 | |
| @{
 | |
|   /* @r{Note the use of @code{MB_LEN_MAX}.}
 | |
|      @r{@code{MB_CUR_MAX} cannot portably be used here.}  */
 | |
|   char buffer[BUFSIZ + MB_LEN_MAX];
 | |
|   mbstate_t state;
 | |
|   int filled = 0;
 | |
|   int eof = 0;
 | |
| 
 | |
|   /* @r{Initialize the state.}  */
 | |
|   memset (&state, '\0', sizeof (state));
 | |
| 
 | |
|   while (!eof)
 | |
|     @{
 | |
|       ssize_t nread;
 | |
|       ssize_t nwrite;
 | |
|       char *inp = buffer;
 | |
|       wchar_t outbuf[BUFSIZ];
 | |
|       wchar_t *outp = outbuf;
 | |
| 
 | |
|       /* @r{Fill up the buffer from the input file.}  */
 | |
|       nread = read (input, buffer + filled, BUFSIZ);
 | |
|       if (nread < 0)
 | |
|         @{
 | |
|           perror ("read");
 | |
|           return 0;
 | |
|         @}
 | |
|       /* @r{If we reach end of file, make a note to read no more.} */
 | |
|       if (nread == 0)
 | |
|         eof = 1;
 | |
| 
 | |
|       /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
 | |
|       filled += nread;
 | |
| 
 | |
|       /* @r{Convert those bytes to wide characters--as many as we can.} */
 | |
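|       while (filled > 0)
 | |
|         @{
 | |
|           /* @r{Remember the state: an incomplete character is}
 | |
|              @r{converted again once more input has been read.}  */
 | |
|           mbstate_t saved = state;
 | |
|           size_t thislen = mbrtowc (outp, inp, filled, &state);
 | |
|           /* @r{Stop converting at invalid character;}
 | |
|              @r{this can mean we have read just the first part}
 | |
|              @r{of a valid character.}  */
 | |
|           if (thislen >= (size_t) -2)
 | |
|             @{
 | |
|               state = saved;
 | |
|               break;
 | |
|             @}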
 | |
|           /* @r{We want to handle embedded NUL bytes}
 | |
|              @r{but the return value is 0.  Correct this.}  */
 | |
|           if (thislen == 0)
 | |
|             thislen = 1;
 | |
|           /* @r{Advance past this character.} */
 | |
|           inp += thislen;
 | |
|           filled -= thislen;
 | |
|           ++outp;
 | |
|         @}
 | |
| 
 | |
|       /* @r{Write the wide characters we just made.}  */
 | |
|       nwrite = write (output, outbuf,
 | |
|                       (outp - outbuf) * sizeof (wchar_t));
 | |
|       if (nwrite < 0)
 | |
|         @{
 | |
|           perror ("write");
 | |
|           return 0;
 | |
|         @}
 | |
| 
 | |
|       /* @r{See if we have a @emph{real} invalid character.} */
 | |
|       if ((eof && filled > 0) || filled >= MB_CUR_MAX)
 | |
|         @{
 | |
|           error (0, 0, "invalid multibyte character");
 | |
|           return 0;
 | |
|         @}
 | |
| 
 | |
|       /* @r{If any characters must be carried forward,}
 | |
|          @r{put them at the beginning of @code{buffer}.} */
 | |
|       if (filled > 0)
 | |
|         memmove (buffer, inp, filled);
 | |
|     @}
 | |
| 
 | |
|   return 1;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| 
 | |
| @node Non-reentrant Conversion
 | |
| @section Non-reentrant Conversion Function
 | |
| 
 | |
| The functions described in the previous section are defined in
 | |
| @w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
 | |
| also contained functions for character set conversion.  The reason that
 | |
| these original functions are not described first is that they are almost
 | |
| entirely useless.
 | |
| 
 | |
| The problem is that all the conversion functions described in the
 | |
| original @w{ISO C90} use a local state.  Using a local state implies that
 | |
| multiple conversions at the same time (not only when using threads)
 | |
| cannot be done, and that you cannot first convert single characters and
 | |
| then strings since you cannot tell the conversion functions which state
 | |
| to use.
 | |
| 
 | |
| These original functions are therefore usable only in a very limited set
 | |
| of situations.  One must complete converting the entire string before
 | |
| starting a new one, and each string/text must be converted with the same
 | |
| function (there is no problem with the library itself; it is guaranteed
 | |
| that no library function changes the state of any of these functions).
 | |
| @strong{For the above reasons it is strongly recommended that the functions
 | |
| described in the previous section be used in place of non-reentrant
 | |
| conversion functions.}
 | |
| 
 | |
| @menu
 | |
| * Non-reentrant Character Conversion::  Non-reentrant Conversion of Single
 | |
|                                          Characters.
 | |
| * Non-reentrant String Conversion::     Non-reentrant Conversion of Strings.
 | |
| * Shift State::                         States in Non-reentrant Functions.
 | |
| @end menu
 | |
| 
 | |
| @node Non-reentrant Character Conversion
 | |
| @subsection Non-reentrant Conversion of Single Characters
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypefun int mbtowc (wchar_t *restrict @var{result}, const char *restrict @var{string}, size_t @var{size})
 | |
| The @code{mbtowc} (``multibyte to wide character'') function when called
 | |
| with non-null @var{string} converts the first multibyte character
 | |
| beginning at @var{string} to its corresponding wide character code.  It
 | |
| stores the result in @code{*@var{result}}.
 | |
| 
 | |
| @code{mbtowc} never examines more than @var{size} bytes.  (The idea is
 | |
| to supply for @var{size} the number of bytes of data you have in hand.)
 | |
| 
 | |
| @code{mbtowc} with non-null @var{string} distinguishes three
 | |
| possibilities: the first @var{size} bytes at @var{string} start with
 | |
| valid multibyte characters, they start with an invalid byte sequence or
 | |
| just part of a character, or @var{string} points to an empty string (a
 | |
| null character).
 | |
| 
 | |
| For a valid multibyte character, @code{mbtowc} converts it to a wide
 | |
| character and stores that in @code{*@var{result}}, and returns the
 | |
| number of bytes in that character (always at least @math{1} and never
 | |
| more than @var{size}).
 | |
| 
 | |
| For an invalid byte sequence, @code{mbtowc} returns @math{-1}.  For an
 | |
| empty string, it returns @math{0}, also storing @code{'\0'} in
 | |
| @code{*@var{result}}.
 | |
| 
 | |
| If the multibyte character code uses shift characters, then
 | |
| @code{mbtowc} maintains and updates a shift state as it scans.  If you
 | |
| call @code{mbtowc} with a null pointer for @var{string}, that
 | |
| initializes the shift state to its standard initial value.  It also
 | |
| returns nonzero if the multibyte character code in use actually has a
 | |
| shift state.  @xref{Shift State}.
 | |
| @end deftypefun
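The following is a minimal sketch, not part of the library, of how the
return values of @code{mbtowc} described above can drive a scan over a
string.  The helper name @code{count_mb_chars} is hypothetical, and the
sketch assumes that the locale in effect matches the encoding of the
string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the multibyte characters in S using mbtowc.
   Returns -1 if S contains an invalid or incomplete sequence.
   count_mb_chars is a hypothetical helper, not a library function.  */
long
count_mb_chars (const char *s)
{
  size_t left;
  long count = 0;

  assert (s != NULL);
  left = strlen (s);

  /* Reset the hidden shift state before scanning a new string.  */
  (void) mbtowc (NULL, NULL, 0);

  while (left > 0)
    {
      wchar_t wc;
      int n = mbtowc (&wc, s, left);

      if (n < 0)
        return -1;              /* Invalid or truncated character.  */
      if (n == 0)
        break;                  /* Null character: end of string.  */

      /* N bytes formed one character; skip past them.  */
      s += n;
      left -= (size_t) n;
      count++;
    }
  return count;
}
```

Because the shift state is hidden inside @code{mbtowc}, a helper like
this must not be interleaved with the scan of another string.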
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
 | |
| The @code{wctomb} (``wide character to multibyte'') function converts
 | |
| the wide character code @var{wchar} to its corresponding multibyte
 | |
| character sequence, and stores the result in bytes starting at
 | |
| @var{string}.  At most @code{MB_CUR_MAX} bytes are stored.
 | |
| 
 | |
| @code{wctomb} with non-null @var{string} distinguishes three
 | |
| possibilities for @var{wchar}: a valid wide character code (one that can
 | |
| be translated to a multibyte character), an invalid code, and
 | |
| @code{L'\0'}.
 | |
| 
 | |
| Given a valid code, @code{wctomb} converts it to a multibyte character,
 | |
| storing the bytes starting at @var{string}.  Then it returns the number
 | |
| of bytes in that character (always at least @math{1} and never more
 | |
| than @code{MB_CUR_MAX}).
 | |
| 
 | |
| If @var{wchar} is an invalid wide character code, @code{wctomb} returns
 | |
| @math{-1}.  If @var{wchar} is @code{L'\0'}, it stores @code{'\0'} in
 | |
| the bytes starting at @var{string}, preceded by any shift sequence needed
 | |
| to restore the initial shift state, and returns the number of bytes
 | |
| stored (@math{1} if no shift sequence is needed).
 | |
| 
 | |
| If the multibyte character code uses shift characters, then
 | |
| @code{wctomb} maintains and updates a shift state as it scans.  If you
 | |
| call @code{wctomb} with a null pointer for @var{string}, that
 | |
| initializes the shift state to its standard initial value.  It also
 | |
| returns nonzero if the multibyte character code in use actually has a
 | |
| shift state.  @xref{Shift State}.
 | |
| 
 | |
| Calling this function with a @var{wchar} argument of zero when
 | |
| @var{string} is not null has the side-effect of reinitializing the
 | |
| stored shift state @emph{as well as} storing the multibyte character
 | |
| @code{'\0'} and returning the number of bytes stored.
 | |
| @end deftypefun
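As an illustration of the return values just described, here is a
sketch that encodes a wide character string one character at a time.
The helper name @code{encode_wide} and its buffer convention are
assumptions made for this example.  Each character is first converted
into a scratch array of @code{MB_LEN_MAX} bytes so that the output
buffer is never overrun.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Encode the wide string WS into OUT, which has room for OUTSIZE bytes
   including the terminating null byte.  Returns the number of bytes
   stored, not counting the terminator, or (size_t) -1 if a character
   cannot be converted or OUT is too small.
   encode_wide is a hypothetical helper, not a library function.  */
size_t
encode_wide (const wchar_t *ws, char *out, size_t outsize)
{
  size_t used = 0;

  assert (ws != NULL && out != NULL);

  /* Reset the hidden shift state.  */
  (void) wctomb (NULL, L'\0');

  for (;;)
    {
      char tmp[MB_LEN_MAX];
      int n = wctomb (tmp, *ws);

      if (n < 0 || (size_t) n > outsize - used)
        return (size_t) -1;
      memcpy (out + used, tmp, (size_t) n);

      if (*ws == L'\0')
        /* TMP held any shift sequence plus the null byte.  */
        return used + (size_t) n - 1;

      used += (size_t) n;
      ws++;
    }
}
```

Converting the terminating @code{L'\0'} through @code{wctomb} as well,
rather than storing a null byte by hand, lets the function emit
whatever shift sequence a stateful encoding needs at the end of the
string.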
 | |
| 
 | |
| Similar to @code{mbrlen} there is also a non-reentrant function that
 | |
| computes the length of a multibyte character.  It can be defined in
 | |
| terms of @code{mbtowc}.
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypefun int mblen (const char *@var{string}, size_t @var{size})
 | |
| The @code{mblen} function with a non-null @var{string} argument returns
 | |
| the number of bytes that make up the multibyte character beginning at
 | |
| @var{string}, never examining more than @var{size} bytes.  (The idea is
 | |
| to supply for @var{size} the number of bytes of data you have in hand.)
 | |
| 
 | |
| The return value of @code{mblen} distinguishes three possibilities: the
 | |
| first @var{size} bytes at @var{string} start with valid multibyte
 | |
| characters, they start with an invalid byte sequence or just part of a
 | |
| character, or @var{string} points to an empty string (a null character).
 | |
| 
 | |
| For a valid multibyte character, @code{mblen} returns the number of
 | |
| bytes in that character (always at least @code{1} and never more than
 | |
| @var{size}).  For an invalid byte sequence, @code{mblen} returns
 | |
| @math{-1}.  For an empty string, it returns @math{0}.
 | |
| 
 | |
| If the multibyte character code uses shift characters, then @code{mblen}
 | |
| maintains and updates a shift state as it scans.  If you call
 | |
| @code{mblen} with a null pointer for @var{string}, that initializes the
 | |
| shift state to its standard initial value.  It also returns a nonzero
 | |
| value if the multibyte character code in use actually has a shift state.
 | |
| @xref{Shift State}.
 | |
| 
 | |
| @pindex stdlib.h
 | |
| The function @code{mblen} is declared in @file{stdlib.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| 
 | |
| @node Non-reentrant String Conversion
 | |
| @subsection Non-reentrant Conversion of Strings
 | |
| 
 | |
| For convenience the @w{ISO C90} standard also defines functions to
 | |
| convert entire strings instead of single characters.  These functions
 | |
| suffer from the same problems as their reentrant counterparts from
 | |
| @w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
 | |
| The @code{mbstowcs} (``multibyte string to wide character string'')
 | |
| function converts the null-terminated string of multibyte characters
 | |
| @var{string} to an array of wide character codes, storing not more than
 | |
| @var{size} wide characters into the array beginning at @var{wstring}.
 | |
| The terminating null character counts towards the size, so if @var{size}
 | |
| is less than or equal to the number of wide characters resulting from
 | |
| @var{string}, no terminating null character is stored.
 | |
| 
 | |
| The conversion of characters from @var{string} begins in the initial
 | |
| shift state.
 | |
| 
 | |
| If an invalid multibyte character sequence is found, the @code{mbstowcs}
 | |
| function returns a value of @math{-1}.  Otherwise, it returns the number
 | |
| of wide characters stored in the array @var{wstring}.  This number does
 | |
| not include the terminating null character, which is present if the
 | |
| number is less than @var{size}.
 | |
| 
 | |
| Here is an example showing how to convert a string of multibyte
 | |
| characters, allocating enough space for the result.
 | |
| 
 | |
| @smallexample
 | |
| wchar_t *
 | |
| mbstowcs_alloc (const char *string)
 | |
| @{
 | |
|   size_t size = strlen (string) + 1;
 | |
|   wchar_t *buf = xmalloc (size * sizeof (wchar_t));
 | |
| 
 | |
|   size = mbstowcs (buf, string, size);
 | |
|   if (size == (size_t) -1)
 | |
|     @{
 | |
|       /* @r{Conversion failed; release the buffer.}  */
 | |
|       free (buf);
 | |
|       return NULL;
 | |
|     @}
 | |
|   buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
 | |
|   return buf;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| @end deftypefun
 | |
| 
 | |
| @comment stdlib.h
 | |
| @comment ISO
 | |
| @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
 | |
| The @code{wcstombs} (``wide character string to multibyte string'')
 | |
| function converts the null-terminated wide character array @var{wstring}
 | |
| into a string containing multibyte characters, storing not more than
 | |
| @var{size} bytes starting at @var{string}, followed by a terminating
 | |
| null character if there is room.  The conversion of characters begins in
 | |
| the initial shift state.
 | |
| 
 | |
| The terminating null character counts towards the size, so if @var{size}
 | |
| is less than or equal to the number of bytes needed to convert @var{wstring}, no
 | |
| terminating null character is stored.
 | |
| 
 | |
| If a code that does not correspond to a valid multibyte character is
 | |
| found, the @code{wcstombs} function returns a value of @math{-1}.
 | |
| Otherwise, the return value is the number of bytes stored in the array
 | |
| @var{string}.  This number does not include the terminating null character,
 | |
| which is present if the number is less than @var{size}.
 | |
| @end deftypefun
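A counterpart to the @code{mbstowcs_alloc} example can be sketched for
the opposite direction.  The name @code{wcstombs_alloc} is
hypothetical.  The sketch assumes that no wide character, including
the terminating null, needs more than @code{MB_CUR_MAX} bytes, and it
uses plain @code{malloc} instead of @code{xmalloc} so that it is
self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert WSTRING to a newly allocated multibyte string.  Returns NULL
   if memory is exhausted or a character cannot be converted.  The
   caller frees the result.
   wcstombs_alloc is a hypothetical helper, not a library function.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  size_t size;
  size_t len;
  char *buf;

  assert (wstring != NULL);

  /* Worst case: every wide character, and the terminator, needs
     MB_CUR_MAX bytes.  */
  size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  buf = malloc (size);
  if (buf == NULL)
    return NULL;

  len = wcstombs (buf, wstring, size);
  if (len == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  /* LEN is less than SIZE here, so the terminator was stored.  */
  return buf;
}
```

The buffer is deliberately oversized; a caller that keeps many such
strings could shrink it with @code{realloc} to @code{len + 1} bytes.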
 | |
| 
 | |
| @node Shift State
 | |
| @subsection States in Non-reentrant Functions
 | |
| 
 | |
| In some multibyte character codes, the @emph{meaning} of any particular
 | |
| byte sequence is not fixed; it depends on what other sequences have come
 | |
| earlier in the same string.  Typically there are just a few sequences that
 | |
| can change the meaning of other sequences; these few are called
 | |
| @dfn{shift sequences} and we say that they set the @dfn{shift state} for
 | |
| other sequences that follow.
 | |
| 
 | |
| To illustrate shift state and shift sequences, suppose we decide that
 | |
| the sequence @code{0200} (just one byte) enters Japanese mode, in which
 | |
| pairs of bytes in the range from @code{0240} to @code{0377} are single
 | |
| characters, while @code{0201} enters Latin-1 mode, in which single bytes
 | |
| in the range from @code{0240} to @code{0377} are characters, and
 | |
| interpreted according to the ISO Latin-1 character set.  This is a
 | |
| multibyte code that has two alternative shift states (``Japanese mode''
 | |
| and ``Latin-1 mode''), and two shift sequences that specify particular
 | |
| shift states.
 | |
| 
 | |
| When the multibyte character code in use has shift states, then
 | |
| @code{mblen}, @code{mbtowc}, and @code{wctomb} must maintain and update
 | |
| the current shift state as they scan the string.  To make this work
 | |
| properly, you must follow these rules:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| Before starting to scan a string, call the function with a null pointer
 | |
| for the multibyte character address---for example, @code{mblen (NULL,
 | |
| 0)}.  This initializes the shift state to its standard initial value.
 | |
| 
 | |
| @item
 | |
| Scan the string one character at a time, in order.  Do not ``back up''
 | |
| and rescan characters already scanned, and do not intersperse the
 | |
| processing of different strings.
 | |
| @end itemize
 | |
| 
 | |
| Here is an example of using @code{mblen} following these rules:
 | |
| 
 | |
| @smallexample
 | |
| void
 | |
| scan_string (char *s)
 | |
| @{
 | |
|   int length = strlen (s);
 | |
| 
 | |
|   /* @r{Initialize shift state.}  */
 | |
|   mblen (NULL, 0);
 | |
| 
 | |
|   while (1)
 | |
|     @{
 | |
|       int thischar = mblen (s, length);
 | |
|       /* @r{Deal with end of string and invalid characters.}  */
 | |
|       if (thischar == 0)
 | |
|         break;
 | |
|       if (thischar == -1)
 | |
|         @{
 | |
|           error ("invalid multibyte character");
 | |
|           break;
 | |
|         @}
 | |
|       /* @r{Advance past this character.}  */
 | |
|       s += thischar;
 | |
|       length -= thischar;
 | |
|     @}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
 | |
| reentrant when using a multibyte code that uses a shift state.  However,
 | |
| no other library functions call these functions, so you don't have to
 | |
| worry that the shift state will be changed mysteriously.
 | |
| 
 | |
| 
 | |
| @node Generic Charset Conversion
 | |
| @section Generic Charset Conversion
 | |
| 
 | |
| The conversion functions mentioned so far in this chapter all had in
 | |
| common that they operate on character sets that are not directly
 | |
| specified by the functions.  The multibyte encoding used is specified by
 | |
| the currently selected locale for the @code{LC_CTYPE} category.  The
 | |
| wide character set is fixed by the implementation (in the case of the GNU C
 | |
| library it is always UCS-4 encoded @w{ISO 10646}).
 | |
| 
 | |
| This of course causes several problems when it comes to general character
 | |
| conversion:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| For every conversion where neither the source nor the destination
 | |
| character set is the character set of the locale for the @code{LC_CTYPE}
 | |
| category, one has to change the @code{LC_CTYPE} locale using
 | |
| @code{setlocale}.
 | |
| 
 | |
| Changing the @code{LC_CTYPE} locale introduces major problems for the rest
 | |
| of the program since several more functions (e.g., the character
 | |
| classification functions, @pxref{Classification of Characters}) use the
 | |
| @code{LC_CTYPE} category.
 | |
| 
 | |
| @item
 | |
| Parallel conversions to and from different character sets are not
 | |
| possible since the @code{LC_CTYPE} selection is global and shared by all
 | |
| threads.
 | |
| 
 | |
| @item
 | |
| If neither the source nor the destination character set is the character
 | |
| set used for @code{wchar_t} representation, there is at least a two-step
 | |
| process necessary to convert a text using the functions above.  One would
 | |
| have to select the source character set as the multibyte encoding,
 | |
| convert the text into a @code{wchar_t} text, select the destination
 | |
| character set as the multibyte encoding, and convert the wide character
 | |
| text to the multibyte (@math{=} destination) character set.
 | |
| 
 | |
| Even if this is possible (which is not guaranteed) it is very tedious
 | |
| work.  It also suffers even more from the other two problems because the
 | |
| locale must be changed repeatedly.
 | |
| @end itemize
 | |
| 
 | |
| The XPG2 standard defines a completely new set of functions, which has
 | |
| none of these limitations.  They are not at all coupled to the selected
 | |
| locales, and they have no constraints on the character sets selected for
 | |
| source and destination.  Only the set of available conversions limits
 | |
| them.  The standard does not specify that any conversion at all must be
 | |
| available.  Such availability is a measure of the quality of the
 | |
| implementation.
 | |
| 
 | |
| In the following text, first the interface to @code{iconv} and then the
 | |
| conversion function will be described.  Comparisons with other
 | |
| implementations will show what obstacles stand in the way of portable
 | |
| applications.  Finally, the implementation is described in so far as might
 | |
| interest the advanced user who wants to extend conversion capabilities.
 | |
| 
 | |
| @menu
 | |
| * Generic Conversion Interface::    Generic Character Set Conversion Interface.
 | |
| * iconv Examples::                  A complete @code{iconv} example.
 | |
| * Other iconv Implementations::     Some Details about other @code{iconv}
 | |
|                                      Implementations.
 | |
| * glibc iconv Implementation::      The @code{iconv} Implementation in the GNU C
 | |
|                                      library.
 | |
| @end menu
 | |
| 
 | |
| @node Generic Conversion Interface
 | |
| @subsection Generic Character Set Conversion Interface
 | |
| 
 | |
| This set of functions follows the traditional cycle of using a resource:
 | |
| open--use--close.  The interface consists of three functions, each of
 | |
| which implements one step.
 | |
| 
 | |
| Before the interfaces are described it is necessary to introduce a
 | |
| data type.  Just like other open--use--close interfaces the functions
 | |
| introduced here work using handles and the @file{iconv.h} header
 | |
| defines a special type for the handles used.
 | |
| 
 | |
| @comment iconv.h
 | |
| @comment XPG2
 | |
| @deftp {Data Type} iconv_t
 | |
| This data type is an abstract type defined in @file{iconv.h}.  The user
 | |
| must not assume anything about the definition of this type; it must be
 | |
| completely opaque.
 | |
| 
 | |
| Objects of this type can be assigned handles for the conversions using
 | |
| the @code{iconv} functions.  The objects themselves need not be freed, but
 | |
| the conversions that the handles stand for must be.
 | |
| @end deftp
 | |
| 
 | |
| @noindent
 | |
| The first step is the function to create a handle.
 | |
| 
 | |
| @comment iconv.h
 | |
| @comment XPG2
 | |
| @deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
 | |
| The @code{iconv_open} function has to be used before starting a
 | |
| conversion.  The two parameters this function takes determine the
 | |
| source and destination character set for the conversion, and if the
 | |
| implementation has the possibility to perform such a conversion, the
 | |
| function returns a handle.
 | |
| 
 | |
| If the wanted conversion is not available, the @code{iconv_open} function
 | |
| returns @code{(iconv_t) -1}.  In this case the global variable
 | |
| @code{errno} can have the following values:
 | |
| 
 | |
| @table @code
 | |
| @item EMFILE
 | |
| The process already has @code{OPEN_MAX} file descriptors open.
 | |
| @item ENFILE
 | |
| The system limit of open files is reached.
 | |
| @item ENOMEM
 | |
| Not enough memory to carry out the operation.
 | |
| @item EINVAL
 | |
| The conversion from @var{fromcode} to @var{tocode} is not supported.
 | |
| @end table
 | |
| 
 | |
| It is not possible to use the same descriptor in different threads to
 | |
| perform independent conversions.  The data structures associated
 | |
| with the descriptor include information about the conversion state.
 | |
| This state must not be corrupted by using the descriptor for different conversions.
 | |
| 
 | |
| An @code{iconv} descriptor is like a file descriptor in that a new
 | |
| descriptor must be created for every use.  The descriptor does not stand for all
 | |
| of the conversions from @var{fromcode} to @var{tocode}.
 | |
| 
 | |
| The GNU C library implementation of @code{iconv_open} has one
 | |
| significant extension to other implementations.  To ease the extension
 | |
| of the set of available conversions, the implementation allows storing
 | |
| the necessary files with data and code in an arbitrary number of
 | |
| directories.  How this extension must be written will be explained below
 | |
| (@pxref{glibc iconv Implementation}).  Here it is only important to say
 | |
| that all directories mentioned in the @code{GCONV_PATH} environment
 | |
| variable are considered only if they contain a file @file{gconv-modules}.
 | |
| These directories need not necessarily be created by the system
 | |
| administrator.  In fact, this extension is introduced to help users
 | |
| writing and using their own, new conversions.  Of course, this does not
 | |
| work for security reasons in SUID binaries; in this case only the system
 | |
| directory is considered and this normally is
 | |
| @file{@var{prefix}/lib/gconv}.  The @code{GCONV_PATH} environment
 | |
| variable is examined exactly once at the first call of the
 | |
| @code{iconv_open} function.  Later modifications of the variable have no
 | |
| effect.
 | |
| 
 | |
| @pindex iconv.h
 | |
| The @code{iconv_open} function was introduced early in the X/Open
 | |
| Portability Guide, @w{version 2}.  It is supported by all commercial
 | |
| Unices as it is required for the Unix branding.  However, the quality and
 | |
| completeness of the implementation varies widely.  The @code{iconv_open}
 | |
| function is declared in @file{iconv.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| The @code{iconv} implementation can associate large data structures with
 | |
| the handle returned by @code{iconv_open}.  Therefore, it is crucial to
 | |
| free all the resources once all conversions are carried out and the
 | |
| conversion is not needed anymore.
 | |
| 
 | |
| @comment iconv.h
 | |
| @comment XPG2
 | |
| @deftypefun int iconv_close (iconv_t @var{cd})
 | |
| The @code{iconv_close} function frees all resources associated with the
 | |
| handle @var{cd}, which must have been returned by a successful call to
 | |
| the @code{iconv_open} function.
 | |
| 
 | |
| If the function call was successful the return value is @math{0}.
 | |
| Otherwise it is @math{-1} and @code{errno} is set appropriately.
 | |
| Defined errors are:
 | |
| 
 | |
| @table @code
 | |
| @item EBADF
 | |
| The conversion descriptor is invalid.
 | |
| @end table
 | |
| 
 | |
| @pindex iconv.h
 | |
| The @code{iconv_close} function was introduced together with the rest
 | |
| of the @code{iconv} functions in XPG2 and is declared in @file{iconv.h}.
 | |
| @end deftypefun
 | |
| 
 | |
| The standard defines only one actual conversion function.  This has,
 | |
| therefore, the most general interface: it allows conversion from one
 | |
| buffer to another.  Conversion from a file to a buffer, vice versa, or
 | |
| even file to file can be implemented on top of it.
 | |
| 
 | |
| @comment iconv.h
 | |
| @comment XPG2
 | |
| @deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
 | |
| @cindex stateful
 | |
| The @code{iconv} function converts the text in the input buffer
 | |
| according to the rules associated with the descriptor @var{cd} and
 | |
| stores the result in the output buffer.  It is possible to call the
 | |
| function for the same text several times in a row since for stateful
 | |
| character sets the necessary state information is kept in the data
 | |
| structures associated with the descriptor.
 | |
| 
 | |
| The input buffer is specified by @code{*@var{inbuf}} and it contains
 | |
| @code{*@var{inbytesleft}} bytes.  The extra indirection is necessary for
 | |
| communicating the used input back to the caller (see below).  It is
 | |
| important to note that the buffer pointer is of type @code{char *} and the
 | |
| length is measured in bytes even if the input text is encoded in wide
 | |
| characters.
 | |
| 
 | |
| The output buffer is specified in a similar way.  @code{*@var{outbuf}}
 | |
| points to the beginning of the buffer with at least
 | |
| @code{*@var{outbytesleft}} bytes room for the result.  The buffer
 | |
| pointer again is of type @code{char *} and the length is measured in
 | |
| bytes.  If @var{outbuf} or @code{*@var{outbuf}} is a null pointer, the
 | |
| conversion is performed but no output is available.
 | |
| 
 | |
| If @var{inbuf} is a null pointer, the @code{iconv} function performs the
 | |
| necessary action to put the state of the conversion into the initial
 | |
| state.  This is obviously a no-op for non-stateful encodings, but if the
 | |
| encoding has a state, such a function call might put some byte sequences
 | |
| in the output buffer, which perform the necessary state changes.  The
 | |
| next call with @var{inbuf} not being a null pointer then simply goes on
 | |
| from the initial state.  It is important that the programmer never makes
 | |
| any assumption as to whether the conversion has to deal with states.
 | |
| Even if the input and output character sets are not stateful, the
 | |
| implementation might still have to keep states.  This is due to the
 | |
| implementation chosen for the GNU C library as it is described below.
 | |
| Therefore an @code{iconv} call to reset the state should always be
 | |
| performed if some protocol requires this for the output text.
 | |
| 
 | |
| The conversion stops for one of three reasons.  The first is that all
 | |
| characters from the input buffer are converted.  This actually can mean
 | |
| two things: either all bytes from the input buffer are consumed or
 | |
| there are some bytes at the end of the buffer that possibly can form a
 | |
| complete character but the input is incomplete.  The second reason for a
 | |
| stop is that the output buffer is full.  And the third reason is that
 | |
| the input contains invalid characters.
 | |
| 
 | |
| In all of these cases the buffer pointers after the last successful
 | |
| conversion, for input and output buffer, are stored in @var{inbuf} and
 | |
| @var{outbuf}, and the available room in each buffer is stored in
 | |
| @var{inbytesleft} and @var{outbytesleft}.
 | |
| 
 | |
| Since the character sets selected in the @code{iconv_open} call can be
 | |
| almost arbitrary, there can be situations where the input buffer contains
 | |
| valid characters, which have no identical representation in the output
 | |
| character set.  The behavior in this situation is undefined.  The
 | |
| @emph{current} behavior of the GNU C library in this situation is to
 | |
| return with an error immediately.  This certainly is not the most
 | |
| desirable solution; therefore, future versions will provide better ones,
 | |
| but they are not yet finished.
 | |
| 
 | |
| If all input from the input buffer is successfully converted and stored
 | |
| in the output buffer, the function returns the number of non-reversible
 | |
| conversions performed.  In all other cases the return value is
 | |
| @code{(size_t) -1} and @code{errno} is set appropriately.  In such cases
 | |
| the value pointed to by @var{inbytesleft} is nonzero.
 | |
| 
 | |
| @table @code
 | |
| @item EILSEQ
 | |
| The conversion stopped because of an invalid byte sequence in the input.
 | |
| After the call, @code{*@var{inbuf}} points at the first byte of the
 | |
| invalid byte sequence.
 | |
| 
 | |
| @item E2BIG
 | |
| The conversion stopped because it ran out of space in the output buffer.
 | |
| 
 | |
| @item EINVAL
 | |
| The conversion stopped because of an incomplete byte sequence at the end
 | |
| of the input buffer.
 | |
| 
 | |
| @item EBADF
 | |
| The @var{cd} argument is invalid.
 | |
| @end table
 | |
| 
 | |
| @pindex iconv.h
 | |
| The @code{iconv} function was introduced in the XPG2 standard and is
 | |
| declared in the @file{iconv.h} header.
 | |
| @end deftypefun
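The three ways a conversion can stop are easiest to see with a small
in-memory conversion.  The following sketch converts a UTF-8 buffer to
wide characters in a single @code{iconv} call.  The helper name
@code{utf8_to_wcs} is hypothetical, and the sketch assumes that the
implementation accepts the names @code{"UTF-8"} and @code{"WCHAR_T"},
as the GNU C Library does.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert the INLEN bytes of UTF-8 text at IN into at most OUTLEN wide
   characters at OUT.  Returns the number of wide characters stored, or
   (size_t) -1 with errno set by iconv_open or iconv (for iconv:
   EILSEQ, EINVAL or E2BIG).
   utf8_to_wcs is a hypothetical helper, not a library function.  */
size_t
utf8_to_wcs (const char *in, size_t inlen, wchar_t *out, size_t outlen)
{
  char *inptr = (char *) in;    /* iconv does not modify the input.  */
  char *outptr = (char *) out;
  size_t outleft = outlen * sizeof (wchar_t);   /* Sizes are in bytes.  */
  size_t rc;
  int saved_errno;
  iconv_t cd;

  assert (in != NULL && out != NULL);

  cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inlen, &outptr, &outleft);
  saved_errno = errno;
  (void) iconv_close (cd);

  if (rc == (size_t) -1)
    {
      /* Report the error from iconv, not one from iconv_close.  */
      errno = saved_errno;
      return (size_t) -1;
    }
  return (size_t) ((wchar_t *) outptr - out);
}
```

With this helper, a truncated sequence such as the single byte
@code{0xC3} is reported as @code{EINVAL}, a byte that can never occur
in UTF-8 such as @code{0xFF} as @code{EILSEQ}, and an output array
that is too small as @code{E2BIG}.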
 | |
| 
 | |
| The definition of the @code{iconv} function is quite good overall.  It
 | |
| provides quite flexible functionality.  The only problems lie in the
 | |
| boundary cases, which are incomplete byte sequences at the end of the
 | |
| input buffer and invalid input.  A third problem, which is not really
 | |
| a design problem, is the way conversions are selected.  The standard
 | |
| does not say anything about the legitimate names or about a minimal set of
 | |
| available conversions.  We will see how this negatively impacts other
 | |
| implementations, as demonstrated below.
 | |
| 
 | |
| @node iconv Examples
 | |
| @subsection A complete @code{iconv} example
 | |
| 
 | |
| The example below features a solution for a common problem.  Given that
 | |
| one knows the internal encoding used by the system for @code{wchar_t}
 | |
| strings, one often is in the position to read text from a file and store
 | |
| it in wide character buffers.  One can do this using @code{mbsrtowcs},
 | |
| but then we run into the problems discussed above.
 | |
| 
 | |
| @smallexample
 | |
| int
 | |
| file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
 | |
| @{
 | |
|   char inbuf[BUFSIZ];
 | |
|   size_t insize = 0;
 | |
|   char *wrptr = (char *) outbuf;
 | |
|   int result = 0;
 | |
|   iconv_t cd;
 | |
| 
 | |
|   cd = iconv_open ("WCHAR_T", charset);
 | |
|   if (cd == (iconv_t) -1)
 | |
|     @{
 | |
|       /* @r{Something went wrong.}  */
 | |
|       if (errno == EINVAL)
 | |
|         error (0, 0, "conversion from '%s' to wchar_t not available",
 | |
|                charset);
 | |
|       else
 | |
|         perror ("iconv_open");
 | |
| 
 | |
|       /* @r{Terminate the output string.}  */
 | |
|       *outbuf = L'\0';
 | |
| 
 | |
|       return -1;
 | |
|     @}
 | |
| 
 | |
|   while (avail > 0)
 | |
|     @{
 | |
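|       ssize_t nread;
 | |
|       size_t nconv;
 | |
|       char *inptr = inbuf;
 | |
| 
 | |
|       /* @r{Read more input.}  */
 | |
|       nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
 | |
|       if (nread == -1)
 | |
|         @{
 | |
|           /* @r{The read failed; report it and stop.}  */
 | |
|           perror ("read");
 | |
|           result = -1;
 | |
|           break;
 | |
|         @}
 | |
|       if (nread == 0)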
 | |
|         @{
 | |
|           /* @r{When we come here the file is completely read.}
 | |
|              @r{This still could mean there are some unused}
 | |
|              @r{characters in the @code{inbuf}.  Put them back.}  */
 | |
|           if (lseek (fd, -insize, SEEK_CUR) == -1)
 | |
|             result = -1;
 | |
| 
 | |
|           /* @r{Now write out the byte sequence to get into the}
 | |
|              @r{initial state if this is necessary.}  */
 | |
|           iconv (cd, NULL, NULL, &wrptr, &avail);
 | |
| 
 | |
|           break;
 | |
|         @}
 | |
|       insize += nread;
 | |
| 
 | |
|       /* @r{Do the conversion.}  */
 | |
|       nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
 | |
|       if (nconv == (size_t) -1)
 | |
|         @{
 | |
|           /* @r{Not everything went right.  It might only be}
 | |
|              @r{an unfinished byte sequence at the end of the}
 | |
|              @r{buffer.  Or it is a real problem.}  */
 | |
|           if (errno == EINVAL)
 | |
|             /* @r{This is harmless.  Simply move the unused}
 | |
|                @r{bytes to the beginning of the buffer so that}
 | |
|                @r{they can be used in the next round.}  */
 | |
|             memmove (inbuf, inptr, insize);
 | |
|           else
 | |
|             @{
 | |
|               /* @r{It is a real problem.  Maybe we ran out of}
 | |
|                  @r{space in the output buffer or we have invalid}
 | |
|                  @r{input.  In any case back the file pointer to}
 | |
|                  @r{the position of the last processed byte.}  */
 | |
|               lseek (fd, -insize, SEEK_CUR);
 | |
|               result = -1;
 | |
|               break;
 | |
|             @}
 | |
|         @}
 | |
|     @}
 | |
| 
 | |
|   /* @r{Terminate the output string.}  */
 | |
|   if (avail >= sizeof (wchar_t))
 | |
|     *((wchar_t *) wrptr) = L'\0';
 | |
| 
 | |
|   if (iconv_close (cd) != 0)
 | |
|     perror ("iconv_close");
 | |
| 
 | |
|   return result == -1 ? -1 : (wchar_t *) wrptr - outbuf;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| @cindex stateful
 | |
| This example shows the most important aspects of using the @code{iconv}
 | |
| functions.  It shows how successive calls to @code{iconv} can be used to
 | |
| convert large amounts of text.  The user does not have to care about
 | |
| stateful encodings as the functions take care of everything.
 | |
| 
 | |
| An interesting point is the case where @code{iconv} returns an error and
 | |
| @code{errno} is set to @code{EINVAL}.  This is not really an error in the
 | |
| transformation.  It can happen whenever the input character set contains
 | |
| byte sequences of more than one byte for some character and texts are not
 | |
| processed in one piece.  In this case there is a chance that a multibyte
 | |
| sequence is cut.  The caller can then simply read the remainder of the
 | |
| text and feed the offending bytes together with new characters from the
 | |
| input to @code{iconv} and continue the work.  The internal state kept in
 | |
| the descriptor is @emph{not} unspecified after such an event as is the
 | |
| case with the conversion functions from the @w{ISO C} standard.
 | |
| 
 | |
| The example also shows the problem of using wide character strings with
 | |
| @code{iconv}.  As explained in the description of the @code{iconv}
 | |
| function above, the function always takes a pointer to a @code{char}
 | |
| array and the available space is measured in bytes.  In the example, the
 | |
| output buffer is a wide character buffer; therefore, we use a local
 | |
| variable @var{wrptr} of type @code{char *}, which is used in the
 | |
| @code{iconv} calls.
 | |
| 
 | |
| This looks rather innocent but can lead to problems on platforms that
 | |
| have tight restrictions on alignment.  Therefore the caller of @code{iconv}
 | |
| has to make sure that the pointers passed are suitable for access of
 | |
| characters from the appropriate character set.  Since, in the
 | |
| above case, the input parameter to the function is a @code{wchar_t}
 | |
| pointer, this is the case (unless the user violates alignment when
 | |
| computing the parameter).  But in other situations, especially when
 | |
| writing generic functions where one does not know what type of character
 | |
| set one uses and, therefore, treats text as a sequence of bytes, it might
 | |
| become tricky.
 | |
| 
 | |
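As a small sketch of the alignment point, the following function writes into a buffer that is declared as a @code{wchar_t} array, so the @code{char *} handed to @code{iconv} is suitably aligned by construction.  The function name @code{utf8_to_wchar} is hypothetical, and the example assumes the implementation accepts the character set names @code{WCHAR_T} and @code{UTF-8}.

```c
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Convert the INLEN bytes of UTF-8 text at IN into the array OUT, which
   has room for OUTCOUNT wide characters.  Returns the number of wide
   characters written or (size_t) -1 on error.  */
size_t
utf8_to_wchar (const char *in, size_t inlen, wchar_t *out, size_t outcount)
{
  iconv_t cd = iconv_open ("WCHAR_T", "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* iconv takes a non-const pointer but does not modify the input.  */
  char *inptr = (char *) in;
  size_t inleft = inlen;

  /* OUT is a real wchar_t array, so this char pointer is aligned for
     wchar_t stores.  The available space is measured in bytes.  */
  char *wrptr = (char *) out;
  size_t outleft = outcount * sizeof (wchar_t);

  size_t rc = iconv (cd, &inptr, &inleft, &wrptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;

  return (size_t) ((wchar_t *) wrptr - out);
}
```

A generic routine that only sees a byte buffer has no such guarantee and would have to check or enforce the alignment itself before passing the pointer on.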
| @node Other iconv Implementations
 | |
| @subsection Some Details about other @code{iconv} Implementations
 | |
| 
 | |
| This is not really the place to discuss the @code{iconv} implementation
 | |
| of other systems but it is necessary to know a bit about them to write
 | |
| portable programs.  The above mentioned problems with the specification
 | |
| of the @code{iconv} functions can lead to portability issues.
 | |
| 
 | |
| The first thing to notice is that, due to the large number of character
 | |
| sets in use, it is certainly not practical to encode the conversions
 | |
| directly in the C library.  Therefore, the conversion information must
 | |
| come from files outside the C library.  This is usually done in one or
 | |
| both of the following ways:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| The C library contains a set of generic conversion functions that can
 | |
| read the needed conversion tables and other information from data files.
 | |
| These files get loaded when necessary.
 | |
| 
 | |
| This solution is problematic as it requires a great deal of effort to
 | |
| apply to all character sets (potentially an infinite set).  The
 | |
| differences in the structure of the different character sets are so large
 | |
| that many different variants of the table-processing functions must be
 | |
| developed.  In addition, the generic nature of these functions makes them
 | |
| slower than specifically implemented functions.
 | |
| 
 | |
| @item
 | |
| The C library only contains a framework that can dynamically load
 | |
| object files and execute the conversion functions contained therein.
 | |
| 
 | |
| This solution provides much more flexibility.  The C library itself
 | |
| contains only very little code and therefore reduces the general memory
 | |
| footprint.  Also, with a documented interface between the C library and
 | |
| the loadable modules it is possible for third parties to extend the set
 | |
| of available conversion modules.  A drawback of this solution is that
 | |
| dynamic loading must be available.
 | |
| @end itemize
 | |
| 
 | |
| Some implementations in commercial Unices implement a mixture of these
 | |
| possibilities; the majority implement only the second solution.  Using
 | |
| loadable modules moves the code out of the library itself and keeps
 | |
| the door open for extensions and improvements, but this design is also
 | |
| limiting on some platforms since not many platforms support dynamic
 | |
| loading in statically linked programs.  On platforms without this
 | |
| capability it is therefore not possible to use this interface in
 | |
| statically linked programs.  The GNU C library has, on ELF platforms, no
 | |
| problems with dynamic loading in these situations; therefore, this
 | |
| point is moot.  The danger is that one gets accustomed to this
 | |
| situation and forgets about the restrictions on other systems.
 | |
| 
 | |
| A second thing to know about other @code{iconv} implementations is that
 | |
| the number of available conversions is often very limited.  Some
 | |
| implementations provide, in the standard release (not special
 | |
| international or developer releases), at most 100 to 200 conversion
 | |
| possibilities.  This does not mean 200 different character sets are
 | |
| supported; for example, conversions from one character set to a set of 10
 | |
| others might count as 10 conversions.  Together with the other direction
 | |
| this makes 20 conversion possibilities used up by one character set.  One
 | |
| can imagine the thin coverage these platforms provide.  Some Unix vendors
 | |
| even provide only a handful of conversions, which renders them useless for
 | |
| almost all uses.
 | |
| 
 | |
| This directly leads to a third and probably the most problematic point.
 | |
| With the way the @code{iconv} conversion functions are implemented on all
 | |
| known Unix systems, the availability of the conversion functions from
 | |
| character set @math{@cal{A}} to @math{@cal{B}} and the conversion from
 | |
| @math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
 | |
| conversion from @math{@cal{A}} to @math{@cal{C}} is available.
 | |
| 
 | |
| This might not seem unreasonable or problematic at first, but it is
 | |
| quite a big problem, as one will notice shortly after hitting it.  To show
 | |
| the problem, assume we write a program that has to convert from
 | |
| @math{@cal{A}} to @math{@cal{C}}.  A call like
 | |
| 
 | |
| @smallexample
 | |
| cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| fails according to the assumption above.  But what does the program
 | |
| do now?  The conversion is necessary; therefore, simply giving up is not
 | |
| an option.
 | |
| 
 | |
| This is a nuisance.  The @code{iconv} function should take care of this.
 | |
| But how should the program proceed from here on?  If it tries to convert
 | |
| to character set @math{@cal{B}} first, the two @code{iconv_open}
 | |
| calls
 | |
| 
 | |
| @smallexample
 | |
| cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| and
 | |
| 
 | |
| @smallexample
 | |
| cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| will succeed, but how to find @math{@cal{B}}?
 | |
| 
 | |
| Unfortunately, the answer is: there is no general solution.  On some
 | |
| systems guessing might help.  On those systems most character sets can
 | |
| convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.  Besides
 | |
| this, only some very system-specific methods can help.  Since the
 | |
| conversion functions come from loadable modules and these modules must
 | |
| be stored somewhere in the filesystem, one @emph{could} try to find them
 | |
| and determine from the available files which conversions are available
 | |
| and whether there is an indirect route from @math{@cal{A}} to
 | |
| @math{@cal{C}}.
 | |
| 
 | |
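The guessing strategy mentioned above can be sketched as follows: try the direct conversion first and, if that fails, try a route through a pivot character set such as UTF-8.  The type @code{struct conv_route} and the functions @code{open_route} and @code{close_route} are hypothetical names used only for this illustration.  Whether a usable pivot exists at all depends on the system.

```c
#include <iconv.h>
#include <stddef.h>

struct conv_route
{
  iconv_t first;    /* Source to target, or source to pivot.  */
  iconv_t second;   /* Pivot to target, or (iconv_t) -1 if unused.  */
};

/* Open a conversion from FROM to TO.  Returns 1 if a direct conversion
   is available, 2 if the route goes through PIVOT, and 0 if neither
   works.  */
int
open_route (struct conv_route *r, const char *to, const char *from,
            const char *pivot)
{
  r->second = (iconv_t) -1;
  r->first = iconv_open (to, from);
  if (r->first != (iconv_t) -1)
    return 1;

  /* No direct conversion: guess that both sets convert via PIVOT.  */
  r->first = iconv_open (pivot, from);
  if (r->first == (iconv_t) -1)
    return 0;

  r->second = iconv_open (to, pivot);
  if (r->second == (iconv_t) -1)
    {
      iconv_close (r->first);
      r->first = (iconv_t) -1;
      return 0;
    }
  return 2;
}

/* Release whatever descriptors open_route obtained.  */
void
close_route (struct conv_route *r)
{
  if (r->first != (iconv_t) -1)
    iconv_close (r->first);
  if (r->second != (iconv_t) -1)
    iconv_close (r->second);
  r->first = r->second = (iconv_t) -1;
}
```

With a two-step route the caller has to run the text through both descriptors in turn, using an intermediate buffer.  A return value of 0 only means that this particular guess failed, not that no indirect route exists.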
| This example shows one of the design errors of @code{iconv} mentioned
 | |
| above.  It should at least be possible to determine the list of available
 | |
| conversions programmatically so that if @code{iconv_open} says there is no
 | |
| such conversion, one could make sure this also is true for indirect
 | |
| routes.
 | |
| 
 | |
| @node glibc iconv Implementation
 | |
| @subsection The @code{iconv} Implementation in the GNU C library
 | |
| 
 | |
| After reading about the problems of @code{iconv} implementations in the
 | |
| last section it is certainly good to note that the implementation in
 | |
| the GNU C library has none of the problems mentioned above.  What
 | |
| follows is a step-by-step analysis of the points raised above.  The
 | |
| evaluation is based on the current state of the development (as of
 | |
| January 1999).  The development of the @code{iconv} functions is not
 | |
| complete, but basic functionality has solidified.
 | |
| 
 | |
| The GNU C library's @code{iconv} implementation uses shared loadable
 | |
| modules to implement the conversions.  A very small number of
 | |
| conversions are built into the library itself but these are only rather
 | |
| trivial conversions.
 | |
| 
 | |
| All the benefits of loadable modules are available in the GNU C library
 | |
| implementation.  This is especially appealing since the interface is
 | |
| well documented (see below), and it, therefore, is easy to write new
 | |
| conversion modules.  The drawback of using loadable objects is not a
 | |
| problem in the GNU C library, at least on ELF systems.  Since the
 | |
| library is able to load shared objects even in statically linked
 | |
| binaries, static linking need not be forbidden in case one wants to use
 | |
| @code{iconv}.
 | |
| 
 | |
| The second mentioned problem is the number of supported conversions.
 | |
| Currently, the GNU C library supports more than 150 character sets.  The
 | |
| way the implementation is designed the number of supported conversions
 | |
| is greater than 22350 (@math{150} times @math{149}).  If any conversion
 | |
| from or to a character set is missing, it can be added easily.
 | |
| 
 | |
| Impressive as it may be, this high number is due to the
 | |
| fact that the GNU C library implementation of @code{iconv} does not have
 | |
| the third problem mentioned above (i.e., whenever there is a conversion
 | |
| from a character set @math{@cal{A}} to @math{@cal{B}} and from
 | |
| @math{@cal{B}} to @math{@cal{C}} it is always possible to convert from
 | |
| @math{@cal{A}} to @math{@cal{C}} directly).  If @code{iconv_open}
 | |
| returns an error and sets @code{errno} to @code{EINVAL}, there is no
 | |
| known way, directly or indirectly, to perform the wanted conversion.
 | |
| 
 | |
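A program can test for this condition before committing to a conversion.  The following is a minimal sketch; the function name @code{conversion_available} is made up for this example.

```c
#include <errno.h>
#include <iconv.h>

/* Returns 1 if a conversion from FROM to TO can be opened, 0 if
   iconv_open reports EINVAL (conversion not available), and -1 for
   any other failure such as resource exhaustion.  */
int
conversion_available (const char *to, const char *from)
{
  iconv_t cd = iconv_open (to, from);

  if (cd != (iconv_t) -1)
    {
      iconv_close (cd);
      return 1;
    }
  return errno == EINVAL ? 0 : -1;
}
```

On a system whose @code{iconv_open} already considers indirect routes, a result of 0 means there is nothing more to try.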
| @cindex triangulation
 | |
| Triangulation is achieved by providing for each character set a
 | |
| conversion from and to UCS-4 encoded @w{ISO 10646}.  Using @w{ISO 10646}
 | |
| as an intermediate representation it is possible to @dfn{triangulate}
 | |
| (i.e., convert with an intermediate representation).
 | |
| 
 | |
| There is no inherent requirement to provide a conversion to @w{ISO
 | |
| 10646} for a new character set, and it is also possible to provide other
 | |
| conversions where neither source nor destination character set is @w{ISO
 | |
| 10646}.  The existing set of conversions is simply meant to cover all
 | |
| conversions that might be of interest.
 | |
| 
 | |
| @cindex ISO-2022-JP
 | |
| @cindex EUC-JP
 | |
| All currently available conversions use the triangulation method above,
 | |
| making conversions run unnecessarily slowly.  If, for example, somebody
 | |
| often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
 | |
| would involve direct conversion between the two character sets, skipping
 | |
| the detour through @w{ISO 10646}.  The two character sets of interest
 | |
| are much more similar to each other than to @w{ISO 10646}.
 | |
| 
 | |
| In such a situation one easily can write a new conversion and provide it
 | |
| as a better alternative.  The GNU C library @code{iconv} implementation
 | |
| would automatically use the module implementing the conversion if it is
 | |
| specified to be more efficient.
 | |
| 
 | |
| @subsubsection Format of @file{gconv-modules} files
 | |
| 
 | |
| All information about the available conversions comes from a file named
 | |
| @file{gconv-modules}, which can be found in any of the directories along
 | |
| the @code{GCONV_PATH}.  The @file{gconv-modules} files are line-oriented
 | |
| text files, where each of the lines has one of the following formats:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| If the first non-whitespace character is a @kbd{#} the line contains only
 | |
| comments and is ignored.
 | |
| 
 | |
| @item
 | |
| Lines starting with @code{alias} define an alias name for a character
 | |
| set.  Two more words are expected on the line.  The first word
 | |
| defines the alias name, and the second defines the original name of the
 | |
| character set.  The effect is that it is possible to use the alias name
 | |
| in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
 | |
| achieve the same result as when using the real character set name.
 | |
| 
 | |
| This is quite important as a character set often has many different
 | |
| names.  There is normally an official name but this need not correspond to
 | |
| the most popular name.  Besides this, many character sets have special
 | |
| names that are somehow constructed.  For example, all character sets
 | |
| specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}}
 | |
| where @var{nnn} is the registration number.  This allows programs that
 | |
| know about the registration number to construct character set names and
 | |
| use them in @code{iconv_open} calls.  More on the available names and
 | |
| aliases follows below.
 | |
| 
 | |
| @item
 | |
| Lines starting with @code{module} introduce an available conversion
 | |
| module.  These lines must contain three or four more words.
 | |
| 
 | |
| The first word specifies the source character set, the second word the
 | |
| destination character set of conversion implemented in this module, and
 | |
| the third word is the name of the loadable module.  The filename is
 | |
| constructed by appending the usual shared object suffix (normally
 | |
| @file{.so}) and this file is then supposed to be found in the same
 | |
| directory the @file{gconv-modules} file is in.  The last word on the line,
 | |
| which is optional, is a numeric value representing the cost of the
 | |
| conversion.  If this word is missing, a cost of @math{1} is assumed.  The
 | |
| numeric value itself does not matter that much; what counts are the
 | |
| relative values of the sums of costs for all possible conversion paths.
 | |
| Below is a more precise description of the use of the cost value.
 | |
| @end itemize
 | |
| 
 | |
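The line formats listed above can be illustrated with a small parser.  This is a sketch written for this text, not the code the library uses: the names @code{parse_gconv_line}, @code{struct gconv_line}, and @code{enum line_kind} are hypothetical, fields are limited to 63 characters, and treating empty lines as ignorable is an assumption.

```c
#include <stdio.h>
#include <string.h>

enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct gconv_line
{
  enum line_kind kind;
  char a[64];      /* Alias name, or source character set.  */
  char b[64];      /* Original name, or destination character set.  */
  char file[64];   /* Module name without suffix (module lines only).  */
  int cost;        /* Module lines only; 1 if not given.  */
};

/* Classify one line of a gconv-modules file and split it into words.  */
struct gconv_line
parse_gconv_line (const char *line)
{
  struct gconv_line r = { LINE_INVALID, "", "", "", 1 };
  char kw[16] = "";
  int n;

  /* A '#' as the first non-whitespace character starts a comment.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\n' || *line == '\0')
    {
      r.kind = LINE_IGNORED;
      return r;
    }

  /* The cost is optional; r.cost keeps its default of 1 if the last
     conversion does not match.  */
  n = sscanf (line, "%15s %63s %63s %63s %d",
              kw, r.a, r.b, r.file, &r.cost);

  if (n == 3 && strcmp (kw, "alias") == 0)
    r.kind = LINE_ALIAS;
  else if ((n == 4 || n == 5) && strcmp (kw, "module") == 0)
    r.kind = LINE_MODULE;
  return r;
}
```

For a @code{module} line the shared object to load would then be named by appending the shared object suffix to the @code{file} field.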
| Consider again the example above, where one has written a module to directly
 | |
| convert from ISO-2022-JP to EUC-JP and back.  All that has to be done is
 | |
| to put the new module, let its name be @file{ISO2022JP-EUCJP.so}, in a directory
 | |
| and add a file @file{gconv-modules} with the following content in the
 | |
| same directory:
 | |
| 
 | |
| @smallexample
 | |
| module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
 | |
| module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
 | |
| @end smallexample
 | |
| 
 | |
| To see why this is sufficient, it is necessary to understand how the
 | |
| conversion used by @code{iconv} (and described in the descriptor) is
 | |
| selected.  The approach to this problem is quite simple.
 | |
| 
 | |
| At the first call of the @code{iconv_open} function the program reads
 | |
| all available @file{gconv-modules} files and builds up two tables: one
 | |
| containing all the known aliases and another that contains the
 | |
| information about the conversions and which shared object implements
 | |
| them.
 | |
| 
 | |
| @subsubsection Finding the conversion path in @code{iconv}
 | |
| 
 | |
| The set of available conversions forms a directed graph with weighted
 | |
| edges.  The weights on the edges are the costs specified in the
 | |
| @file{gconv-modules} files.  The @code{iconv_open} function uses an
 | |
| algorithm suitable for searching for the best path in such a graph and so
 | |
| constructs a list of conversions that must be performed in succession
 | |
| to get the transformation from the source to the destination character
 | |
| set.
 | |
| 
 | |
| Explaining why the above @file{gconv-modules} file allows the
 | |
| @code{iconv} implementation to resolve the specific ISO-2022-JP to
 | |
| EUC-JP conversion module instead of the conversion coming with the
 | |
| library itself is straightforward.  Since the latter conversion takes two
 | |
| steps (from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
 | |
| EUC-JP), the cost is @math{1+1 = 2}.  The above @file{gconv-modules}
 | |
| file, however, specifies that the new conversion module can perform this
 | |
| conversion with only the cost of @math{1}.
 | |
| 
 | |
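The cost comparison above can be reproduced with a small shortest-path computation.  The text does not say which algorithm @code{iconv_open} uses, so the sketch below simply applies Bellman-Ford style relaxation as a stand-in.  The names @code{struct conv_edge} and @code{cheapest_route} and the limit of 16 nodes are assumptions made for this example.

```c
#include <limits.h>
#include <stddef.h>

/* One available conversion: an edge of the graph with its cost.  */
struct conv_edge
{
  int from;
  int to;
  int cost;
};

/* Returns the lowest total cost of converting SRC to DST using EDGES,
   or -1 if there is no route.  Character sets are numbered
   0 .. NNODES-1, and NNODES may be at most 16.  */
int
cheapest_route (const struct conv_edge *edges, size_t nedges,
                int nnodes, int src, int dst)
{
  int dist[16];
  int pass;
  size_t e;

  if (nnodes < 1 || nnodes > 16
      || src < 0 || src >= nnodes || dst < 0 || dst >= nnodes)
    return -1;

  for (pass = 0; pass < nnodes; ++pass)
    dist[pass] = INT_MAX;
  dist[src] = 0;

  /* With positive costs, NNODES-1 relaxation passes are sufficient.  */
  for (pass = 1; pass < nnodes; ++pass)
    for (e = 0; e < nedges; ++e)
      {
        int f = edges[e].from;
        int t = edges[e].to;

        if (f < 0 || f >= nnodes || t < 0 || t >= nnodes)
          continue;   /* Skip malformed edges.  */
        if (dist[f] != INT_MAX && dist[f] + edges[e].cost < dist[t])
          dist[t] = dist[f] + edges[e].cost;
      }

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With only the conversions through the intermediate representation, the route from ISO-2022-JP to EUC-JP costs 2.  Adding the direct edge with cost 1 makes the direct module the cheapest route, which is why it would be selected.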
| A mysterious item about the @file{gconv-modules} file above (and also
 | |
| the file coming with the GNU C library) are the names of the character
 | |
| sets specified in the @code{module} lines.  Why do almost all the names
 | |
| end in @code{//}?  And this is not all: the names can actually be
 | |
| regular expressions.  At this point in time this mystery should not be
 | |
| revealed, unless you have the relevant spell-casting materials: ashes
 | |
| from an original @w{DOS 6.2} boot disk burnt in effigy, a crucifix
 | |
| blessed by St.@: Emacs, assorted herbal roots from Central America, sand
 | |
| from Cebu, etc.  Sorry!  @strong{The part of the implementation where
 | |
| this is used is not yet finished.  For now please simply follow the
 | |
| existing examples.  It'll become clearer once it is. --drepper}
 | |
| 
 | |
| A last remark about the @file{gconv-modules} is about the names not
 | |
| ending with @code{//}.  A character set named @code{INTERNAL} is often
 | |
| mentioned.  From the discussion above and the chosen name it should have
 | |
| become clear that this is the name for the representation used in the
 | |
| intermediate step of the triangulation.  We have said that this is UCS-4
 | |
| but actually that is not quite right.  The UCS-4 specification also
 | |
| includes the specification of the byte ordering used.  Since a UCS-4 value
 | |
| consists of four bytes, a stored value is affected by byte ordering.  The
 | |
| internal representation is @emph{not} the same as UCS-4 in case the byte
 | |
| ordering of the processor (or at least the running process) is not the
 | |
| same as the one required for UCS-4.  This is done for performance reasons
 | |
| as one does not want to perform unnecessary byte-swapping operations if
 | |
| one is not interested in actually seeing the result in UCS-4.  To avoid
 | |
| trouble with endianness, the internal representation consistently is named
 | |
| @code{INTERNAL} even on big-endian systems where the representations are
 | |
| identical.
 | |
| 
 | |
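The difference between UCS-4 and a host-order representation can be made concrete with a few lines of code.  The sketch assumes, as the remark about big-endian systems above implies, that UCS-4 stores the most significant byte first.  The function names are invented for the example, and the host-order store only models the idea behind @code{INTERNAL}.

```c
#include <stdint.h>
#include <string.h>

/* Store C as UCS-4: four bytes, most significant byte first.  */
void
store_ucs4 (unsigned char out[4], uint32_t c)
{
  out[0] = (unsigned char) (c >> 24);
  out[1] = (unsigned char) (c >> 16);
  out[2] = (unsigned char) (c >> 8);
  out[3] = (unsigned char) c;
}

/* Store C in the byte order of the running process, with no swapping.  */
void
store_host_order (unsigned char out[4], uint32_t c)
{
  memcpy (out, &c, 4);
}

/* Returns nonzero if the host stores the most significant byte first.  */
int
host_is_big_endian (void)
{
  const uint32_t probe = 1;
  unsigned char first;

  memcpy (&first, &probe, 1);
  return first == 0;
}
```

On a little-endian host the two stores produce different byte sequences for the same character, so text in the host-order form must be byte-swapped before it can be presented as UCS-4.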
| @subsubsection @code{iconv} module data structures
 | |
| 
 | |
| So far this section has described how modules are located and considered
 | |
| to be used.  What remains to be described is the interface of the modules
 | |
| so that one can write new ones.  This section describes the interface as
 | |
| it is in use in January 1999.  The interface will change a bit in the
 | |
| future but, with luck, only in an upwardly compatible way.
 | |
| 
 | |
| The definitions necessary to write new modules are publicly available
 | |
| in the non-standard header @file{gconv.h}.  The following text,
 | |
| therefore, describes the definitions from this header file.  First,
 | |
| however, it is necessary to get an overview.
 | |
| 
 | |
| From the perspective of the user of @code{iconv} the interface is quite
 | |
| simple: the @code{iconv_open} function returns a handle that can be used
 | |
| in calls to @code{iconv}, and finally the handle is freed with a call to
 | |
| @code{iconv_close}.  The problem is that the handle has to be able to
 | |
| represent the possibly long sequences of conversion steps and also the
 | |
| state of each conversion since the handle is all that is passed to the
 | |
| @code{iconv} function.  Therefore, the data structures are really the
 | |
| elements necessary to understand the implementation.
 | |
| 
 | |
| We need two different kinds of data structures.  The first describes the
 | |
| conversion and the second describes the state etc.  There are really two
 | |
| type definitions like this in @file{gconv.h}.
 | |
| @pindex gconv.h
 | |
| 
 | |
| @comment gconv.h
 | |
| @comment GNU
 | |
| @deftp {Data type} {struct __gconv_step}
 | |
| This data structure describes one conversion a module can perform.  For
 | |
| each function in a loaded module with conversion functions there is
 | |
| exactly one object of this type.  This object is shared by all users of
 | |
| the conversion (i.e., this object does not contain any information
 | |
| corresponding to an actual conversion; it only describes the conversion
 | |
| itself).
 | |
| 
 | |
| @table @code
 | |
| @item struct __gconv_loaded_object *__shlib_handle
 | |
| @itemx const char *__modname
 | |
| @itemx int __counter
 | |
| All these elements of the structure are used internally in the C library
 | |
| to coordinate loading and unloading the shared object.  One must not expect any
 | |
| of the other elements to be available or initialized.
 | |
| 
 | |
| @item const char *__from_name
 | |
| @itemx const char *__to_name
 | |
| @code{__from_name} and @code{__to_name} contain the names of the source and
 | |
| destination character sets.  They can be used to identify the actual
 | |
| conversion to be carried out since one module might implement conversions
 | |
| for more than one character set and/or direction.
 | |
| 
 | |
| @item gconv_fct __fct
 | |
| @itemx gconv_init_fct __init_fct
 | |
| @itemx gconv_end_fct __end_fct
 | |
| These elements contain pointers to the functions in the loadable module.
 | |
| The interface will be explained below.
 | |
| 
 | |
| @item int __min_needed_from
 | |
| @itemx int __max_needed_from
 | |
| @itemx int __min_needed_to
 | |
| @itemx int __max_needed_to
 | |
| These values have to be supplied in the init function of the module.  The
 | |
| @code{__min_needed_from} value specifies how many bytes a character of
 | |
| the source character set at least needs.  The @code{__max_needed_from}
 | |
| specifies the maximum value that also includes possible shift sequences.
 | |
| 
 | |
| The @code{__min_needed_to} and @code{__max_needed_to} values serve the
 | |
| same purpose as @code{__min_needed_from} and @code{__max_needed_from} but
 | |
| this time for the destination character set.
 | |
| 
 | |
| It is crucial that these values be accurate since otherwise the
 | |
| conversion functions will have problems or not work at all.
 | |
| 
 | |
| @item int __stateful
 | |
| This element must also be initialized by the init function.
 | |
| @code{__stateful} is nonzero if the source character set is stateful.
 | |
| Otherwise it is zero.
 | |
| 
 | |
| @item void *__data
 | |
| This element can be used freely by the conversion functions in the
 | |
| module.  @code{__data} can be used to communicate extra information
 | |
| from one call to another.  @code{__data} need not be initialized if
 | |
| not needed at all.  If the @code{__data} element is assigned a pointer
 | |
| to dynamically allocated memory (presumably in the init function) it has
 | |
| to be made sure that the end function deallocates the memory.  Otherwise
 | |
| the application will leak memory.
 | |
| 
 | |
| It is important to be aware that this data structure is shared by all
 | |
| users of this specific conversion and therefore the @code{__data}
 | |
| element must not contain data specific to one specific use of the
 | |
| conversion function.
 | |
| @end table
 | |
| @end deftp
 | |
| 
 | |
| @comment gconv.h
 | |
| @comment GNU
 | |
| @deftp {Data type} {struct __gconv_step_data}
 | |
| This is the data structure that contains the information specific to
 | |
| each use of the conversion functions.
 | |
| 
 | |
| 
 | |
| @table @code
 | |
| @item char *__outbuf
 | |
| @itemx char *__outbufend
 | |
| These elements specify the output buffer for the conversion step.  The
 | |
| @code{__outbuf} element points to the beginning of the buffer, and
 | |
| @code{__outbufend} points to the byte following the last byte in the
 | |
| buffer.  The conversion function must not assume anything about the size
 | |
| of the buffer but it can be safely assumed that there is room for at
 | |
| least one complete character in the output buffer.
 | |
| 
 | |
| Once the conversion is finished, if the conversion is the last step, the
 | |
| @code{__outbuf} element must be modified to point after the last byte
 | |
| written into the buffer to signal how much output is available.  If this
 | |
| conversion step is not the last one, the element must not be modified.
 | |
| The @code{__outbufend} element must not be modified.
 | |
| 
 | |
| @item int __is_last
 | |
| This element is nonzero if this conversion step is the last one.  This
 | |
| information is necessary for the recursion.  See the description of the
 | |
| conversion function internals below.  This element must never be
 | |
| modified.
 | |
| 
 | |
| @item int __invocation_counter
 | |
| The conversion function can use this element to see how many calls of
 | |
| the conversion function already happened.  Some character sets require a
 | |
| certain prolog when generating output, and by comparing this value with
 | |
| zero, one can find out whether it is the first call and whether,
 | |
| therefore, the prolog should be emitted.  This element must never be
 | |
| modified.
 | |
| 
 | |
| @item int __internal_use
 | |
| This element is another one rarely used but needed in certain
 | |
| situations.  It is assigned a nonzero value in case the conversion
 | |
| functions are used to implement @code{mbsrtowcs} et al.@: (i.e., the
 | |
| function is not used directly through the @code{iconv} interface).
 | |
| 
 | |
| This sometimes makes a difference as it is expected that the
 | |
| @code{iconv} functions are used to translate entire texts while the
 | |
| @code{mbsrtowcs} functions are normally used only to convert single
 | |
| strings and might be used multiple times to convert entire texts.
 | |
| 
 | |
| But in this situation we would have a problem complying with some rules of
 | |
| the character set specification.  Some character sets require a prolog,
 | |
| which must appear exactly once for an entire text.  If a number of
 | |
| @code{mbsrtowcs} calls are used to convert the text, only the first call
 | |
| must add the prolog.  However, because there is no communication between the
 | |
| different calls of @code{mbsrtowcs}, the conversion functions have no
 | |
| possibility to find this out.  The situation is different for sequences
 | |
| of @code{iconv} calls since the handle allows access to the needed
 | |
| information.
 | |
| 
 | |
| The @code{__internal_use} element is mostly used together with
 | |
| @code{__invocation_counter} as follows:
 | |
| 
 | |
| @smallexample
 | |
| if (!data->__internal_use
 | |
|      && data->__invocation_counter == 0)
 | |
|   /* @r{Emit prolog.}  */
 | |
|   @dots{}
 | |
| @end smallexample
 | |
| 
 | |
| This element must never be modified.
 | |
| 
 | |
| @item mbstate_t *__statep
 | |
| The @code{__statep} element points to an object of type @code{mbstate_t}
 | |
| (@pxref{Keeping the state}).  The conversion of a stateful character
 | |
| set must use the object pointed to by @code{__statep} to store
 | |
| information about the conversion state.  The @code{__statep} element
 | |
| itself must never be modified.
 | |
| 
 | |
| @item mbstate_t __state
 | |
| This element must @emph{never} be used directly.  It is only part of
 | |
| this structure to have the needed space allocated.
 | |
| @end table
 | |
| @end deftp
 | |
| 
 | |
| @subsubsection @code{iconv} module interfaces
 | |
| 
 | |
| With the knowledge about the data structures we now can describe the
 | |
| conversion function itself.  To understand the interface a bit of
 | |
| knowledge is necessary about the functionality in the C library that
 | |
| loads the objects with the conversions.
 | |
| 
 | |
| It is often the case that one conversion is used more than once (i.e.,
 | |
| there are several @code{iconv_open} calls for the same set of character
 | |
| sets during one program run).  The @code{mbsrtowcs} et al.@: functions in
 | |
| the GNU C library also use the @code{iconv} functionality, which
 | |
| increases the number of uses of the same functions even more.
 | |
| 
 | |
| Because of this multiple use of conversions, the modules do not get
 | |
| loaded exclusively for one conversion.  Instead a module once loaded can
 | |
| be used by an arbitrary number of @code{iconv} or @code{mbsrtowcs} calls
 | |
| at the same time.  The splitting of the information between conversion-
 | |
| function-specific information and conversion data makes this possible.
 | |
| The last section showed the two data structures used to do this.
 | |
| 
 | |
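The split between shared and per-use information can be modelled with two small structures.  The types below are simplified stand-ins invented for this illustration (@code{struct mock_step} and @code{struct mock_use}); they are not the definitions from @file{gconv.h} and contain only the few members needed to make the point.

```c
#include <stddef.h>

/* Stand-in for the shared description of one conversion.  It is never
   modified by a conversion call.  */
struct mock_step
{
  const char *from_name;
  const char *to_name;
  int stateful;
};

/* Stand-in for the per-use data: one object per open conversion.  */
struct mock_use
{
  const struct mock_step *step;   /* Shared, read-only description.  */
  int invocation_counter;         /* Calls made through this use.  */
  int shift_state;                /* Conversion state of this use.  */
};

/* Start a new use of the conversion described by STEP.  */
struct mock_use
mock_open (const struct mock_step *step)
{
  struct mock_use u = { step, 0, 0 };
  return u;
}

/* Pretend to convert one block of text.  Only the per-use object
   changes.  Returns nonzero if this is the first call for this use,
   which is when a prolog would have to be emitted.  */
int
mock_convert (struct mock_use *use, int new_shift_state)
{
  int first_call = use->invocation_counter == 0;

  use->shift_state = new_shift_state;
  ++use->invocation_counter;
  return first_call;
}
```

Two uses opened from the same step keep independent counters and states, which is what allows one loaded module to serve any number of simultaneous conversions.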
| This is of course also reflected in the interface and semantics of the
 | |
| functions that the modules must provide.  There are three functions that
 | |
| must have the following names:
 | |
| 
 | |
| @table @code
 | |
| @item gconv_init
 | |
| The @code{gconv_init} function initializes the conversion function
 | |
| specific data structure.  This very same object is shared by all
 | |
| conversions that use this conversion and, therefore, no state information
 | |
| about the conversion itself must be stored in here.  If a module
 | |
| implements more than one conversion, the @code{gconv_init} function will
 | |
| be called multiple times.
 | |
| 
 | |
| @item gconv_end
 | |
| The @code{gconv_end} function is responsible for freeing all resources
 | |
| allocated by the @code{gconv_init} function.  If there is nothing to do,
 | |
| this function can be missing.  Special care must be taken if the module
 | |
| implements more than one conversion and the @code{gconv_init} function
 | |
| does not allocate the same resources for all conversions.
 | |
| 
 | |
| @item gconv
 | |
| This is the actual conversion function.  It is called to convert one
 | |
| block of text.  It gets passed the conversion step information
 | |
| initialized by @code{gconv_init} and the conversion data, specific to
 | |
| this use of the conversion functions.
 | |
| @end table
 | |
| 
 | |
| There are three data types defined for the three module interface
 | |
| functions and these define the interface.
 | |
| 
 | |
| @comment gconv.h
 | |
| @comment GNU
 | |
| @deftypevr {Data type} int {(*__gconv_init_fct)} (struct __gconv_step *)
 | |
| This specifies the interface of the initialization function of the
 | |
| module.  It is called exactly once for each conversion the module
 | |
| implements.
 | |
| 
 | |
| As explained in the description of the @code{struct __gconv_step} data
 | |
| structure above the initialization function has to initialize parts of
 | |
| it.
 | |
| 
 | |
| @table @code
 | |
| @item __min_needed_from
 | |
| @itemx __max_needed_from
 | |
| @itemx __min_needed_to
 | |
| @itemx __max_needed_to
 | |
| These elements must be initialized to the exact numbers of the minimum
 | |
| and maximum number of bytes used by one character in the source and
 | |
| destination character sets, respectively.  If the characters all have the
 | |
| same size, the minimum and maximum values are the same.
 | |
| 
 | |
| @item __stateful
 | |
| This element must be initialized to a nonzero value if the source
 | |
| character set is stateful.  Otherwise it must be zero.
 | |
| @end table
 | |
| 
 | |
| If the initialization function needs to communicate some information
 | |
| to the conversion function, this communication can happen using the
 | |
| @code{__data} element of the @code{__gconv_step} structure.  But since
 | |
| this data is shared by all the conversions, it must not be modified by
 | |
| the conversion function.  The example below shows how this can be used.
 | |
| 
 | |
| @smallexample
 | |
| #define MIN_NEEDED_FROM         1
 | |
| #define MAX_NEEDED_FROM         4
 | |
| #define MIN_NEEDED_TO           4
 | |
| #define MAX_NEEDED_TO           4
 | |
| 
 | |
| int
 | |
| gconv_init (struct __gconv_step *step)
 | |
| @{
 | |
|   /* @r{Determine which direction.}  */
 | |
|   struct iso2022jp_data *new_data;
 | |
|   enum direction dir = illegal_dir;
 | |
|   enum variant var = illegal_var;
 | |
|   int result;
 | |
| 
 | |
|   if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
 | |
|     @{
 | |
|       dir = from_iso2022jp;
 | |
|       var = iso2022jp;
 | |
|     @}
 | |
|   else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
 | |
|     @{
 | |
|       dir = to_iso2022jp;
 | |
|       var = iso2022jp;
 | |
|     @}
 | |
|   else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
 | |
|     @{
 | |
|       dir = from_iso2022jp;
 | |
|       var = iso2022jp2;
 | |
|     @}
 | |
|   else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
 | |
|     @{
 | |
|       dir = to_iso2022jp;
 | |
|       var = iso2022jp2;
 | |
|     @}
 | |
| 
 | |
|   result = __GCONV_NOCONV;
 | |
|   if (dir != illegal_dir)
 | |
|     @{
 | |
|       new_data = (struct iso2022jp_data *)
 | |
|         malloc (sizeof (struct iso2022jp_data));
 | |
| 
 | |
|       result = __GCONV_NOMEM;
 | |
|       if (new_data != NULL)
 | |
|         @{
 | |
|           new_data->dir = dir;
 | |
|           new_data->var = var;
 | |
|           step->__data = new_data;
 | |
| 
 | |
|           if (dir == from_iso2022jp)
 | |
|             @{
 | |
|               step->__min_needed_from = MIN_NEEDED_FROM;
 | |
|               step->__max_needed_from = MAX_NEEDED_FROM;
 | |
|               step->__min_needed_to = MIN_NEEDED_TO;
 | |
|               step->__max_needed_to = MAX_NEEDED_TO;
 | |
|             @}
 | |
|           else
 | |
|             @{
 | |
|               step->__min_needed_from = MIN_NEEDED_TO;
 | |
|               step->__max_needed_from = MAX_NEEDED_TO;
 | |
|               step->__min_needed_to = MIN_NEEDED_FROM;
 | |
|               step->__max_needed_to = MAX_NEEDED_FROM + 2;
 | |
|             @}
 | |
| 
 | |
|           /* @r{Yes, this is a stateful encoding.}  */
 | |
|           step->__stateful = 1;
 | |
| 
 | |
|           result = __GCONV_OK;
 | |
|         @}
 | |
|     @}
 | |
| 
 | |
|   return result;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| The function first checks which conversion is wanted.  The module from
 | |
| which this function is taken implements four different conversions;
 | |
| which one is selected can be determined by comparing the names.  The
 | |
| comparison should always be done without paying attention to the case.
 | |
| 
 | |
| Next, a data structure, which contains the necessary information about
 | |
| which conversion is selected, is allocated.  The data structure
 | |
| @code{struct iso2022jp_data} is locally defined since, outside the
 | |
| module, this data is not used at all.  Please note that if all four
 | |
| conversions this module supports are requested there are four data
 | |
| blocks.
 | |
| 
 | |
| One interesting thing is the initialization of the @code{__min_} and
 | |
| @code{__max_} elements of the step data object.  A single ISO-2022-JP
 | |
| character can consist of one to four bytes.  Therefore the
 | |
| @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
 | |
| this way.  The output is always the @code{INTERNAL} character set (aka
 | |
| UCS-4) and therefore each character consists of exactly four bytes.  For
 | |
| the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
 | |
| account that escape sequences might be necessary to switch the character
 | |
| sets.  Therefore the @code{__max_needed_to} element for this direction
 | |
| gets assigned @code{MAX_NEEDED_FROM + 2}.  This takes into account the
 | |
| two bytes needed for the escape sequences to signal the switching.  The
 | |
| asymmetry in the maximum values for the two directions can be explained
 | |
| easily: when reading ISO-2022-JP text, escape sequences can be handled
 | |
| alone (i.e., it is not necessary to process a real character since the
 | |
| effect of the escape sequence can be recorded in the state information).
 | |
| The situation is different for the other direction.  Since it is in
 | |
| general not known which character comes next, one cannot emit escape
 | |
| sequences to change the state in advance.  This means the escape
 | |
| sequences have to be emitted together with the next character.
 | |
| Therefore one needs more room than only for the character itself.
 | |
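
The limits stored by the initialization function can also be read as a
sizing rule for buffers.  The following is only a minimal sketch, not code
from the GNU C Library: @code{struct limits} and
@code{worst_case_output} are hypothetical names that mirror the four
@code{__min_}/@code{__max_} elements.  It assumes that every input
character occupies at least @code{min_needed_from} bytes and produces at
most @code{max_needed_to} bytes of output, escape sequences included.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the four limit elements set in gconv_init.  */
struct limits
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
};

/* Upper bound on the number of bytes one step can produce from
   IN_BYTES bytes of input.  At most IN_BYTES / min_needed_from complete
   characters fit in the input, and each of them needs at most
   max_needed_to bytes in the output.  */
size_t
worst_case_output (const struct limits *lim, size_t in_bytes)
{
  assert (lim->min_needed_from > 0);
  size_t max_chars = in_bytes / (size_t) lim->min_needed_from;
  return max_chars * (size_t) lim->max_needed_to;
}
```

With the values from the example above, the direction from
@code{INTERNAL} to ISO-2022-JP uses the limits 4, 4, 1, and 6, so ten
input characters (40 bytes) may need up to 60 bytes of output.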
| 
 | |
| The possible return values of the initialization function are:
 | |
| 
 | |
| @table @code
 | |
| @item __GCONV_OK
 | |
| The initialization succeeded.
 | |
| @item __GCONV_NOCONV
 | |
| The requested conversion is not supported in the module.  This can
 | |
| happen if the @file{gconv-modules} file has errors.
 | |
| @item __GCONV_NOMEM
 | |
| Memory required to store additional information could not be allocated.
 | |
| @end table
 | |
| @end deftypevr
 | |
| 
 | |
| The function called before the module is unloaded is significantly
 | |
| simpler.  It often has nothing at all to do, in which case it can be left
 | |
| out completely.
 | |
| 
 | |
| @comment gconv.h
 | |
| @comment GNU
 | |
| @deftypevr {Data type} void {(*__gconv_end_fct)} (struct __gconv_step *)
 | |
| The task of this function is to free all resources allocated in the
 | |
| initialization function.  Therefore only the @code{__data} element of
 | |
| the object pointed to by the argument is of interest.  Continuing the
 | |
| example from the initialization function, the finalization function
 | |
| looks like this:
 | |
| 
 | |
| @smallexample
 | |
| void
 | |
| gconv_end (struct __gconv_step *data)
 | |
| @{
 | |
|   free (data->__data);
 | |
| @}
 | |
| @end smallexample
 | |
| @end deftypevr
 | |
| 
 | |
| The most important function is the conversion function itself, which can
 | |
| get quite complicated for complex character sets.  But since this is not
 | |
| of interest here, we will only describe a possible skeleton for the
 | |
| conversion function.
 | |
| 
 | |
| @comment gconv.h
 | |
| @comment GNU
 | |
| @deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
 | |
| The conversion function can be called for two basic reasons: to convert
 | |
| text or to reset the state.  From the description of the @code{iconv}
 | |
| function it can be seen why the flushing mode is necessary.  What mode
 | |
| is selected is determined by the sixth argument, an integer.  This
 | |
| argument being nonzero means that flushing is selected.
 | |
| 
 | |
| Common to both modes is where the output buffer can be found.  The
 | |
| information about this buffer is stored in the conversion step data.  A
 | |
| pointer to this information is passed as the second argument to this
 | |
| function.  The description of the @code{struct __gconv_step_data}
 | |
| structure has more information on the conversion step data.
 | |
| 
 | |
| @cindex stateful
 | |
| What has to be done for flushing depends on the source character set.
 | |
| If the source character set is not stateful, nothing has to be done.
 | |
| Otherwise the function has to emit a byte sequence to bring the state
 | |
| object into the initial state.  Once all this has happened, the other
 | |
| conversion modules in the chain of conversions have to get the same
 | |
| chance.  Whether another step follows can be determined from the
 | |
| @code{__is_last} element of the step data structure to which the second
 | |
| parameter points.
 | |
| 
 | |
| The more interesting mode is when actual text has to be converted.  The
 | |
| first step in this case is to convert as much text as possible from the
 | |
| input buffer and store the result in the output buffer.  The start of the
 | |
| input buffer is determined by the third argument, which is a pointer to a
 | |
| pointer variable referencing the beginning of the buffer.  The fourth
 | |
| argument is a pointer to the byte right after the last byte in the buffer.
 | |
| 
 | |
| The conversion has to be performed according to the current state if the
 | |
| character set is stateful.  The state is stored in an object pointed to
 | |
| by the @code{__statep} element of the step data (second argument).  Once
 | |
| either the input buffer is empty or the output buffer is full the
 | |
| conversion stops.  At this point, the pointer variable referenced by the
 | |
| third parameter must point to the byte following the last processed
 | |
| byte (i.e., if all of the input is consumed, this pointer and the fourth
 | |
| parameter have the same value).
 | |
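
The pointer handling described above can be sketched with a toy
conversion.  This is an illustration under stated assumptions, not a
@code{gconv} function: @code{toy_convert} and the @code{TOY_} status
codes are hypothetical names, the source character set is assumed to be
ISO-8859-1 (one byte per character, no state), and the output is a
sequence of four-byte values in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes standing in for the __GCONV_* values.  */
enum toy_status { TOY_EMPTY_INPUT, TOY_FULL_OUTPUT };

/* Convert ISO-8859-1 bytes to four-byte code points.  On return *INBUF
   points to the byte following the last processed byte and *OUTBUF
   points after the last written byte.  */
enum toy_status
toy_convert (const unsigned char **inbuf, const unsigned char *inbufend,
             unsigned char **outbuf, unsigned char *outbufend)
{
  const unsigned char *in = *inbuf;
  unsigned char *out = *outbuf;
  enum toy_status status = TOY_EMPTY_INPUT;

  assert (in <= inbufend && out <= outbufend);

  while (in < inbufend)
    {
      /* Stop as soon as the next character does not fit.  */
      if ((size_t) (outbufend - out) < sizeof (uint32_t))
        {
          status = TOY_FULL_OUTPUT;
          break;
        }

      uint32_t wc = *in++;
      memcpy (out, &wc, sizeof wc);
      out += sizeof wc;
    }

  /* Report how far both buffers were used.  */
  *inbuf = in;
  *outbuf = out;
  return status;
}
```

If all of the input is consumed, @code{*inbuf} equals @code{inbufend} on
return, which matches the condition stated above for the third and
fourth parameters.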
| 
 | |
| What now happens depends on whether this step is the last one.  If it is
 | |
| the last step, the only thing that has to be done is to update the
 | |
| @code{__outbuf} element of the step data structure to point after the
 | |
| last written byte.  This update gives the caller the information on how
 | |
| much text is available in the output buffer.  In addition, the variable
 | |
| pointed to by the fifth parameter, which is of type @code{size_t}, must
 | |
| be incremented by the number of characters (@emph{not bytes}) that were
 | |
| converted in a non-reversible way.  Then, the function can return.
 | |
| 
 | |
| In case the step is not the last one, the later conversion functions have
 | |
| to get a chance to do their work.  Therefore, the appropriate conversion
 | |
| function has to be called.  The information about the functions is
 | |
| stored in the conversion data structures, passed as the first parameter.
 | |
| This information and the step data are stored in arrays, so the next
 | |
| element in both cases can be found by simple pointer arithmetic:
 | |
| 
 | |
| @smallexample
 | |
| int
 | |
| gconv (struct __gconv_step *step, struct __gconv_step_data *data,
 | |
|        const char **inbuf, const char *inbufend, size_t *written,
 | |
|        int do_flush)
 | |
| @{
 | |
|   struct __gconv_step *next_step = step + 1;
 | |
|   struct __gconv_step_data *next_data = data + 1;
 | |
|   @dots{}
 | |
| @end smallexample
 | |
| 
 | |
| The @code{next_step} pointer references the next step information and
 | |
| @code{next_data} the next data record.  The call of the next function
 | |
| therefore will look similar to this:
 | |
| 
 | |
| @smallexample
 | |
|   next_step->__fct (next_step, next_data, &outerr, outbuf,
 | |
|                     written, 0)
 | |
| @end smallexample
 | |
| 
 | |
| But this is not yet all.  Once the function call returns the conversion
 | |
| function might have some more to do.  If the return value of the function
 | |
| is @code{__GCONV_EMPTY_INPUT}, more room is available in the output
 | |
| buffer.  Unless the input buffer is empty, the conversion functions start
 | |
| all over again and process the rest of the input buffer.  If the return
 | |
| value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have
 | |
| to recover from this.
 | |
| 
 | |
| A requirement for the conversion function is that the input buffer
 | |
| pointer (the third argument) always point to the last character that
 | |
| was put in converted form into the output buffer.  This is trivially
 | |
| true after the conversion performed in the current step, but if the
 | |
| conversion functions deeper downstream stop prematurely, not all
 | |
| characters from the output buffer are consumed and, therefore, the input
 | |
| buffer pointers must be backed off to the right position.
 | |
| 
 | |
| Correcting the input buffers is easy to do if the input and output
 | |
| character sets have a fixed width for all characters.  In this situation
 | |
| we can compute how many characters are left in the output buffer and,
 | |
| therefore, can correct the input buffer pointer appropriately with a
 | |
| similar computation.  Things get tricky if either character set
 | |
| has characters represented with variable length byte sequences, and it
 | |
| gets even more complicated if the conversion has to take care of the
 | |
| state.  In these cases the conversion has to be performed once again, from
 | |
| the known state before the initial conversion (i.e., if necessary the
 | |
| state of the conversion has to be reset and the conversion loop has to be
 | |
| executed again).  The difference now is that it is known how much output
 | |
| must be created, and the conversion can stop before converting the first
 | |
| unused character.  Once this is done the input buffer pointers must be
 | |
| updated again and the function can return.
 | |
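
For the simple case, where both character sets have a fixed width, the
correction of the input pointer amounts to a short computation.  The
following sketch uses the hypothetical helper @code{back_off_input}; it
assumes that @code{outerr} marks how far the next step got in the output
buffer and that this position falls on a character boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Return the corrected input pointer for a fixed-width conversion.
   INSTART and OUTSTART are the buffer positions at the beginning of the
   round, OUTERR is the first output byte the next step did not consume,
   IN_WIDTH and OUT_WIDTH are the bytes per character on each side.  */
const char *
back_off_input (const char *instart, const char *outstart,
                const char *outerr, size_t in_width, size_t out_width)
{
  assert (out_width > 0 && outerr >= outstart);

  size_t out_bytes = (size_t) (outerr - outstart);
  /* The next step only consumes whole characters.  */
  assert (out_bytes % out_width == 0);

  size_t nchars = out_bytes / out_width;
  return instart + nchars * in_width;
}
```

For a conversion from a one-byte character set to @code{INTERNAL}, twelve
consumed output bytes correspond to three characters, so the input
pointer is set three bytes past its starting value.  The variable-width
and stateful cases cannot be handled this way and need the repeated
conversion described above.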
| 
 | |
| One final thing should be mentioned.  If it is necessary for the
 | |
| conversion to know whether it is the first invocation (in case a prolog
 | |
| has to be emitted), the conversion function should increment the
 | |
| @code{__invocation_counter} element of the step data structure just
 | |
| before returning to the caller.  See the description of the @code{struct
 | |
| __gconv_step_data} structure above for more information on how this can
 | |
| be used.
 | |
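
A minimal sketch of this use of the counter follows.  The structure
@code{struct toy_step_data} and the function @code{toy_emit_prolog} are
hypothetical stand-ins, not part of @file{gconv.h}, and the two-byte
prolog (a UTF-16 little-endian byte order mark) is only an assumed
example of what a conversion might have to emit first.  In this sketch
the function stands for one whole invocation of the conversion function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the elements of the step data used here.  */
struct toy_step_data
{
  char *outbuf;
  char *outbufend;
  int invocation_counter;
};

/* Emit the prolog on the first invocation only.  Returns the number of
   prolog bytes written, or -1 if the output buffer is too small.  */
int
toy_emit_prolog (struct toy_step_data *data)
{
  static const char prolog[] = { '\xff', '\xfe' };
  int written = 0;

  assert (data->outbuf <= data->outbufend);

  if (data->invocation_counter == 0)
    {
      if ((size_t) (data->outbufend - data->outbuf) < sizeof prolog)
        return -1;              /* Counter untouched, so a retry works.  */
      memcpy (data->outbuf, prolog, sizeof prolog);
      data->outbuf += sizeof prolog;
      written = (int) sizeof prolog;
    }

  /* Count this use of the step just before returning.  */
  ++data->invocation_counter;
  return written;
}
```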
| 
 | |
| The return value must be one of the following values:
 | |
| 
 | |
| @table @code
 | |
| @item __GCONV_EMPTY_INPUT
 | |
| All input was consumed and there is room left in the output buffer.
 | |
| @item __GCONV_FULL_OUTPUT
 | |
| No more room in the output buffer.  In case this is not the last step
 | |
| this value is propagated down from the call of the next conversion
 | |
| function in the chain.
 | |
| @item __GCONV_INCOMPLETE_INPUT
 | |
| The input buffer is not entirely empty since it contains an incomplete
 | |
| character sequence.
 | |
| @end table
 | |
| 
 | |
| The following example provides a framework for a conversion function.
 | |
| To write a new conversion, it is enough to fill the holes in this
 | |
| implementation.
 | |
| 
 | |
| @smallexample
 | |
| int
 | |
| gconv (struct __gconv_step *step, struct __gconv_step_data *data,
 | |
|        const char **inbuf, const char *inbufend, size_t *written,
 | |
|        int do_flush)
 | |
| @{
 | |
|   struct __gconv_step *next_step = step + 1;
 | |
|   struct __gconv_step_data *next_data = data + 1;
 | |
|   __gconv_fct fct = next_step->__fct;
 | |
|   int status;
 | |
| 
 | |
|   /* @r{If the function is called with no input this means we have}
 | |
|      @r{to reset to the initial state.  The possibly partly}
 | |
|      @r{converted input is dropped.}  */
 | |
|   if (do_flush)
 | |
|     @{
 | |
|       status = __GCONV_OK;
 | |
| 
 | |
|       /* @r{Possibly emit a byte sequence which puts the state object}
 | |
|          @r{into the initial state.}  */
 | |
| 
 | |
|       /* @r{Call the steps down the chain if there are any but only}
 | |
|          @r{if we successfully emitted the escape sequence.}  */
 | |
|       if (status == __GCONV_OK && ! data->__is_last)
 | |
|         status = fct (next_step, next_data, NULL, NULL,
 | |
|                       written, 1);
 | |
|     @}
 | |
|   else
 | |
|     @{
 | |
|       /* @r{We preserve the initial values of the pointer variables.}  */
 | |
|       const char *inptr = *inbuf;
 | |
|       char *outbuf = data->__outbuf;
 | |
|       char *outend = data->__outbufend;
 | |
|       char *outptr;
 | |
| 
 | |
|       do
 | |
|         @{
 | |
|           /* @r{Remember the start value for this round.}  */
 | |
|           inptr = *inbuf;
 | |
|           /* @r{The outbuf buffer is empty.}  */
 | |
|           outptr = outbuf;
 | |
| 
 | |
|           /* @r{For stateful encodings the state must be saved here.}  */
 | |
| 
 | |
|           /* @r{Run the conversion loop.  @code{status} is set}
 | |
|              @r{appropriately afterwards.}  */
 | |
| 
 | |
|           /* @r{If this is the last step, leave the loop. There is}
 | |
|              @r{nothing we can do.}  */
 | |
|           if (data->__is_last)
 | |
|             @{
 | |
|               /* @r{Store information about how many bytes are}
 | |
|                  @r{available.}  */
 | |
|               data->__outbuf = outbuf;
 | |
| 
 | |
|               /* @r{If any non-reversible conversions were performed,}
 | |
|                  @r{add the number to @code{*written}.}  */
 | |
| 
 | |
|               break;
 | |
|             @}
 | |
| 
 | |
|           /* @r{Write out all output that was produced.}  */
 | |
|           if (outbuf > outptr)
 | |
|             @{
 | |
|               const char *outerr = data->__outbuf;
 | |
|               int result;
 | |
| 
 | |
|               result = fct (next_step, next_data, &outerr,
 | |
|                             outbuf, written, 0);
 | |
| 
 | |
|               if (result != __GCONV_EMPTY_INPUT)
 | |
|                 @{
 | |
|                   if (outerr != outbuf)
 | |
|                     @{
 | |
|                       /* @r{Reset the input buffer pointer.  We}
 | |
|                          @r{document here the complex case.}  */
 | |
|                       size_t nstatus;
 | |
| 
 | |
|                       /* @r{Reload the pointers.}  */
 | |
|                       *inbuf = inptr;
 | |
|                       outbuf = outptr;
 | |
| 
 | |
|                       /* @r{Possibly reset the state.}  */
 | |
| 
 | |
|                       /* @r{Redo the conversion, but this time}
 | |
|                          @r{the end of the output buffer is at}
 | |
|                          @r{@code{outerr}.}  */
 | |
|                     @}
 | |
| 
 | |
|                   /* @r{Change the status.}  */
 | |
|                   status = result;
 | |
|                 @}
 | |
|               else
 | |
|                 /* @r{All the output is consumed, we can make}
 | |
|                    @r{another run if everything was ok.}  */
 | |
|                 if (status == __GCONV_FULL_OUTPUT)
 | |
|                   status = __GCONV_OK;
 | |
|             @}
 | |
|         @}
 | |
|       while (status == __GCONV_OK);
 | |
| 
 | |
|       /* @r{We finished one use of this step.}  */
 | |
|       ++data->__invocation_counter;
 | |
|     @}
 | |
| 
 | |
|   return status;
 | |
| @}
 | |
| @end smallexample
 | |
| @end deftypevr
 | |
| 
 | |
| This information should be sufficient to write new modules.  Anybody
 | |
| doing so should also take a look at the available source code in the GNU
 | |
| C library sources.  It contains many examples of working and optimized
 | |
| modules.
 | |
| 
 | |
| @c File charset.texi edited October 2001 by Dennis Grace, IBM Corporation
 |