mirror of https://sourceware.org/git/glibc.git, synced 2025-10-24 13:33:08 +03:00
| @node Message Translation, Searching and Sorting, Locales, Top
 | |
| @c %MENU% How to make the program speak the user's language
 | |
| @chapter Message Translation
 | |
| 
 | |
| The program's interface with the user should be designed to ease the user's
 | |
| task.  One way to ease the user's task is to use messages in whatever
 | |
| language the user prefers.
 | |
| 
 | |
| Printing messages in different languages can be implemented in different
 | |
| ways.  One could add all the different languages in the source code and
 | |
| choose among the variants every time a message has to be printed.  This is
 | |
| certainly not a good solution since extending the set of languages is
 | |
| cumbersome (the code must be changed) and the code itself can become
 | |
| really big with dozens of message sets.
 | |
| 
 | |
| A better solution is to keep the message sets for each language
 | |
| in separate files which are loaded at runtime depending on the language
 | |
| selection of the user.
 | |
| 
 | |
| @Theglibc{} provides two different sets of functions to support
 | |
| message translation.  The problem is that neither of the interfaces is
 | |
| officially defined by the POSIX standard.  The @code{catgets} family of
 | |
| functions is defined in the X/Open standard but this is derived from
 | |
| industry decisions and therefore not necessarily based on reasonable
 | |
| decisions.
 | |
| 
 | |
| As mentioned above the message catalog handling provides easy
 | |
| extendibility by using external data files which contain the message
 | |
| translations.  I.e., these files contain for each of the messages used
 | |
| in the program a translation for the appropriate language.  So the tasks
 | |
| of the message handling functions are
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| locate the external data file with the appropriate translations
 | |
| @item
 | |
| load the data and make it possible to address the messages
 | |
| @item
 | |
| map a given key to the translated message
 | |
| @end itemize
 | |
| 
 | |
| The two approaches mainly differ in the implementation of this last
 | |
| step.  Decisions made in the last step influence the rest of the design.
 | |
| 
 | |
| @menu
 | |
| * Message catalogs a la X/Open::  The @code{catgets} family of functions.
 | |
| * The Uniforum approach::         The @code{gettext} family of functions.
 | |
| @end menu
 | |
| 
 | |
| 
 | |
| @node Message catalogs a la X/Open
 | |
| @section X/Open Message Catalog Handling
 | |
| 
 | |
| The @code{catgets} functions are based on the simple scheme:
 | |
| 
 | |
| @quotation
 | |
| Associate every message to translate in the source code with a unique
 | |
| identifier.  To retrieve a message from a catalog file solely the
 | |
| identifier is used.
 | |
| @end quotation
 | |
| 
 | |
| This means for the author of the program that s/he will have to make
 | |
| sure the meaning of the identifier in the program code and in the
 | |
| message catalogs is always the same.
 | |
| 
 | |
| Before a message can be translated the catalog file must be located.
 | |
| The user of the program must be able to guide the responsible function
 | |
| to find whatever catalog the user wants.  This is separate from what
 | |
| the programmer had in mind.
 | |
| 
 | |
| All the types, constants and functions for the @code{catgets} functions
 | |
| are defined/declared in the @file{nl_types.h} header file.
 | |
| 
 | |
| @menu
 | |
| * The catgets Functions::      The @code{catgets} function family.
 | |
| * The message catalog files::  Format of the message catalog files.
 | |
| * The gencat program::         How to generate message catalogs files which
 | |
|                                 can be used by the functions.
 | |
| * Common Usage::               How to use the @code{catgets} interface.
 | |
| @end menu
 | |
| 
 | |
| 
 | |
| @node The catgets Functions
 | |
| @subsection The @code{catgets} function family
 | |
| 
 | |
| @comment nl_types.h
 | |
| @comment X/Open
 | |
| @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
 | |
| The @code{catopen} function tries to locate the message data file named
 | |
| @var{cat_name} and loads it when found.  The return value is of an
 | |
| opaque type and can be used in calls to the other functions to refer to
 | |
| this loaded catalog.
 | |
| 
 | |
| The return value is @code{(nl_catd) -1} in case the function failed and
 | |
| no catalog was loaded.  The global variable @var{errno} contains a code
 | |
| for the error causing the failure.  But even if the function call
 | |
| succeeded this does not mean that all messages can be translated.
 | |
| 
 | |
| Locating the catalog file must happen in a way which lets the user of
 | |
| the program influence the decision.  It is up to the user to decide
 | |
| about the language to use and sometimes it is useful to use alternate
 | |
| catalog files.  All this can be specified by the user by setting some
 | |
| environment variables.
 | |
| 
 | |
| The first problem is to find out where all the message catalogs are
 | |
| stored.  Every program could have its own place to keep all the
 | |
| different files but usually the catalog files are grouped by languages
 | |
| and the catalogs for all programs are kept in the same place.
 | |
| 
 | |
| @cindex NLSPATH environment variable
 | |
| To tell the @code{catopen} function where the catalog for the program
 | |
| can be found the user can set the environment variable @code{NLSPATH} to
 | |
| a value which describes her/his choice.  Since this value must be usable
 | |
| for different languages and locales it cannot be a simple string.
 | |
| Instead it is a format string (similar to @code{printf}'s).  An example
 | |
| is
 | |
| 
 | |
| @smallexample
 | |
| /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
 | |
| @end smallexample
 | |
| 
 | |
| First one can see that more than one directory can be specified (with
 | |
| the usual syntax of separating them by colons).  The next things to
 | |
| observe are the format elements, @code{%L} and @code{%N} in this case.
 | |
| The @code{catopen} function knows about several of them, and each is
 | |
| replaced by a different value.
 | |
| 
 | |
| @table @code
 | |
| @item %N
 | |
| This format element is substituted with the name of the catalog file.
 | |
| This is the value of the @var{cat_name} argument given to
 | |
| @code{catopen}.
 | |
| 
 | |
| @item %L
 | |
| This format element is substituted with the name of the currently
 | |
| selected locale for translating messages.  How this is determined is
 | |
| explained below.
 | |
| 
 | |
| @item %l
 | |
| (This is the lowercase ell.) This format element is substituted with the
 | |
| language element of the locale name.  The string describing the selected
 | |
| locale is expected to have the form
 | |
| @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
 | |
| first part @var{lang}.
 | |
| 
 | |
| @item %t
 | |
| This format element is substituted by the territory part @var{terr} of
 | |
| the name of the currently selected locale.  See the explanation of the
 | |
| format above.
 | |
| 
 | |
| @item %c
 | |
| This format element is substituted by the codeset part @var{codeset} of
 | |
| the name of the currently selected locale.  See the explanation of the
 | |
| format above.
 | |
| 
 | |
| @item %%
 | |
| Since @code{%} is used as a meta character there must be a way to
 | |
| express the @code{%} character in the result itself.  Using @code{%%}
 | |
| does this just like it works for @code{printf}.
 | |
| @end table
 | |
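The substitution rules in the table above can be sketched in a few lines of C.  This is only an illustration of the rules as described here, not the code @code{catopen} actually uses; the function name @code{expand_nls_template} and its buffer-based interface are assumptions made for this example.  It expands a single template (one colon-separated element of @code{NLSPATH}) for a locale name of the form @code{@var{lang}[_@var{terr}[.@var{codeset}]]}.

```c
#include <stddef.h>
#include <string.h>

/* Append LEN bytes of SRC to OUT (capacity CAP, current length *POS).
   Returns 0 on success, -1 if the result would not fit.  */
static int
append (char *out, size_t cap, size_t *pos, const char *src, size_t len)
{
  if (*pos + len >= cap)
    return -1;
  memcpy (out + *pos, src, len);
  *pos += len;
  out[*pos] = '\0';
  return 0;
}

/* Hypothetical helper: expand one NLSPATH template TMPL for the locale
   name LOCALE (lang[_terr[.codeset]]) and the catalog name NAME.
   Returns 0 on success, -1 on overflow or an unknown format element.  */
int
expand_nls_template (const char *tmpl, const char *locale,
                     const char *name, char *out, size_t cap)
{
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = locale[lang_len] == '_' ? locale + lang_len + 1 : "";
  size_t terr_len = strcspn (terr, ".");
  const char *dot = strchr (locale, '.');
  const char *codeset = dot != NULL ? dot + 1 : "";
  size_t pos = 0;

  if (cap == 0)
    return -1;
  out[0] = '\0';

  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      const char *src;
      size_t len;

      if (*p != '%')
        {
          src = p;              /* Ordinary character, copied as is.  */
          len = 1;
        }
      else
        switch (*++p)
          {
          case 'N': src = name; len = strlen (name); break;
          case 'L': src = locale; len = strlen (locale); break;
          case 'l': src = locale; len = lang_len; break;
          case 't': src = terr; len = terr_len; break;
          case 'c': src = codeset; len = strlen (codeset); break;
          case '%': src = "%"; len = 1; break;
          default:  return -1;  /* Unknown element or trailing '%'.  */
          }

      if (append (out, cap, &pos, src, len) != 0)
        return -1;
    }
  return 0;
}
```

For the locale name @code{de_DE.UTF-8} and the catalog name @code{hello.cat}, the template @code{/usr/share/locale/%L/%N} expands to @code{/usr/share/locale/de_DE.UTF-8/hello.cat}.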
| 
 | |
| 
 | |
| Using @code{NLSPATH} allows arbitrary directories to be searched for
 | |
| message catalogs while still allowing different languages to be used.
 | |
| If the @code{NLSPATH} environment variable is not set, the default value
 | |
| is
 | |
| 
 | |
| @smallexample
 | |
| @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| where @var{prefix} is given to @code{configure} while installing @theglibc{}
 | |
| (this value is in many cases @code{/usr} or the empty string).
 | |
| 
 | |
| The remaining problem is to decide which locale name must be used.  This
 | |
| name determines the substitution of the format elements mentioned above.
 | |
| First of all the user can specify a path in the message catalog name
 | |
| (i.e., the name contains a slash character).  In this situation the
 | |
| @code{NLSPATH} environment variable is not used.  The catalog must exist
 | |
| as specified in the program, perhaps relative to the current working
 | |
| directory.  This situation is not desirable and catalog names should
 | |
| never be written this way.  Besides, this behavior is not portable
 | |
| to all other platforms providing the @code{catgets} interface.
 | |
| 
 | |
| @cindex LC_ALL environment variable
 | |
| @cindex LC_MESSAGES environment variable
 | |
| @cindex LANG environment variable
 | |
| Otherwise the values of environment variables from the standard
 | |
| environment are examined (@pxref{Standard Environment}).  Which
 | |
| variables are examined is decided by the @var{flag} parameter of
 | |
| @code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
 | |
| in @file{nl_types.h}) then the @code{catopen} function uses the name of
 | |
| the locale currently selected for the @code{LC_MESSAGES} category.
 | |
| 
 | |
| If @var{flag} is zero the @code{LANG} environment variable is examined.
 | |
| This is a left-over from the early days where the concept of the locales
 | |
| had not even reached the level of POSIX locales.
 | |
| 
 | |
| The environment variable and the locale name should have a value of the
 | |
| form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.
 | |
| If no environment variable is set the @code{"C"} locale is used which
 | |
| prevents any translation.
 | |
| 
 | |
| The return value of @code{catgets} (described below) is in any case a valid string.  Either
 | |
| it is a translation from a message catalog or it is the same as the
 | |
| @var{string} parameter.  So a piece of code to decide whether a
 | |
| translation actually happened must look like this:
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   char *trans = catgets (desc, set, msg, input_string);
 | |
|   if (trans == input_string)
 | |
|     @{
 | |
|       /* Something went wrong.  */
 | |
|     @}
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| When an error occurred the global variable @var{errno} is set to
 | |
| 
 | |
| @table @var
 | |
| @item EBADF
 | |
| The catalog does not exist.
 | |
| @item ENOMSG
 | |
| The set/message tuple does not name an existing element in the
 | |
| message catalog.
 | |
| @end table
 | |
| 
 | |
| While it sometimes can be useful to test for errors programs normally
 | |
| will avoid any test.  If the translation is not available it is no big
 | |
| problem if the original, untranslated message is printed.  Either the
 | |
| user understands this as well or s/he will look for the reason why the
 | |
| messages are not translated.
 | |
| @end deftypefun
 | |
| 
 | |
| Please note that the currently selected locale does not depend on a call
 | |
| to the @code{setlocale} function.  It is not necessary that the locale
 | |
| data files for this locale exist and calling @code{setlocale} succeeds.
 | |
| The @code{catopen} function directly reads the values of the environment
 | |
| variables.
 | |
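A minimal sketch of opening a catalog and checking for failure might look as follows.  The helper name @code{open_catalog} is made up for this example.  The call to @code{setlocale} is a precaution added here, not something the text above requires: whether it is needed with @code{NL_CAT_LOCALE} may depend on the implementation, so treat it as an assumption.

```c
#include <locale.h>
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: open the catalog NAME for the locale selected
   for LC_MESSAGES.  Returns the descriptor, or (nl_catd) -1 after
   printing a diagnostic based on errno.  */
nl_catd
open_catalog (const char *name)
{
  /* Precaution (an assumption of this sketch): adopt the locale
     described by the environment before opening the catalog.  */
  setlocale (LC_ALL, "");

  nl_catd cat = catopen (name, NL_CAT_LOCALE);
  if (cat == (nl_catd) -1)
    perror (name);
  return cat;
}
```

A program can carry on with the returned descriptor even if it is @code{(nl_catd) -1}; the diagnostic is only there to help the user find out why messages are not translated.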
| 
 | |
| 
 | |
| @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
 | |
| The function @code{catgets} has to be used to access the message catalog
 | |
| previously opened using the @code{catopen} function.  The
 | |
| @var{catalog_desc} parameter must be a value previously returned by
 | |
| @code{catopen}.
 | |
| 
 | |
| The next two parameters, @var{set} and @var{message}, reflect the
 | |
| internal organization of the message catalog files.  This will be
 | |
| explained in detail below.  For now it is interesting to know that a
 | |
| catalog can consist of several sets and the messages in each set are
 | |
| individually numbered.  Neither the set numbers nor the message numbers
 | |
| need to be consecutive.  They can be arbitrarily chosen.
 | |
| But each message (unless equal to another one) must have its own unique
 | |
| pair of set and message number.
 | |
| 
 | |
| Since it is not guaranteed that the message catalog for the language
 | |
| selected by the user exists the last parameter @var{string} helps to
 | |
| handle this case gracefully.  If no matching string can be found
 | |
| @var{string} is returned.  This means for the programmer that
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| the @var{string} parameters should contain reasonable text (this also
 | |
| helps to understand the program since otherwise there would be no hint
 | |
| on the string which is expected to be returned).
 | |
| @item
 | |
| all @var{string} arguments should be written in the same language.
 | |
| @end itemize
 | |
| @end deftypefun
 | |
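The fallback behavior can be wrapped in a small helper.  The following is only a sketch; the name @code{lookup_message} and the @code{translated} output parameter are assumptions of this example.  It relies on the pointer comparison described above: if @code{catgets} hands back the very pointer that was passed as the default, no translation was found.

```c
#include <nl_types.h>
#include <stdbool.h>

/* Hypothetical helper: look up message MSG in set SET of the catalog
   CAT.  *TRANSLATED is set to true only if the returned pointer differs
   from FALLBACK, i.e. the text really came from the catalog.  */
const char *
lookup_message (nl_catd cat, int set, int msg, const char *fallback,
                bool *translated)
{
  const char *text = catgets (cat, set, msg, fallback);
  *translated = text != fallback;
  return text;
}
```

Because the default string is what the user sees when no catalog is available, it should be complete, readable text rather than a placeholder.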
| 
 | |
| It is somewhat uncomfortable to write a program using the @code{catgets}
 | |
| functions if no supporting functionality is available.  Since each
 | |
| set/message number tuple must be unique the programmer must keep lists
 | |
| of the messages at the same time the code is written.  And the work
 | |
| between several people working on the same project must be coordinated.
 | |
| We will see how these problems can be relaxed a bit (@pxref{Common
 | |
| Usage}).
 | |
| 
 | |
| @deftypefun int catclose (nl_catd @var{catalog_desc})
 | |
| The @code{catclose} function can be used to free the resources
 | |
| associated with a message catalog which previously was opened by a call
 | |
| to @code{catopen}.  If the resources can be successfully freed the
 | |
| function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
 | |
| global variable @var{errno} is set.  Errors can occur if the catalog
 | |
| descriptor @var{catalog_desc} is not valid in which case @var{errno} is
 | |
| set to @code{EBADF}.
 | |
| @end deftypefun
 | |
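As a sketch, closing a catalog with an error check could be written like this.  The wrapper name @code{close_catalog} is invented for this example, and skipping a descriptor of @code{(nl_catd) -1} (the value of a failed @code{catopen}) is a choice made here so that callers need not track whether opening succeeded.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: close CAT if it refers to an open catalog.
   Returns 0 on success or when there was nothing to close, and -1 if
   catclose reported an error.  */
int
close_catalog (nl_catd cat)
{
  /* A failed catopen left nothing to release.  */
  if (cat == (nl_catd) -1)
    return 0;

  if (catclose (cat) != 0)
    {
      perror ("catclose");
      return -1;
    }
  return 0;
}
```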
| 
 | |
| 
 | |
| @node The message catalog files
 | |
| @subsection  Format of the message catalog files
 | |
| 
 | |
| The only reasonable way to translate all the messages of a program and
 | |
| store the result in a message catalog file which can be read by the
 | |
| @code{catopen} function is to hand all the message text to the
 | |
| translator and let her/him translate them all.  I.e., we must have a
 | |
| file with entries which associate the set/message tuple with a specific
 | |
| translation.  This file format is specified in the X/Open standard and
 | |
| is as follows:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| Lines containing only whitespace characters or empty lines are ignored.
 | |
| 
 | |
| @item
 | |
| Lines which contain as the first non-whitespace character a @code{$}
 | |
| followed by a whitespace character are comments and are also ignored.
 | |
| 
 | |
| @item
 | |
| If a line contains as the first non-whitespace characters the sequence
 | |
| @code{$set} followed by a whitespace character an additional argument
 | |
| is required to follow.  This argument can either be:
 | |
| 
 | |
| @itemize @minus
 | |
| @item
 | |
| a number.  In this case the value of this number determines the set
 | |
| to which the following messages are added.
 | |
| 
 | |
| @item
 | |
| an identifier consisting of alphanumeric characters plus the underscore
 | |
| character.  In this case the set is automatically assigned a number.
 | |
| This value is one added to the largest set number which so far appeared.
 | |
| 
 | |
| How to use the symbolic names is explained in section @ref{Common Usage}.
 | |
| 
 | |
| It is an error if a symbol name appears more than once.  All following
 | |
| messages are placed in a set with this number.
 | |
| @end itemize
 | |
| 
 | |
| @item
 | |
| If a line contains as the first non-whitespace characters the sequence
 | |
| @code{$delset} followed by a whitespace character an additional argument
 | |
| is required to follow.  This argument can either be:
 | |
| 
 | |
| @itemize @minus
 | |
| @item
 | |
| a number.  In this case the value of this number determines the set
 | |
| which will be deleted.
 | |
| 
 | |
| @item
 | |
| an identifier consisting of alphanumeric characters plus the underscore
 | |
| character.  This symbolic identifier must match a name for a set which
 | |
| previously was defined.  It is an error if the name is unknown.
 | |
| @end itemize
 | |
| 
 | |
| In both cases all messages in the specified set will be removed.  They
 | |
| will not appear in the output.  But if this set is later selected again
 | |
| with a @code{$set} command, messages can be added again and these
 | |
| messages will appear in the output.
 | |
| 
 | |
| @item
 | |
| If a line contains after leading whitespace the sequence
 | |
| @code{$quote}, the quoting character used for this input file is
 | |
| changed to the first non-whitespace character following the
 | |
| @code{$quote}.  If no non-whitespace character is present before the
 | |
| line ends quoting is disabled.
 | |
| 
 | |
| By default no quoting character is used.  In this mode strings are
 | |
| terminated with the first unescaped line break.  If there is a
 | |
| @code{$quote} sequence present newlines need not be escaped.  Instead a
 | |
| string is terminated with the first unescaped appearance of the quote
 | |
| character.
 | |
| 
 | |
| A common usage of this feature would be to set the quote character to
 | |
| @code{"}.  Then any appearance of the @code{"} in the strings must
 | |
| be escaped using the backslash (i.e., @code{\"} must be written).
 | |
| 
 | |
| @item
 | |
| Any other line must start with a number or an alphanumeric identifier
 | |
| (with the underscore character included).  The following characters
 | |
| (starting after the first whitespace character) will form the string
 | |
| which gets associated with the currently selected set and the message
 | |
| number represented by the number and identifier respectively.
 | |
| 
 | |
| If the start of the line is a number the message number is obvious.  It
 | |
| is an error if the same message number already appeared for this set.
 | |
| 
 | |
| If the leading token was an identifier the message number gets
 | |
| automatically assigned.  The value is the current maximum message
 | |
| number for this set plus one.  It is an error if the identifier was
 | |
| already used for a message in this set.  It is OK to reuse the
 | |
| identifier for a message in another set.  How to use the symbolic
 | |
| identifiers will be explained below (@pxref{Common Usage}).  There is
 | |
| one limitation with the identifier: it must not be @code{Set}.  The
 | |
| reason will be explained below.
 | |
| 
 | |
| The text of the messages can contain escape sequences.  The usual
 | |
| sequences known from the @w{ISO C} language are recognized
 | |
| (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
 | |
| @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
 | |
| a character code).
 | |
| @end itemize
 | |
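To make the escape handling concrete, here is a rough sketch of a decoder for the sequences listed above.  It is an illustration only and not taken from @code{gencat}; the function name @code{decode_escapes} is hypothetical, and passing an unknown escaped character through unchanged (which also covers an escaped quote character) is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper: decode the escape sequences in SRC into OUT
   (capacity CAP).  Returns the decoded length, or (size_t) -1 if OUT
   is too small.  */
size_t
decode_escapes (const char *src, char *out, size_t cap)
{
  size_t n = 0;

  if (cap == 0)
    return (size_t) -1;

  while (*src != '\0')
    {
      char c = *src++;

      if (c == '\\' && *src != '\0')
        {
          char e = *src++;
          switch (e)
            {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'v': c = '\v'; break;
            case 'b': c = '\b'; break;
            case 'r': c = '\r'; break;
            case 'f': c = '\f'; break;
            case '\\': c = '\\'; break;
            default:
              if (e >= '0' && e <= '7')
                {
                  /* Octal character code: up to three digits.  */
                  int value = e - '0';
                  for (int i = 0; i < 2 && *src >= '0' && *src <= '7'; ++i)
                    value = value * 8 + (*src++ - '0');
                  c = (char) value;
                }
              else
                /* Assumption: any other escaped character stands for
                   itself, e.g. an escaped quote character.  */
                c = e;
              break;
            }
        }

      if (n + 1 >= cap)
        return (size_t) -1;
      out[n++] = c;
    }

  out[n] = '\0';
  return n;
}
```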
| 
 | |
| @strong{Important:} The handling of identifiers instead of numbers for
 | |
| the set and messages is a GNU extension.  Systems strictly following the
 | |
| X/Open specification do not have this feature.  An example for a message
 | |
| catalog file is this:
 | |
| 
 | |
| @smallexample
 | |
| $ This is a leading comment.
 | |
| $quote "
 | |
| 
 | |
| $set SetOne
 | |
| 1 Message with ID 1.
 | |
| two "   Message with ID \"two\", which gets the value 2 assigned"
 | |
| 
 | |
| $set SetTwo
 | |
| $ Since the last set got the number 1 assigned this set has number 2.
 | |
| 4000 "The numbers can be arbitrary, they need not start at one."
 | |
| @end smallexample
 | |
| 
 | |
| This small example shows various aspects:
 | |
| @itemize @bullet
 | |
| @item
 | |
| Lines 1 and 9 are comments since they start with @code{$} followed by
 | |
| a whitespace.
 | |
| @item
 | |
| The quoting character is set to @code{"}.  Otherwise the quotes in the
 | |
| message definition would have to be omitted and in this case the
 | |
| message with the identifier @code{two} would lose its leading whitespace.
 | |
| @item
 | |
| Mixing numbered messages with messages having symbolic names is no
 | |
| problem and the numbering happens automatically.
 | |
| @end itemize
 | |
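The automatic numbering used for symbolic set and message names follows one simple rule: take the largest number seen so far and add one.  A sketch of that rule, with the hypothetical name @code{next_auto_number}, is shown here.  Starting at 1 when no number has been used yet is inferred from the example above, where the first set gets the number 1.

```c
#include <stddef.h>

/* Hypothetical helper: return the number that automatic numbering
   would assign next, i.e. one more than the largest number in USED
   (COUNT entries).  With no numbers used so far the result is 1.  */
int
next_auto_number (const int *used, size_t count)
{
  int max = 0;

  for (size_t i = 0; i < count; ++i)
    if (used[i] > max)
      max = used[i];

  return max + 1;
}
```

In the example catalog the set @code{SetOne} already contains message 1 when the identifier @code{two} is seen, so @code{two} is assigned the number 2.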
| 
 | |
| 
 | |
| While this file format is pretty easy it is not the best possible for
 | |
| use in a running program.  The @code{catopen} function would have to
 | |
| parse the file and handle syntactic errors gracefully.  This is not so
 | |
| easy and the whole process is pretty slow.  Therefore the @code{catgets}
 | |
| functions expect the data in another more compact and ready-to-use file
 | |
| format.  There is a special program @code{gencat} which is explained in
 | |
| detail in the next section.
 | |
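To give an impression of what parsing the source format involves, the following sketch classifies a single line according to the rules listed earlier in this section.  It is not how @code{gencat} is implemented; the names @code{line_kind} and @code{classify_line} are invented for the illustration, and treating every other line starting with @code{$} as a comment is an assumption.

```c
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET, LINE_QUOTE, LINE_MESSAGE
};

/* Return nonzero if P starts with the directive WORD followed by
   whitespace or the end of the line.  */
static int
is_directive (const char *p, const char *word)
{
  size_t len = strlen (word);
  return (strncmp (p, word, len) == 0
          && (p[len] == '\0' || isspace ((unsigned char) p[len])));
}

/* Hypothetical helper: classify one line of a message catalog source
   file.  */
enum line_kind
classify_line (const char *line)
{
  /* Leading whitespace is not significant.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_BLANK;
  if (*line != '$')
    return LINE_MESSAGE;        /* Starts with a number or identifier.  */
  if (is_directive (line, "$set"))
    return LINE_SET;
  if (is_directive (line, "$delset"))
    return LINE_DELSET;
  if (is_directive (line, "$quote"))
    return LINE_QUOTE;

  /* "$" followed by whitespace is a comment.  Assumption: any other
     line starting with "$" is treated as a comment as well.  */
  return LINE_COMMENT;
}
```

A real parser additionally has to track the current set, the quote character, line continuation and duplicate numbers, which is part of the reason this work is done ahead of time.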
| 
 | |
| Files in this other format are not human readable.  To be easy to use by
 | |
| programs it is a binary file.  But the format is byte order independent
 | |
| so translation files can be shared by systems of arbitrary architecture
 | |
| (as long as they use @theglibc{}).
 | |
| 
 | |
| Details about the binary file format are not important to know since
 | |
| these files are always created by the @code{gencat} program.  The
 | |
| sources of @theglibc{} also provide the sources for the
 | |
| @code{gencat} program and so the interested reader can look through
 | |
| these source files to learn about the file format.
 | |
| 
 | |
| 
 | |
| @node The gencat program
 | |
| @subsection Generating message catalog files
 | |
| 
 | |
| @cindex gencat
 | |
| The @code{gencat} program is specified in the X/Open standard and the
 | |
| GNU implementation follows this specification and so processes
 | |
| all correctly formed input files.  Additionally some extensions are
 | |
| implemented which help to work in a more reasonable way with the
 | |
| @code{catgets} functions.
 | |
| 
 | |
| The @code{gencat} program can be invoked in two ways:
 | |
| 
 | |
| @example
 | |
| `gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]`
 | |
| @end example
 | |
| 
 | |
| This is the interface defined in the X/Open standard.  If no
 | |
| @var{Input-File} parameter is given input will be read from standard
 | |
| input.  Multiple input files will be read as if they are concatenated.
 | |
| If @var{Output-File} is also missing, the output will be written to
 | |
| standard output.  To provide the interface one is used to from other
 | |
| programs a second interface is provided.
 | |
| 
 | |
| @smallexample
 | |
| `gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}`
 | |
| @end smallexample
 | |
| 
 | |
| The option @samp{-o} is used to specify the output file and all file
 | |
| arguments are used as input files.
 | |
| 
 | |
| Besides this one can use @file{-} or @file{/dev/stdin} for
 | |
| @var{Input-File} to denote the standard input.  Correspondingly one can
 | |
| use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
 | |
| standard output.  Using @file{-} as a file name is allowed in X/Open
 | |
| while using the device names is a GNU extension.
 | |
| 
 | |
| The @code{gencat} program works by concatenating all input files and
 | |
| then @strong{merging} the resulting collection of message sets with a
 | |
| possibly existing output file.  This is done by removing all messages
 | |
| with set/message number tuples matching any of the generated messages
 | |
| from the output file and then adding all the new messages.  Regenerating
 | |
| a catalog file while ignoring the old contents therefore requires
 | |
| removing the output file if it exists.  If the output is
 | |
| written to standard output no merging takes place.
 | |
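The merge step described above can be modelled on in-memory arrays.  The sketch below is only a model of the described behavior, not code from @code{gencat}; the structure @code{struct message} and the function @code{merge_messages} are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical in-memory form of one catalog entry.  */
struct message
{
  int set;
  int number;
  const char *text;
};

/* Hypothetical helper: merge NEW_MSGS into OLD_MSGS.  Old entries
   whose (set, number) pair also occurs among the new messages are
   dropped, then all new messages are added.  OUT must have room for
   OLD_N + NEW_N entries.  Returns the number of entries written.  */
size_t
merge_messages (const struct message *old_msgs, size_t old_n,
                const struct message *new_msgs, size_t new_n,
                struct message *out)
{
  size_t n = 0;

  for (size_t i = 0; i < old_n; ++i)
    {
      int replaced = 0;

      for (size_t j = 0; j < new_n; ++j)
        if (old_msgs[i].set == new_msgs[j].set
            && old_msgs[i].number == new_msgs[j].number)
          {
            replaced = 1;
            break;
          }

      if (!replaced)
        out[n++] = old_msgs[i];
    }

  for (size_t j = 0; j < new_n; ++j)
    out[n++] = new_msgs[j];

  return n;
}
```

Old messages that are not mentioned in the new input survive the merge, which is why a stale output file has to be removed to get a clean catalog.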
| 
 | |
| @noindent
 | |
| The following table shows the options understood by the @code{gencat}
 | |
| program.  The X/Open standard does not specify any option for the
 | |
| program so all of these are GNU extensions.
 | |
| 
 | |
| @table @samp
 | |
| @item -V
 | |
| @itemx --version
 | |
| Print the version information and exit.
 | |
| @item -h
 | |
| @itemx --help
 | |
| Print a usage message listing all available options, then exit successfully.
 | |
| @item --new
 | |
| Never merge the new messages from the input files with the old content
 | |
| of the output files.  The old content of the output file is discarded.
 | |
| @item -H
 | |
| @itemx --header=name
 | |
| This option is used to emit the symbolic names given to sets and
 | |
| messages in the input files for use in the program.  Details about how
 | |
| to use this are given in the next section.  The @var{name} parameter to
 | |
| this option specifies the name of the output file.  It will contain a
 | |
| number of C preprocessor @code{#define}s to associate a name with a
 | |
| number.
 | |
| 
 | |
| Please note that the generated file only contains the symbols from the
 | |
| input files.  If the output is merged with the previous content of the
 | |
| output file the possibly existing symbols from the file(s) which
 | |
| generated the old output files are not in the generated header file.
 | |
| @end table
 | |
| 
 | |
| 
 | |
| @node Common Usage
 | |
| @subsection How to use the @code{catgets} interface
 | |
| 
 | |
| The @code{catgets} functions can be used in two different ways: by
 | |
| following the X/Open specs slavishly and not relying on the extensions,
 | |
| or by using the GNU extensions.  We will take a look at the former
 | |
| method first to understand the benefits of extensions.
 | |
| 
 | |
| @subsubsection Not using symbolic names
 | |
| 
 | |
| Since the X/Open format of the message catalog files does not allow
 | |
| symbol names we have to work with numbers all the time.  When we start
 | |
| writing a program we have to replace all appearances of translatable
 | |
| strings with something like
 | |
| 
 | |
| @smallexample
 | |
| catgets (catdesc, set, msg, "string")
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| @var{catdesc} is retrieved from a call to @code{catopen} which is
 | |
| normally done once at the program start.  The @code{"string"} is the
 | |
| string we want to translate.  The problems start with the set and
 | |
| message numbers.
 | |
| 
 | |
| In a bigger program several programmers usually work at the same time on
 | |
| the program and so coordinating the number allocation is crucial.
 | |
| While no two different strings may be indexed by the same tuple of
 | |
| numbers it is highly desirable to reuse the numbers for equal strings
 | |
| with equal translations (please note that there might be strings which
 | |
| are equal in one language but have different translations due to
 | |
| different contexts).
 | |
| 
 | |
| The allocation process can be relaxed a bit by different set numbers for
 | |
| different parts of the program.  So the number of developers who have to
 | |
| coordinate the allocation can be reduced.  But lists must still be kept
 | |
| to track the allocation and errors can easily happen.  These errors
 | |
| cannot be discovered by the compiler or the @code{catgets} functions.
 | |
| Only the user of the program might see wrong messages printed.  In the
 | |
| worst cases the messages are so misleading that they cannot be
 | |
| recognized as wrong.  Think about the translations for @code{"true"} and
 | |
| @code{"false"} being exchanged.  This could result in a disaster.
 | |
| 
 | |
| 
 | |
| @subsubsection Using symbolic names
 | |
| 
 | |
| The problems mentioned in the last section derive from the fact that:
 | |
| 
 | |
| @enumerate
 | |
| @item
 | |
| the numbers are allocated once and due to the possibly frequent use of
 | |
| them it is difficult to change a number later.
 | |
| @item
 | |
| the numbers do not allow one to guess anything about the string and
 | |
| therefore collisions can easily happen.
 | |
| @end enumerate
 | |
| 
 | |
| By consistently using symbolic names and by providing a method which maps
 | |
| the string content to a symbolic name (however this will happen) one can
 | |
| prevent both problems above.  The cost of this is that the programmer
 | |
| has to write a complete message catalog file while s/he is writing the
 | |
| program itself.
 | |
| 
 | |
| This is necessary since the symbolic names must be mapped to numbers
 | |
| before the program sources can be compiled.  In the last section it was
 | |
| described how to generate a header containing the mapping of the names.
 | |
| E.g., for the example message file given in the last section we could
 | |
| call the @code{gencat} program as follows (assume @file{ex.msg} contains
 | |
| the sources).
 | |
| 
 | |
| @smallexample
 | |
| gencat -H ex.h -o ex.cat ex.msg
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| This generates a header file with the following content:
 | |
| 
 | |
| @smallexample
 | |
| #define SetTwoSet 0x2   /* ex.msg:8 */
 | |
| 
 | |
| #define SetOneSet 0x1   /* ex.msg:4 */
 | |
| #define SetOnetwo 0x2   /* ex.msg:6 */
 | |
| @end smallexample
 | |
| 
 | |
| As can be seen the various symbols given in the source file are mangled
 | |
| to generate unique identifiers and these identifiers get numbers
 | |
| assigned.  Reading the source file and knowing about the rules
 | |
| allows one to predict the content of the header file (it is deterministic)
 | |
| but this is not necessary.  The @code{gencat} program can take care of
 | |
| everything.  All the programmer has to do is to put the generated header
 | |
| file in the dependency list of the source files of her/his project and
 | |
| to add a rule to regenerate the header if any of the input files
 | |
| changes.
 | |
| 
 | |
| One word about the symbol mangling.  Every symbol consists of two parts:
 | |
| the name of the message set plus the name of the message or the special
 | |
| string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
 | |
| access the translation with identifier @code{two} in the message set
 | |
| @code{SetOne}.
 | |
| 
 | |
| The other names denote the names of the message sets.  The special
 | |
| string @code{Set} is used in the place of the message identifier.
 | |
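The mangling rule just described is simple enough to write down as code.  This sketch only mirrors the rule as stated here and is not the @code{gencat} implementation; the function name @code{mangle_name} and the convention of passing @code{NULL} for the set's own symbol are assumptions.

```c
#include <stdio.h>

/* Hypothetical helper: build the macro name for the message MSG_NAME
   in the set SET_NAME.  A NULL MSG_NAME yields the name of the set
   itself, formed with the special string "Set".  Returns 0 on success
   and -1 if OUT (capacity CAP) is too small.  */
int
mangle_name (const char *set_name, const char *msg_name,
             char *out, size_t cap)
{
  int len = snprintf (out, cap, "%s%s", set_name,
                      msg_name != NULL ? msg_name : "Set");
  return (len < 0 || (size_t) len >= cap) ? -1 : 0;
}
```

For the set @code{SetOne} this gives @code{SetOneSet} for the set and @code{SetOnetwo} for the message @code{two}, which also shows why a message must not itself be called @code{Set}.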
| 
 | |
| If in the code the second string of the set @code{SetOne} is used the C
 | |
| code should look like this:
 | |
| 
 | |
| @smallexample
 | |
| catgets (catdesc, SetOneSet, SetOnetwo,
 | |
|          "   Message with ID \"two\", which gets the value 2 assigned")
 | |
| @end smallexample
 | |
| 
 | |
| Writing the function call this way allows changing the message number
 | |
| and even the set number without requiring any change in the C source
 | |
| code.  (The text of the string is normally not the same; this is only
 | |
| for this example.)
 | |
| 
 | |
| 
 | |
| @subsubsection How this supports development
 | |
| 
 | |
| To illustrate the usual way to work with the symbolic names
 | |
| here is a little example.  Assume we want to write the very complex and
 | |
| famous greeting program.  We start by writing the code as usual:
 | |
| 
 | |
| @smallexample
 | |
| #include <stdio.h>
 | |
| int
 | |
| main (void)
 | |
| @{
 | |
|   printf ("Hello, world!\n");
 | |
|   return 0;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| Now we want to internationalize the message and therefore replace the
 | |
| message with whatever the user wants.
 | |
| 
 | |
| @smallexample
 | |
| #include <nl_types.h>
 | |
| #include <stdio.h>
 | |
| #include "msgnrs.h"
 | |
| int
 | |
| main (void)
 | |
| @{
 | |
|   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
 | |
|   printf (catgets (catdesc, MainSet, MainHello,
 | |
|                    "Hello, world!\n"));
 | |
|   catclose (catdesc);
 | |
|   return 0;
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| We see how the catalog object is opened and the returned descriptor used
 | |
| in the other function calls.  It is not really necessary to check for
 | |
| failure of any of the functions since even in these situations the
 | |
| functions will behave reasonably.  They will simply return the default
 | |
| string instead of a translation.
 | |
| 
 | |
| What remains unspecified here are the constants @code{MainSet} and
 | |
| @code{MainHello}.  These are the symbolic names describing the
 | |
| message.  To get the actual definitions which match the information in
 | |
| the catalog file we have to create the message catalog source file and
 | |
| process it using the @code{gencat} program.
 | |
| 
 | |
| @smallexample
 | |
| $ Messages for the famous greeting program.
 | |
| $quote "
 | |
| 
 | |
| $set Main
 | |
| Hello "Hallo, Welt!\n"
 | |
| @end smallexample
 | |
| 
 | |
| Now we can start building the program (assume the message catalog source
 | |
| file is named @file{hello.msg} and the program source file @file{hello.c}):
 | |
| 
 | |
| @smallexample
 | |
| % gencat -H msgnrs.h -o hello.cat hello.msg
 | |
| % cat msgnrs.h
 | |
| #define MainSet 0x1     /* hello.msg:4 */
 | |
| #define MainHello 0x1   /* hello.msg:5 */
 | |
| % gcc -o hello hello.c -I.
 | |
| % cp hello.cat /usr/share/locale/de/LC_MESSAGES
 | |
| % echo $LC_ALL
 | |
| de
 | |
| % ./hello
 | |
| Hallo, Welt!
 | |
| %
 | |
| @end smallexample
 | |
| 
 | |
| The call of the @code{gencat} program creates the missing header file
 | |
| @file{msgnrs.h} as well as the message catalog binary.  The former is
 | |
| used in the compilation of @file{hello.c} while the latter is placed in a
 | |
| directory in which the @code{catopen} function will try to locate it.
 | |
| Please check the @code{LC_ALL} environment variable and the default path
 | |
| for @code{catopen} presented in the description above.
 | |
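Although the functions tolerate failure, a program may still want to know
whether a catalog was actually found.  The following is a minimal sketch of
such a check.  The helper name @code{lookup_message} and its parameters are
hypothetical and not part of any library; only @code{catopen},
@code{catgets}, and @code{catclose} are real interfaces.

```c
#include <nl_types.h>
#include <stdio.h>

/* Copy the translation of message MSG in set SET of the catalog named
   CATALOG into BUF (LEN bytes).  FALLBACK is copied instead if the
   catalog or the message cannot be found.  Returns 1 if a catalog was
   opened and 0 otherwise.  Hypothetical helper for illustration.  */
int
lookup_message (const char *catalog, int set, int msg,
                const char *fallback, char *buf, size_t len)
{
  nl_catd catdesc = catopen (catalog, NL_CAT_LOCALE);
  int opened = catdesc != (nl_catd) -1;

  /* catgets returns FALLBACK itself when the message is missing.  The
     text is copied before the catalog is closed because the returned
     string may point into the catalog data.  */
  const char *text = opened ? catgets (catdesc, set, msg, fallback)
                            : fallback;
  snprintf (buf, len, "%s", text);

  if (opened)
    catclose (catdesc);
  return opened;
}
```

A caller could pass the symbolic constants from the generated header, for
instance @code{lookup_message ("hello.cat", MainSet, MainHello, "Hello, world!\n", buf, sizeof buf)},
and use the return value only for diagnostics.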
| 
 | |
| 
 | |
| @node The Uniforum approach
 | |
| @section The Uniforum approach to Message Translation
 | |
| 
 | |
| Sun Microsystems tried to standardize a different approach to message
 | |
| translation in the Uniforum group.  There never was a real standard
 | |
| defined but still the interface was used in Sun's operating systems.
 | |
| Since this approach fits better in the development process of free
 | |
| software it is also used throughout the GNU project and the GNU
 | |
| @file{gettext} package provides support for this outside @theglibc{}.
 | |
| 
 | |
| The code of the @file{libintl} from GNU @file{gettext} is the same as
 | |
| the code in @theglibc{}.  So the documentation in the GNU
 | |
| @file{gettext} manual is also valid for the functionality here.  The
 | |
| following text will describe the library functions in detail.  But the
 | |
| numerous helper programs are not described in this manual.  Instead
 | |
| people should read the GNU @file{gettext} manual
 | |
| (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
 | |
| We will only give a short overview.
 | |
| 
 | |
| Though the @code{catgets} functions are available by default on more
 | |
| systems, the @code{gettext} interface is at least as portable as the
 | |
| former.  The GNU @file{gettext} package can be used wherever the
 | |
| functions are not available.
 | |
| 
 | |
| 
 | |
| @menu
 | |
| * Message catalogs with gettext::  The @code{gettext} family of functions.
 | |
| * Helper programs for gettext::    Programs to handle message catalogs
 | |
|                                     for @code{gettext}.
 | |
| @end menu
 | |
| 
 | |
| 
 | |
| @node Message catalogs with gettext
 | |
| @subsection The @code{gettext} family of functions
 | |
| 
 | |
| The paradigm underlying the @code{gettext} approach to message
 | |
| translation is different from that of the @code{catgets} functions, but the
 | |
| basic functionality is equivalent.  There are functions of the following
 | |
| categories:
 | |
| 
 | |
| @menu
 | |
| * Translation with gettext::       What has to be done to translate a message.
 | |
| * Locating gettext catalog::       How to determine which catalog to be used.
 | |
| * Advanced gettext functions::     Additional functions for more complicated
 | |
|                                     situations.
 | |
| * Charset conversion in gettext::  How to specify the output character set
 | |
|                                     @code{gettext} uses.
 | |
| * GUI program problems::           How to use @code{gettext} in GUI programs.
 | |
| * Using gettextized software::     The possibilities of the user to influence
 | |
|                                     the way @code{gettext} works.
 | |
| @end menu
 | |
| 
 | |
| @node Translation with gettext
 | |
| @subsubsection What has to be done to translate a message?
 | |
| 
 | |
| The @code{gettext} functions have a very simple interface.  The most
 | |
| basic function just takes the string which shall be translated as the
 | |
| argument and it returns the translation.  This is fundamentally
 | |
| different from the @code{catgets} approach where an extra key is
 | |
| necessary and the original string is only used for the error case.
 | |
| 
 | |
| If the string which has to be translated is the only argument this of
 | |
| course means the string itself is the key.  I.e., the translation will
 | |
| be selected based on the original string.  The message catalogs must
 | |
| therefore contain the original strings plus one translation for any such
 | |
| string.  The task of the @code{gettext} function is to compare the
 | |
| argument string with the available strings in the catalog and return the
 | |
| appropriate translation.  Of course this process is optimized so that
 | |
| it is not more expensive than an access using an atomic key
 | |
| like in @code{catgets}.
 | |
| 
 | |
| The @code{gettext} approach has some advantages but also some
 | |
| disadvantages.  Please see the GNU @file{gettext} manual for a detailed
 | |
| discussion of the pros and cons.
 | |
| 
 | |
| All the definitions and declarations for @code{gettext} can be found in
 | |
| the @file{libintl.h} header file.  On systems where these functions are
 | |
| not part of the C library they can be found in a separate library named
 | |
| @file{libintl.a} (or accordingly different for shared libraries).
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} gettext (const char *@var{msgid})
 | |
| The @code{gettext} function searches the currently selected message
 | |
| catalogs for a string which is equal to @var{msgid}.  If there is such a
 | |
| string available it is returned.  Otherwise the argument string
 | |
| @var{msgid} is returned.
 | |
| 
 | |
| Please note that although the return value is @code{char *} the
 | |
| returned string must not be changed.  This broken type results from the
 | |
| history of the function and does not reflect the way the function should
 | |
| be used.
 | |
| 
 | |
| Please note that above we wrote ``message catalogs'' (plural).  This is
 | |
| a specialty of the GNU implementation of these functions and we will
 | |
| say more about this when we talk about the ways message catalogs are
 | |
| selected (@pxref{Locating gettext catalog}).
 | |
| 
 | |
| The @code{gettext} function does not modify the value of the global
 | |
| @var{errno} variable.  This is necessary to make it possible to write
 | |
| something like
 | |
| 
 | |
| @smallexample
 | |
|   printf (gettext ("Operation failed: %m\n"));
 | |
| @end smallexample
 | |
| 
 | |
| Here the @var{errno} value is used in the @code{printf} function while
 | |
| processing the @code{%m} format element and if the @code{gettext}
 | |
| function changed this value (it is called before @code{printf} is
 | |
| called) we would get a wrong message.
 | |
| 
 | |
| So there is no easy way to detect a missing message catalog besides
 | |
| comparing the argument string with the result.  But it is normally the
 | |
| task of the user to react to missing catalogs.  The program cannot guess
 | |
| when a message catalog is really necessary since a user who speaks
 | |
| the language the program was developed in does not need any translation.
 | |
| @end deftypefun
 | |
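The two properties described for @code{gettext} above, returning the
argument when nothing is found and leaving @var{errno} untouched, can be
sketched as follows.  The helper names @code{has_translation} and
@code{errno_survives_gettext} are hypothetical and exist only for this
illustration.

```c
#include <errno.h>
#include <libintl.h>
#include <string.h>

/* Return nonzero if MSGID has a translation in the currently selected
   catalogs.  This relies on gettext handing back its argument when
   nothing is found, so a translation that is identical to the original
   text is reported as missing.  Hypothetical helper.  */
int
has_translation (const char *msgid)
{
  const char *text = gettext (msgid);
  return text != msgid && strcmp (text, msgid) != 0;
}

/* Return nonzero if errno still holds VALUE after a gettext call,
   which is what makes constructs using the %m conversion safe.
   Hypothetical helper.  */
int
errno_survives_gettext (int value)
{
  errno = value;
  (void) gettext ("Operation failed");
  return errno == value;
}
```

In a program without any installed catalog, @code{has_translation} is
expected to report zero for every string.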
| 
 | |
| The remaining two functions to access the message catalog add some
 | |
| functionality to select a message catalog which is not the default one.
 | |
| This is important if parts of the program are developed independently.
 | |
| Every part can have its own message catalog and all of them can be used
 | |
| at the same time.  The C library itself is an example: internally it
 | |
| uses the @code{gettext} functions but since it must not depend on a
 | |
| currently selected default message catalog it must specify all ambiguous
 | |
| information.
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
 | |
| The @code{dgettext} function acts just like the @code{gettext}
 | |
| function.  It only takes an additional first argument @var{domainname}
 | |
| which guides the selection of the message catalogs which are searched
 | |
| for the translation.  If the @var{domainname} parameter is the null
 | |
| pointer the @code{dgettext} function is exactly equivalent to
 | |
| @code{gettext} since the default value for the domain name is used.
 | |
| 
 | |
| As for @code{gettext} the return value type is @code{char *} which is an
 | |
| anachronism.  The returned string must never be modified.
 | |
| @end deftypefun
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
 | |
| The @code{dcgettext} function adds another argument to those which
 | |
| @code{dgettext} takes.  This argument @var{category} specifies the last
 | |
| piece of information needed to locate the message catalog.  I.e., the
 | |
| domain name and the locale category exactly specify which message
 | |
| catalog has to be used (relative to a given directory, see below).
 | |
| 
 | |
| The @code{dgettext} function can be expressed in terms of
 | |
| @code{dcgettext} by using
 | |
| 
 | |
| @smallexample
 | |
| dcgettext (domain, string, LC_MESSAGES)
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| instead of
 | |
| 
 | |
| @smallexample
 | |
| dgettext (domain, string)
 | |
| @end smallexample
 | |
| 
 | |
| This also shows which values are expected for the third parameter.  One
 | |
| has to use the available selectors for the categories available in
 | |
| @file{locale.h}.  Normally the available values are @code{LC_CTYPE},
 | |
| @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
 | |
| @code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
 | |
| must not be used and even though the names might suggest this, there is
 | |
| no relation to the environment variables of this name.
 | |
| 
 | |
| The @code{dcgettext} function is only implemented for compatibility with
 | |
| other systems which have @code{gettext} functions.  There is not really
 | |
| any situation where it is necessary (or useful) to use a value other
 | |
| than @code{LC_MESSAGES} for the @var{category} parameter.  We are
 | |
| dealing with messages here and any other choice can only be confusing.
 | |
| 
 | |
| As for @code{gettext} the return value type is @code{char *} which is an
 | |
| anachronism.  The returned string must never be modified.
 | |
| @end deftypefun
 | |
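A library that ships its own catalog would typically name its domain
explicitly instead of relying on the default domain of the program.  The
sketch below assumes a hypothetical library with the domain name
@code{libfoo}; the function names are invented for this example.  The second
function spells out the same lookup with @code{dcgettext}.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical domain name of a library with its own catalog.  */
#define LIBFOO_DOMAIN "libfoo"

/* Translate MSGID using the library's own domain, independent of
   whatever default domain the calling program has selected.  */
const char *
libfoo_message (const char *msgid)
{
  return dgettext (LIBFOO_DOMAIN, msgid);
}

/* The same lookup written with dcgettext.  LC_MESSAGES is the
   category dgettext uses implicitly.  */
const char *
libfoo_message_explicit (const char *msgid)
{
  return dcgettext (LIBFOO_DOMAIN, msgid, LC_MESSAGES);
}
```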
| 
 | |
| When using the three functions above in a program it is a frequent case
 | |
| that the @var{msgid} argument is a constant string.  So it is worth
 | |
| optimizing this case.  A moment's thought shows that
 | |
| as long as no new message catalog is loaded the translation of a message
 | |
| will not change.  This optimization is actually implemented by the
 | |
| @code{gettext}, @code{dgettext} and @code{dcgettext} functions.
 | |
| 
 | |
| 
 | |
| @node Locating gettext catalog
 | |
| @subsubsection How to determine which catalog to be used
 | |
| 
 | |
| The functions to retrieve the translations for a given message have a
 | |
| remarkably simple interface.  But to still give the user of the program
 | |
| the opportunity to select exactly the translation s/he wants and
 | |
| also to give the programmer the possibility to influence the way
 | |
| catalog files are located, there is a quite complicated
 | |
| underlying mechanism which controls all this.  The code is complicated,
 | |
| the use is easy.
 | |
| 
 | |
| Basically we have two different tasks to perform which can also be
 | |
| performed by the @code{catgets} functions:
 | |
| 
 | |
| @enumerate
 | |
| @item
 | |
| Locate the set of message catalogs.  There are a number of files for
 | |
| different languages which all belong to the package.  Usually they
 | |
| are all stored in the filesystem below a certain directory.
 | |
| 
 | |
| There can be arbitrarily many packages installed and they can follow
 | |
| different guidelines for the placement of their files.
 | |
| 
 | |
| @item
 | |
| Relative to the location specified by the package the actual translation
 | |
| files must be searched, based on the wishes of the user.  I.e., for each
 | |
| language the user selects the program should be able to locate the
 | |
| appropriate file.
 | |
| @end enumerate
 | |
| 
 | |
| This is the functionality required by the specifications for
 | |
| @code{gettext} and this is also what the @code{catgets} functions are
 | |
| able to do.  But there are some problems unresolved:
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| The language to be used can be specified in several different ways.
 | |
| There is no generally accepted standard for this and the user always
 | |
| expects the program to understand what s/he means.  E.g., to select the
 | |
| German translation one could write @code{de}, @code{german}, or
 | |
| @code{deutsch} and the program should always react the same.
 | |
| 
 | |
| @item
 | |
| Sometimes the specification of the user is too detailed.  If s/he, e.g.,
 | |
| specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
 | |
| coded using the @w{ISO 8859-1} character set there is the possibility
 | |
| that a message catalog matching this exactly is not available.  But
 | |
| there could be a catalog matching @code{de} and if the character set
 | |
| used on the machine is always @w{ISO 8859-1} there is no reason why this
 | |
| latter message catalog should not be used.  (We call this @dfn{message
 | |
| inheritance}.)
 | |
| 
 | |
| @item
 | |
| If a catalog for a wanted language is not available it is not always the
 | |
| second best choice to fall back on the language of the developer and
 | |
| simply not translate any message.  Instead a user might be better able
 | |
| to read the messages in another language and so the user of the program
 | |
| should be able to define a precedence order of languages.
 | |
| @end itemize
 | |
| 
 | |
| We can divide the configuration actions into two parts: the one is
 | |
| performed by the programmer, the other by the user.  We will start with
 | |
| the functions the programmer can use since the user configuration will
 | |
| be based on this.
 | |
| 
 | |
| As the descriptions of the functions in the last section already mention,
 | |
| separate sets of messages can be selected by a @dfn{domain name}.  This is a
 | |
| simple string which should be unique for each program part which uses a
 | |
| separate domain.  It is possible to use arbitrarily many domains in one
 | |
| program at the same time.  E.g., @theglibc{} itself uses a domain
 | |
| named @code{libc} while the program using the C Library could use a
 | |
| domain named @code{foo}.  The important point is that at any time
 | |
| exactly one domain is active.  This is controlled with the following
 | |
| function.
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} textdomain (const char *@var{domainname})
 | |
| The @code{textdomain} function sets the default domain, which is used in
 | |
| all future @code{gettext} calls, to @var{domainname}.  Please note that
 | |
| @code{dgettext} and @code{dcgettext} calls are not influenced if the
 | |
| @var{domainname} parameter of these functions is not the null pointer.
 | |
| 
 | |
| Before the first call to @code{textdomain} the default domain is
 | |
| @code{messages}.  This is the name specified in the specification of
 | |
| the @code{gettext} API.  This name is as good as any other name.  No
 | |
| program should ever really use a domain with this name since this can
 | |
| only lead to problems.
 | |
| 
 | |
| The function returns the value which is from now on taken as the default
 | |
| domain.  If the system ran out of memory the returned value is
 | |
| @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
 | |
| Despite the return value type being @code{char *} the returned string must
 | |
| not be changed.  It is allocated internally by the @code{textdomain}
 | |
| function.
 | |
| 
 | |
| If the @var{domainname} parameter is the null pointer no new default
 | |
| domain is set.  Instead the currently selected default domain is
 | |
| returned.
 | |
| 
 | |
| If the @var{domainname} parameter is the empty string the default domain
 | |
| is reset to its initial value, the domain with the name @code{messages}.
 | |
| Using this possibility is questionable since the domain @code{messages}
 | |
| really never should be used.
 | |
| @end deftypefun
 | |
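The three ways of calling @code{textdomain} described above (a domain name,
the null pointer, and the empty string) can be sketched like this.  The
helper names @code{current_domain} and @code{select_domain} are hypothetical.

```c
#include <libintl.h>
#include <stdio.h>

/* Copy the name of the current default domain into BUF (LEN bytes)
   without changing it.  Passing a null pointer to textdomain only
   queries the setting.  Hypothetical helper.  */
void
current_domain (char *buf, size_t len)
{
  snprintf (buf, len, "%s", textdomain (NULL));
}

/* Make DOMAIN the default domain.  An empty string resets the default
   to its initial value.  Returns 0 on success and -1 if textdomain
   failed, in which case errno has been set.  Hypothetical helper.  */
int
select_domain (const char *domain)
{
  return textdomain (domain) != NULL ? 0 : -1;
}
```

The returned string is copied rather than kept because it is owned by the
library and may be replaced by a later @code{textdomain} call.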
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
 | |
| The @code{bindtextdomain} function can be used to specify the directory
 | |
| which contains the message catalogs for domain @var{domainname} for the
 | |
| different languages.  To be correct, this is the directory where the
 | |
| hierarchy of directories is expected.  Details are explained below.
 | |
| 
 | |
| For the programmer it is important to note that the translations which
 | |
| come with the program have to be placed in a directory hierarchy starting
 | |
| at, say, @file{/foo/bar}.  Then the program should make a
 | |
| @code{bindtextdomain} call to bind the domain for the current program to
 | |
| this directory.  This ensures the catalogs are found.  A correctly
 | |
| running program does not depend on the user setting an environment
 | |
| variable.
 | |
| 
 | |
| The @code{bindtextdomain} function can be used several times and if the
 | |
| @var{domainname} argument is different the previously bound domains
 | |
| will not be overwritten.
 | |
| 
 | |
| If a program which uses @code{bindtextdomain} at some point
 | |
| calls the @code{chdir} function to change the current working
 | |
| directory it is important that the @var{dirname} string is an
 | |
| absolute pathname.  Otherwise the addressed directory might vary over
 | |
| time.
 | |
| 
 | |
| If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
 | |
| returns the currently selected directory for the domain with the name
 | |
| @var{domainname}.
 | |
| 
 | |
| The @code{bindtextdomain} function returns a pointer to a string
 | |
| containing the name of the selected directory.  The string is
 | |
| allocated internally in the function and must not be changed by the
 | |
| user.  If the system ran out of memory during the execution of
 | |
| @code{bindtextdomain} the return value is @code{NULL} and the global
 | |
| variable @var{errno} is set accordingly.
 | |
| @end deftypefun
 | |
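A small wrapper can enforce the advice about absolute pathnames before
calling @code{bindtextdomain}.  This is only a sketch: the helper names
@code{bind_catalog_dir} and @code{bound_catalog_dir} are hypothetical, and
rejecting relative paths is a policy chosen for the example, not something
the library does.

```c
#include <libintl.h>
#include <stddef.h>

/* Bind DOMAIN to the catalog hierarchy below DIR.  DIR must be an
   absolute pathname so that a later chdir cannot change its meaning.
   Returns 0 on success and -1 on failure.  Hypothetical helper.  */
int
bind_catalog_dir (const char *domain, const char *dir)
{
  if (dir == NULL || dir[0] != '/')
    return -1;                  /* relative paths are refused here */
  return bindtextdomain (domain, dir) != NULL ? 0 : -1;
}

/* Return the directory currently bound to DOMAIN.  The string belongs
   to the library and must not be modified.  Hypothetical helper.  */
const char *
bound_catalog_dir (const char *domain)
{
  return bindtextdomain (domain, NULL);
}
```

Binding a second domain leaves the first binding in place, so each program
part can register its own directory.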
| 
 | |
| 
 | |
| @node Advanced gettext functions
 | |
| @subsubsection Additional functions for more complicated situations
 | |
| 
 | |
| The functions of the @code{gettext} family described so far (and all the
 | |
| @code{catgets} functions as well) have one problem in the real world
 | |
| which has been neglected completely in all existing approaches.  What
 | |
| is meant here is the handling of plural forms.
 | |
| 
 | |
| Looking through Unix source code before the time anybody thought about
 | |
| internationalization (and, sadly, even afterwards) one can often find
 | |
| code similar to the following:
 | |
| 
 | |
| @smallexample
 | |
|    printf ("%d file%s deleted", n, n == 1 ? "" : "s");
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| After the first complaints from people internationalizing the code people
 | |
| either completely avoided formulations like this or used strings like
 | |
| @code{"file(s)"}.  Both look unnatural and should be avoided.  First
 | |
| tries to solve the problem correctly looked like this:
 | |
| 
 | |
| @smallexample
 | |
|    if (n == 1)
 | |
|      printf ("%d file deleted", n);
 | |
|    else
 | |
|      printf ("%d files deleted", n);
 | |
| @end smallexample
 | |
| 
 | |
| But this does not solve the problem.  It helps languages where the
 | |
| plural form of a noun is not simply constructed by adding an `s' but
 | |
| that is all.  Once again people fell into the trap of believing the
 | |
| rules their language is using are universal.  But the handling of plural
 | |
| forms differs widely between the language families.  Two things
 | |
| can differ between language families (and even within them):
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| The way plural forms are built differs.  This is a problem with
 | |
| languages which have many irregularities.  German, for instance, is a
 | |
| drastic case.  Though English and German are part of the same language
 | |
| family (Germanic), the almost regular forming of plural noun forms
 | |
| (appending an `s') is hardly found in German.
 | |
| 
 | |
| @item
 | |
| The number of plural forms differs.  This is somewhat surprising for
 | |
| those who only have experience with Romance and Germanic languages
 | |
| since here the number is the same (there are two).
 | |
| 
 | |
| But other language families have only one form or many forms.  More
 | |
| information on this in an extra section.
 | |
| @end itemize
 | |
| 
 | |
| The consequence of this is that application writers should not try to
 | |
| solve the problem in their code.  This would be localization since it is
 | |
| only usable for certain, hardcoded language environments.  Instead the
 | |
| extended @code{gettext} interface should be used.
 | |
| 
 | |
| Instead of the one key string these extra functions take two
 | |
| strings and a numerical argument.  The idea behind this is that using
 | |
| the numerical argument and the first string as a key, the implementation
 | |
| can select using rules specified by the translator the right plural
 | |
| form.  The two string arguments then will be used to provide a return
 | |
| value in case no message catalog is found (similar to the normal
 | |
| @code{gettext} behavior).  In this case the rules for Germanic languages
 | |
| are used and it is assumed that the first string argument is the singular
 | |
| form, the second the plural form.
 | |
| 
 | |
| This has the consequence that programs without language catalogs can
 | |
| display the correct strings only if the program itself is written using
 | |
| a Germanic language.  This is a limitation but since @theglibc{}
 | |
| (as well as the GNU @code{gettext} package) is written as part of the
 | |
| GNU project and the coding standards for the GNU project require programs
 | |
| to be written in English, this solution nevertheless fulfills its
 | |
| purpose.
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
 | |
| The @code{ngettext} function is similar to the @code{gettext} function
 | |
| as it finds the message catalogs in the same way.  But it takes two
 | |
| extra arguments.  The @var{msgid1} parameter must contain the singular
 | |
| form of the string to be converted.  It is also used as the key for the
 | |
| search in the catalog.  The @var{msgid2} parameter is the plural form.
 | |
| The parameter @var{n} is used to determine the plural form.  If no
 | |
| message catalog is found @var{msgid1} is returned if @code{n == 1},
 | |
| otherwise @var{msgid2}.
 | |
| 
 | |
| An example for the use of this function is:
 | |
| 
 | |
| @smallexample
 | |
|   printf (ngettext ("%d file removed", "%d files removed", n), n);
 | |
| @end smallexample
 | |
| 
 | |
| Please note that the numeric value @var{n} has to be passed to the
 | |
| @code{printf} function as well.  It is not sufficient to pass it only to
 | |
| @code{ngettext}.
 | |
| @end deftypefun
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
 | |
| The @code{dngettext} is similar to the @code{dgettext} function in the
 | |
| way the message catalog is selected.  The difference is that it takes
 | |
| two extra parameters to provide the correct plural form.  These two
 | |
| parameters are handled in the same way @code{ngettext} handles them.
 | |
| @end deftypefun
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
 | |
| The @code{dcngettext} is similar to the @code{dcgettext} function in the
 | |
| way the message catalog is selected.  The difference is that it takes
 | |
| two extra parameters to provide the correct plural form.  These two
 | |
| parameters are handled in the same way @code{ngettext} handles them.
 | |
| @end deftypefun
 | |
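The point that the number has to be passed twice, once to select the form
and once to fill in the conversion, can be sketched with a small formatting
helper.  The name @code{format_removed} is hypothetical.  Without an
installed catalog the Germanic fallback rule applies, so the singular string
is chosen only when the count is exactly one.

```c
#include <libintl.h>
#include <stdio.h>

/* Format a "files removed" message for N files into BUF (LEN bytes).
   N is passed to ngettext to choose the form and again to snprintf to
   fill in the %lu conversion.  Hypothetical helper.  */
void
format_removed (char *buf, size_t len, unsigned long int n)
{
  snprintf (buf, len,
            ngettext ("%lu file removed", "%lu files removed", n), n);
}
```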
| 
 | |
| @subsubheading The problem of plural forms
 | |
| 
 | |
| A description of the problem can be found at the beginning of the last
 | |
| section.  Now there is the question how to solve it.  Without the input
 | |
| of linguists (which was not available) it was not possible to determine
 | |
| whether there are only a few different forms in which plural forms are
 | |
| formed or whether the number can increase with every new supported
 | |
| language.
 | |
| 
 | |
| Therefore the solution implemented is to allow the translator to specify
 | |
| the rules of how to select the plural form.  Since the formula varies
 | |
| with every language this is the only viable solution except for
 | |
| hardcoding the information in the code (which still would require the
 | |
| possibility of extensions to not prevent the use of new languages).  The
 | |
| details are explained in the GNU @code{gettext} manual.  Here only a
 | |
| bit of information is provided.
 | |
| 
 | |
| The information about the plural form selection has to be stored in the
 | |
| header entry (the one with the empty @code{msgid} string).  It looks
 | |
| like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
 | |
| @end smallexample
 | |
| 
 | |
| The @code{nplurals} value must be a decimal number which specifies how
 | |
| many different plural forms exist for this language.  The string
 | |
| following @code{plural} is an expression which is using the C language
 | |
| syntax.  Exceptions are that no negative numbers are allowed, numbers
 | |
| must be decimal, and the only variable allowed is @code{n}.  This
 | |
| expression will be evaluated whenever one of the functions
 | |
| @code{ngettext}, @code{dngettext}, or @code{dcngettext} is called.  The
 | |
| numeric value passed to these functions is then substituted for all uses
 | |
| of the variable @code{n} in the expression.  The resulting value then
 | |
| must be greater than or equal to zero and smaller than the value given as the
 | |
| value of @code{nplurals}.
 | |
| 
 | |
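How the evaluated expression selects one of the translated strings can be
sketched in plain C.  This is not how the library is implemented; the type
@code{plural_rule} and the functions @code{rule_two_forms} and
@code{select_form} are hypothetical names used only to illustrate the
relation between @code{nplurals}, @code{plural}, and the resulting index.

```c
/* Hypothetical type for a Plural-Forms expression written as a C
   function: it maps the number N to the index of the form to use.  */
typedef unsigned long int (*plural_rule) (unsigned long int n);

/* Corresponds to "nplurals=2; plural=n == 1 ? 0 : 1;".  */
unsigned long int
rule_two_forms (unsigned long int n)
{
  return n == 1 ? 0 : 1;
}

/* Choose one of the NPLURALS strings in FORMS for the number N.
   A result outside the valid range falls back to the first form;
   that fallback is a choice made for this sketch.  */
const char *
select_form (plural_rule rule, const char *const forms[],
             unsigned long int nplurals, unsigned long int n)
{
  unsigned long int index = rule (n);
  if (index >= nplurals)
    index = 0;
  return forms[index];
}
```

With the German strings @code{"%d Datei"} and @code{"%d Dateien"} as forms
0 and 1, the number one selects the first string and every other number the
second.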
| @noindent
 | |
| The following rules are known at this point.  The languages are listed
 | |
| with their families.  But this does not necessarily mean the information can be
 | |
| generalized for the whole family (as can be easily seen in the table
 | |
| below).@footnote{Additions are welcome.  Send appropriate information to
 | |
| @email{bug-glibc-manual@@gnu.org}.}
 | |
| 
 | |
| @table @asis
 | |
| @item Only one form:
 | |
| Some languages only require one single form.  There is no distinction
 | |
| between the singular and plural form.  An appropriate header entry
 | |
| would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=1; plural=0;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Finno-Ugric family
 | |
| Hungarian
 | |
| @item Asian family
 | |
| Japanese, Korean
 | |
| @item Turkic/Altaic family
 | |
| Turkish
 | |
| @end table
 | |
| 
 | |
| @item Two forms, singular used for one only
 | |
| This is the form used in most existing programs since it is what English
 | |
| is using.  A header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=2; plural=n != 1;
 | |
| @end smallexample
 | |
| 
 | |
| (Note: this uses the feature of C expressions that boolean expressions
 | |
| have the value zero or one.)
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Germanic family
 | |
| Danish, Dutch, English, German, Norwegian, Swedish
 | |
| @item Finno-Ugric family
 | |
| Estonian, Finnish
 | |
| @item Latin/Greek family
 | |
| Greek
 | |
| @item Semitic family
 | |
| Hebrew
 | |
| @item Romance family
 | |
| Italian, Portuguese, Spanish
 | |
| @item Artificial
 | |
| Esperanto
 | |
| @end table
 | |
| 
 | |
| @item Two forms, singular used for zero and one
 | |
| Exceptional case in the language family.  The header entry would be:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=2; plural=n>1;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Romance family
 | |
| French, Brazilian Portuguese
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special case for zero
 | |
| The header entry would be:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Baltic family
 | |
| Latvian
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special cases for one and two
 | |
| The header entry would be:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Celtic
 | |
| Gaeilge (Irish)
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special case for numbers ending in 1[2-9]
 | |
| The header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; \
 | |
|     plural=n%10==1 && n%100!=11 ? 0 : \
 | |
|            n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Baltic family
 | |
| Lithuanian
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
 | |
| The header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; \
 | |
|     plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Slavic family
 | |
| Croatian, Czech, Russian, Ukrainian
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special cases for 1 and 2, 3, 4
 | |
| The header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; \
 | |
|     plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Slavic family
 | |
| Slovak
 | |
| @end table
 | |
| 
 | |
| @item Three forms, special case for one and some numbers ending in 2, 3, or 4
 | |
| The header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=3; \
 | |
|     plural=n==1 ? 0 : \
 | |
|            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Slavic family
 | |
| Polish
 | |
| @end table
 | |
| 
 | |
| @item Four forms, special case for one and all numbers ending in 02, 03, or 04
 | |
| The header entry would look like this:
 | |
| 
 | |
| @smallexample
 | |
| Plural-Forms: nplurals=4; \
 | |
|     plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
 | |
| @end smallexample
 | |
| 
 | |
| @noindent
 | |
| Languages with this property include:
 | |
| 
 | |
| @table @asis
 | |
| @item Slavic family
 | |
| Slovenian
 | |
| @end table
 | |
| @end table
 | |
| 
 | |
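The @code{plural=} expressions in the header entries above use C syntax, so they can be tried out directly as C functions.  The following is a minimal sketch for the Slavic three-form entry above; the names @code{plural_index_slavic} and @code{select_plural} are hypothetical helpers chosen for this illustration and are not part of @theglibc{}.

```c
#include <stddef.h>

/* The plural= expression of the Slavic three-form header entry,
   copied into a C function.  Returns the index of the form to use.  */
unsigned long int
plural_index_slavic (unsigned long int n)
{
  return n % 100 / 10 == 1 ? 2 : n % 10 == 1 ? 0 : (n + 9) % 10 > 3 ? 2 : 1;
}

/* Hypothetical helper: pick the string for N out of FORMS, which
   holds NPLURALS entries.  An out-of-range index falls back to the
   first form; that fallback is an assumption of this sketch.  */
const char *
select_plural (const char *const forms[], unsigned long int nplurals,
               unsigned long int n)
{
  unsigned long int idx = plural_index_slavic (n);
  return idx < nplurals ? forms[idx] : forms[0];
}
```

With this expression the counts 1 and 21 select form 0, the counts 2 and 22 select form 1, and the counts 0, 5, 11, and 112 select form 2.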
| 
 | |
| @node Charset conversion in gettext
 | |
| @subsubsection How to specify the output character set @code{gettext} uses
 | |
| 
 | |
| @code{gettext} not only looks up a translation in a message catalog.  It
 | |
| also converts the translation on the fly to the desired output character
 | |
| set.  This is useful if the user is working in a different character set
 | |
| than the translator who created the message catalog, because it avoids
 | |
| distributing variants of message catalogs which differ only in the
 | |
| character set.
 | |
| 
 | |
| The output character set is, by default, the value of @code{nl_langinfo
 | |
| (CODESET)}, which depends on the @code{LC_CTYPE} part of the current
 | |
| locale.  But programs which store strings in a locale independent way
 | |
| (e.g. UTF-8) can request that @code{gettext} and related functions
 | |
| return the translations in that encoding, by use of the
 | |
| @code{bind_textdomain_codeset} function.
 | |
| 
 | |
| Note that the @var{msgid} argument to @code{gettext} is not subject to
 | |
| character set conversion.  Also, when @code{gettext} does not find a
 | |
| translation for @var{msgid}, it returns @var{msgid} unchanged --
 | |
| independently of the current output character set.  It is therefore
 | |
| recommended that all @var{msgid}s be US-ASCII strings.
 | |
| 
 | |
| @comment libintl.h
 | |
| @comment GNU
 | |
| @deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset})
 | |
| The @code{bind_textdomain_codeset} function can be used to specify the
 | |
| output character set for message catalogs for domain @var{domainname}.
 | |
| The @var{codeset} argument must be a valid codeset name which can be used
 | |
| for the @code{iconv_open} function, or a null pointer.
 | |
| 
 | |
| If the @var{codeset} parameter is the null pointer,
 | |
| @code{bind_textdomain_codeset} returns the currently selected codeset
 | |
| for the domain with the name @var{domainname}.  It returns @code{NULL} if
 | |
| no codeset has yet been selected.
 | |
| 
 | |
| The @code{bind_textdomain_codeset} function can be used several times.
 | |
| If used multiple times with the same @var{domainname} argument, the
 | |
| later call overrides the settings made by the earlier one.
 | |
| 
 | |
| The @code{bind_textdomain_codeset} function returns a pointer to a
 | |
| string containing the name of the selected codeset.  The string is
 | |
| allocated internally in the function and must not be changed by the
 | |
| user.  If the system ran out of memory during the execution of
 | |
| @code{bind_textdomain_codeset}, the return value is @code{NULL} and the
 | |
| global variable @code{errno} is set accordingly.
 | |
| @end deftypefun
 | |
| 
 | |
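As a minimal sketch of the two uses described above, the following function first requests UTF-8 output for a domain and then queries the setting by passing a null pointer as @var{codeset}.  The function name @code{request_utf8_messages} is a hypothetical helper, and the choice of @code{"UTF-8"} is only an example.

```c
#include <libintl.h>
#include <stddef.h>

/* Ask for the translations of DOMAINNAME to be returned in UTF-8.
   Returns the codeset now recorded for the domain, or NULL if the
   request failed.  The returned string must not be modified.  */
const char *
request_utf8_messages (const char *domainname)
{
  if (bind_textdomain_codeset (domainname, "UTF-8") == NULL)
    return NULL;
  /* A null CODESET argument only queries the current selection.  */
  return bind_textdomain_codeset (domainname, NULL);
}
```

A program that stores all its strings in UTF-8 would typically make such a call once for each of its domains, right after the corresponding @code{bindtextdomain} call.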
| 
 | |
| @node GUI program problems
 | |
| @subsubsection How to use @code{gettext} in GUI programs
 | |
| 
 | |
| One place where the @code{gettext} functions, if used normally, have big
 | |
| problems is within programs with graphical user interfaces (GUIs).  The
 | |
| problem is that many of the strings which have to be translated are very
 | |
| short.  They have to appear in pull-down menus which restricts the
 | |
| length.  But strings which do not contain entire sentences or at
 | |
| least large fragments of a sentence may appear in more than one
 | |
| situation in the program but might have different translations.  This is
 | |
| especially true for the one-word strings which are frequently used in
 | |
| GUI programs.
 | |
| 
 | |
| As a consequence many people say that the @code{gettext} approach is
 | |
| wrong and instead @code{catgets} should be used which indeed does not
 | |
| have this problem.  But there is a very simple and powerful method to
 | |
| handle this kind of problem with the @code{gettext} functions.
 | |
| 
 | |
| @noindent
 | |
| As an example consider the following fictional situation.  A GUI program
 | |
| has a menu bar with the following entries:
 | |
| 
 | |
| @smallexample
 | |
| +------------+------------+--------------------------------------+
 | |
| | File       | Printer    |                                      |
 | |
| +------------+------------+--------------------------------------+
 | |
| | Open     | | Select   |
 | |
| | New      | | Open     |
 | |
| +----------+ | Connect  |
 | |
|              +----------+
 | |
| @end smallexample
 | |
| 
 | |
| To have the strings @code{File}, @code{Printer}, @code{Open},
 | |
| @code{New}, @code{Select}, and @code{Connect} translated there has to be
 | |
| at some point in the code a call to a function of the @code{gettext}
 | |
| family.  But in two places the string passed into the function would be
 | |
| @code{Open}.  The translations might not be the same and therefore we
 | |
| are in the dilemma described above.
 | |
| 
 | |
| One solution to this problem is to artificially lengthen the strings
 | |
| to make them unambiguous.  But what would the program do if no
 | |
| translation is available?  The lengthened string is not what should be
 | |
| printed.  So we should use a slightly modified version of the functions.
 | |
| 
 | |
| To lengthen the strings a uniform method should be used.  E.g., in the
 | |
| example above the strings could be chosen as
 | |
| 
 | |
| @smallexample
 | |
| Menu|File
 | |
| Menu|Printer
 | |
| Menu|File|Open
 | |
| Menu|File|New
 | |
| Menu|Printer|Select
 | |
| Menu|Printer|Open
 | |
| Menu|Printer|Connect
 | |
| @end smallexample
 | |
| 
 | |
| Now all the strings are different and if now instead of @code{gettext}
 | |
| the following little wrapper function is used, everything works just
 | |
| fine:
 | |
| 
 | |
| @cindex sgettext
 | |
| @smallexample
 | |
|   char *
 | |
|   sgettext (const char *msgid)
 | |
|   @{
 | |
|     char *msgval = gettext (msgid);
 | |
|     if (msgval == msgid)
 | |
|       msgval = strrchr (msgid, '|') + 1;
 | |
|     return msgval;
 | |
|   @}
 | |
| @end smallexample
 | |
| 
 | |
| What this little function does is to recognize the case when no
 | |
| translation is available.  This can be done very efficiently by a
 | |
| pointer comparison since the return value is the input value.  If there
 | |
| is no translation we know that the input string is in the format we used
 | |
| for the Menu entries and therefore contains a @code{|} character.  We
 | |
| simply search for the last occurrence of this character and return a
 | |
| pointer to the character following it.  That's it!
 | |
| 
 | |
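The wrapper above assumes that every untranslated @var{msgid} contains a @code{|} character.  The following variant is a sketch that drops this assumption and keeps the result @code{const}; the name @code{sgettext_checked} is hypothetical and chosen only for this illustration.

```c
#include <libintl.h>
#include <string.h>

/* Like the sgettext wrapper, but MSGID may lack a '|' separator.  */
const char *
sgettext_checked (const char *msgid)
{
  const char *msgval = gettext (msgid);
  if (msgval == msgid)
    {
      /* No translation was found: strip the context prefix, if any.  */
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = sep + 1;
    }
  return msgval;
}
```

When no catalog is available, @code{sgettext_checked ("Menu|File|Open")} yields @code{Open}, and a plain string such as @code{Open} is returned unchanged.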
| If one now consistently uses the lengthened string form and replaces
 | |
| the @code{gettext} calls with calls to @code{sgettext} (this is normally
 | |
| limited to very few places in the GUI implementation) then it is
 | |
| possible to produce a program which can be internationalized.
 | |
| 
 | |
| With advanced compilers (such as GNU C) one can write the
 | |
| @code{sgettext} function as an inline function or as a macro like this:
 | |
| 
 | |
| @cindex sgettext
 | |
| @smallexample
 | |
| #define sgettext(msgid) \
 | |
|   (@{ const char *__msgid = (msgid);            \
 | |
|      char *__msgval = gettext (__msgid);       \
 | |
|      if (__msgval == __msgid)                  \
 | |
|        __msgval = strrchr (__msgid, '|') + 1;  \
 | |
|      __msgval; @})
 | |
| @end smallexample
 | |
| 
 | |
| The other @code{gettext} functions (@code{dgettext}, @code{dcgettext}
 | |
| and the @code{ngettext} equivalents) can and should have corresponding
 | |
| functions as well which look almost identical, except for the parameters
 | |
| and the call to the underlying function.
 | |
| 
 | |
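As a sketch of what such corresponding functions could look like, here are counterparts for @code{dgettext} and @code{ngettext}.  The names @code{sdgettext}, @code{sngettext}, and @code{strip_context} are hypothetical; only the underlying @code{dgettext} and @code{ngettext} calls come from @theglibc{}.

```c
#include <libintl.h>
#include <string.h>

/* Return the part of MSGID after the last '|', or MSGID itself if it
   contains no separator.  */
static const char *
strip_context (const char *msgid)
{
  const char *sep = strrchr (msgid, '|');
  return sep != NULL ? sep + 1 : msgid;
}

/* Counterpart of the wrapper for dgettext.  */
const char *
sdgettext (const char *domainname, const char *msgid)
{
  const char *msgval = dgettext (domainname, msgid);
  return msgval == msgid ? strip_context (msgid) : msgval;
}

/* Counterpart of the wrapper for ngettext.  Without a translation,
   ngettext returns one of its two string arguments unchanged.  */
const char *
sngettext (const char *msgid1, const char *msgid2, unsigned long int n)
{
  const char *msgval = ngettext (msgid1, msgid2, n);
  if (msgval == msgid1 || msgval == msgid2)
    msgval = strip_context (msgval);
  return msgval;
}
```

Both functions differ from the single-argument wrapper only in their parameters and in the underlying call, as described above.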
| Now there is of course the question why such functions do not exist in
 | |
| @theglibc{}?  There are two parts of the answer to this question.
 | |
| 
 | |
| @itemize @bullet
 | |
| @item
 | |
| They are easy to write and therefore can be provided by the project they
 | |
| are used in.  This is not an answer by itself and must be seen together
 | |
| with the second part which is:
 | |
| 
 | |
| @item
 | |
| There is no way the C library can contain a version which can work
 | |
| everywhere.  The problem is the selection of the character to separate
 | |
| the prefix from the actual string in the lengthened string.  The
 | |
| examples above used @code{|} which is a quite good choice because it
 | |
| resembles a notation frequently used in this context and it also is a
 | |
| character not often used in message strings.
 | |
| 
 | |
| But what if the character is used in message strings?  Or what if the
 | |
| chosen character is not available in the character set of the machine one
 | |
| compiles on?  (E.g., @code{|} is not required to exist for @w{ISO C}; this is
 | |
| why the @file{iso646.h} file exists in @w{ISO C} programming environments.)
 | |
| @end itemize
 | |
| 
 | |
| There is only one more comment left to make.  The wrapper functions above
 | |
| require that the translation strings are not lengthened themselves.
 | |
| This is only logical.  There is no need to disambiguate the strings
 | |
| (since they are never used as keys for a search) and one also saves
 | |
| quite some memory and disk space by doing this.
 | |
| 
 | |
| 
 | |
| @node Using gettextized software
 | |
| @subsubsection User influence on @code{gettext}
 | |
| 
 | |
| The last sections described what the programmer can do to
 | |
| internationalize the messages of the program.  But it is finally up to
 | |
| the user to select the messages s/he wants to see.  S/He must understand
 | |
| them.
 | |
| 
 | |
| The POSIX locale model uses the environment variables @code{LC_COLLATE},
 | |
| @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
 | |
| and @code{LC_TIME} to select the locale which is to be used.  This way
 | |
| the user can influence lots of functions.  As we mentioned above the
 | |
| @code{gettext} functions also take advantage of this.
 | |
| 
 | |
| To understand how this happens it is necessary to take a look at the
 | |
| various components of the filename which gets computed to locate a
 | |
| message catalog.  It is composed as follows:
 | |
| 
 | |
| @smallexample
 | |
| @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
 | |
| @end smallexample
 | |
| 
 | |
| The default value for @var{dir_name} is system specific.  It is computed
 | |
| from the value given as the prefix while configuring the C library.
 | |
| This value normally is @file{/usr} or @file{/}.  For the former the
 | |
| complete @var{dir_name} is:
 | |
| 
 | |
| @smallexample
 | |
| /usr/share/locale
 | |
| @end smallexample
 | |
| 
 | |
| We can use @file{/usr/share} since the @file{.mo} files containing the
 | |
| message catalogs are system independent, so all systems can use the same
 | |
| files.  If the program executed the @code{bindtextdomain} function for
 | |
| the message domain that is currently handled, the @code{dir_name}
 | |
| component is exactly the value which was given to the function as
 | |
| the second parameter.  I.e., @code{bindtextdomain} allows overwriting
 | |
| the only system dependent and fixed value to make it possible to
 | |
| address files anywhere in the filesystem.
 | |
| 
 | |
| The @var{category} is the name of the locale category which was selected
 | |
| in the program code.  For @code{gettext} and @code{dgettext} this is
 | |
| always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
 | |
| value of the third parameter.  As said above, using a category other
 | |
| than @code{LC_MESSAGES} should be avoided.
 | |
| 
 | |
| The @var{locale} component is computed based on the category used.  Just
 | |
| like for the @code{setlocale} function, the user selection comes into
 | |
| play here.  Some environment variables are examined in a fixed order
 | |
| and the first environment variable set determines the return value of
 | |
| the lookup process.  In detail, for the category @code{LC_xxx} the
 | |
| following variables in this order are examined:
 | |
| 
 | |
| @table @code
 | |
| @item LANGUAGE
 | |
| @item LC_ALL
 | |
| @item LC_xxx
 | |
| @item LANG
 | |
| @end table
 | |
| 
 | |
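The lookup order in the table above can be sketched as a small function.  This is a simplified illustration, not the code used inside @code{gettext}: the name @code{first_locale_setting} is hypothetical, treating an empty value as unset is an assumption of the sketch, and the real lookup follows further rules that are not modelled here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the value of the first variable, in the order LANGUAGE,
   LC_ALL, CATEGORY_VAR (e.g. "LC_MESSAGES"), LANG, that is set to a
   non-empty string.  Returns NULL if none of them is.  */
const char *
first_locale_setting (const char *category_var)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_var, "LANG" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
      const char *value = getenv (names[i]);
      /* Assumption of this sketch: an empty value counts as unset.  */
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}
```

For example, with @code{LANG} set to @code{fr_FR} and the other variables unset the function returns @code{fr_FR}; once @code{LANGUAGE} is set as well, its value takes precedence.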
| This looks very familiar.  With the exception of the @code{LANGUAGE}
 | |
| environment variable this is exactly the lookup order the
 | |
| @code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
 | |
| variable?
 | |
| 
 | |
| The reason is that the syntax of the values these variables can have is
 | |
| different from what is expected by the @code{setlocale} function.  If we
 | |
| set @code{LC_ALL} to a value following the extended syntax, the
 | |
| @code{setlocale} function would no longer be able to use the value of
 | |
| this variable.  An additional variable removes this
 | |
| problem plus we can select the language independently of the locale
 | |
| setting which sometimes is useful.
 | |
| 
 | |
| While for the @code{LC_xxx} variables the value should consist of
 | |
| exactly one specification of a locale the @code{LANGUAGE} variable's
 | |
| value can consist of a colon separated list of locale names.  The
 | |
| attentive reader will realize that this is the way we manage to
 | |
| implement one of our additional demands above: we want to be able to
 | |
| specify an ordered list of languages.
 | |
| 
 | |
| Back to the constructed filename we have only one component missing.
 | |
| The @var{domain_name} part is the name which was either registered using
 | |
| the @code{textdomain} function or which was given to @code{dgettext} or
 | |
| @code{dcgettext} as the first parameter.  Now it becomes obvious that a
 | |
| good choice for the domain name in the program code is a string which is
 | |
| closely related to the program/package name.  E.g., for @theglibc{}
 | |
| the domain name is @code{libc}.
 | |
| 
 | |
| @noindent
 | |
| A little piece of example code should show how the programmer is supposed
 | |
| to work:
 | |
| 
 | |
| @smallexample
 | |
| @{
 | |
|   setlocale (LC_ALL, "");
 | |
|   textdomain ("test-package");
 | |
|   bindtextdomain ("test-package", "/usr/local/share/locale");
 | |
|   puts (gettext ("Hello, world!"));
 | |
| @}
 | |
| @end smallexample
 | |
| 
 | |
| At the program start the default domain is @code{messages}, and the
 | |
| default locale is "C".  The @code{setlocale} call sets the locale
 | |
| according to the user's environment variables; remember that correct
 | |
| functioning of @code{gettext} relies on the correct setting of the
 | |
| @code{LC_MESSAGES} locale (for looking up the message catalog) and
 | |
| of the @code{LC_CTYPE} locale (for the character set conversion).
 | |
| The @code{textdomain} call changes the default domain to
 | |
| @code{test-package}.  The @code{bindtextdomain} call specifies that
 | |
| the message catalogs for the domain @code{test-package} can be found
 | |
| below the directory @file{/usr/local/share/locale}.
 | |
| 
 | |
| If the user now sets in her/his environment the variable @code{LANGUAGE}
 | |
| to @code{de} the @code{gettext} function will try to use the
 | |
| translations from the file
 | |
| 
 | |
| @smallexample
 | |
| /usr/local/share/locale/de/LC_MESSAGES/test-package.mo
 | |
| @end smallexample
 | |
| 
 | |
| From the above descriptions it should be clear which component of this
 | |
| filename is determined by which source.
 | |
| 
 | |
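To make the composition of the filename concrete, the following sketch assembles the four components with @code{snprintf}.  The function @code{catalog_path} and its parameter names are hypothetical; the lookup code in @theglibc{} is organized differently and also tries the less specific locale names.

```c
#include <stdio.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo into BUF.  CATEGORY
   is the full category name, such as "LC_MESSAGES".  Returns 0 on
   success and -1 if the result does not fit into SIZE bytes.  */
int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  int len = snprintf (buf, size, "%s/%s/%s/%s.mo",
                      dir_name, locale, category, domain_name);
  return len >= 0 && (size_t) len < size ? 0 : -1;
}
```

Called with @file{/usr/local/share/locale}, @code{de}, @code{LC_MESSAGES}, and @code{test-package}, the function produces the filename shown above.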
| In the above example we assumed that the @code{LANGUAGE} environment
 | |
| variable is set to @code{de}.  This might be an appropriate selection but what
 | |
| happens if the user wants to use @code{LC_ALL} because of the wider
 | |
| usability and here the required value is @code{de_DE.ISO-8859-1}?  We
 | |
| already mentioned above that a situation like this is not infrequent.
 | |
| E.g., a person might prefer reading a dialect and if this is not
 | |
| available fall back on the standard language.
 | |
| 
 | |
| The @code{gettext} functions know about situations like this and can
 | |
| handle them gracefully.  The functions recognize the format of the value
 | |
| of the environment variable.  They can split the value into different pieces
 | |
| and, by leaving out one or the other part, they can construct new
 | |
| values.  This happens of course in a predictable way.  To understand
 | |
| this one must know the format of the environment variable value.  There
 | |
| is one more or less standardized form, originally from the X/Open
 | |
| specification:
 | |
| 
 | |
| @code{language[_territory[.codeset]][@@modifier]}
 | |
| 
 | |
| To form less specific locale names, parts are stripped off in the order of the
 | |
| following list:
 | |
| 
 | |
| @enumerate
 | |
| @item
 | |
| @code{codeset}
 | |
| @item
 | |
| @code{normalized codeset}
 | |
| @item
 | |
| @code{territory}
 | |
| @item
 | |
| @code{modifier}
 | |
| @end enumerate
 | |
| 
 | |
| The @code{language} field will never be dropped for obvious reasons.
 | |
| 
 | |
| The only new thing is the @code{normalized codeset} entry.  This is
 | |
| another goodie which is introduced to help reduce the chaos which
 | |
| derives from the inability of the people to standardize the names of
 | |
| character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
 | |
| @w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
 | |
| codeset} value is generated from the user-provided character set name by
 | |
| applying the following rules:
 | |
| 
 | |
| @enumerate
 | |
| @item
 | |
| Remove all characters besides digits and letters.
 | |
| @item
 | |
| Fold letters to lowercase.
 | |
| @item
 | |
| If the result contains only digits, prepend the string @code{"iso"}.
 | |
| @end enumerate
 | |
| 
 | |
| @noindent
 | |
| So all of the above names will be normalized to @code{iso88591}.  This
 | |
| gives the program user much more freedom in choosing the locale name.
 | |
| 
 | |
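The three normalization rules can be sketched in a few lines of C.  The function name @code{normalize_codeset} is chosen for this illustration and the code is not the implementation used by @theglibc{}; it also relies on the character classification of the current locale, which is an assumption that only holds for plain ASCII names.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated normalized form of CODESET, or NULL if
   memory is exhausted.  The caller frees the result.  */
char *
normalize_codeset (const char *codeset)
{
  size_t len = strlen (codeset);
  /* Room for an "iso" prefix, the name, and the terminating null.  */
  char *result = malloc (len + 4);
  if (result == NULL)
    return NULL;

  size_t n = 0;
  int only_digits = 1;
  for (size_t i = 0; i < len; ++i)
    {
      unsigned char c = (unsigned char) codeset[i];
      if (!isalnum (c))
        continue;                         /* Rule 1: drop the rest.  */
      if (!isdigit (c))
        only_digits = 0;
      result[n++] = (char) tolower (c);   /* Rule 2: fold to lowercase.  */
    }
  result[n] = '\0';

  if (only_digits)
    {
      /* Rule 3: shift the digits right and prepend "iso".  */
      memmove (result + 3, result, n + 1);
      memcpy (result, "iso", 3);
    }
  return result;
}
```

Applied to @w{ISO-8859-1}, @w{8859-1}, @w{88591}, @w{iso8859-1}, and @w{iso_8859-1}, the function returns @code{iso88591} in every case.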
| Even this extended functionality still does not help to solve the
 | |
| problem that completely different names can be used to denote the same
 | |
| locale (e.g., @code{de} and @code{german}).  To be of help in this
 | |
| situation the locale implementation and also the @code{gettext}
 | |
| functions know about aliases.
 | |
| 
 | |
| The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
 | |
| whatever prefix you used for configuring the C library) contains a
 | |
| mapping of alternative names to more regular names.  The system manager
 | |
| is free to add new entries to fill her/his own needs.  The selected
 | |
| locale from the environment is compared with the entries in the first
 | |
| column of this file ignoring the case.  If they match the value of the
 | |
| second column is used instead for the further handling.
 | |
| 
 | |
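The alias lookup described above amounts to a case-insensitive search in a two-column table.  The sketch below uses a small in-memory table instead of reading @file{locale.alias}; the entries, the @code{struct locale_alias} type, and the function @code{expand_alias} are illustrative assumptions, not the data or code of @theglibc{}.

```c
#include <stddef.h>
#include <strings.h>

struct locale_alias
{
  const char *alias;   /* First column: the alternative name.  */
  const char *value;   /* Second column: the name used instead.  */
};

/* Illustrative entries only; the real list is kept in locale.alias.  */
static const struct locale_alias alias_table[] =
{
  { "german", "de_DE.ISO-8859-1" },
  { "french", "fr_FR.ISO-8859-1" },
};

/* Return the replacement for NAME if it matches an alias, ignoring
   case.  Otherwise return NAME itself.  */
const char *
expand_alias (const char *name)
{
  for (size_t i = 0; i < sizeof alias_table / sizeof alias_table[0]; ++i)
    if (strcasecmp (name, alias_table[i].alias) == 0)
      return alias_table[i].value;
  return name;
}
```

With this table @code{German} and @code{german} both expand to the same value, while a name without an alias, such as @code{de}, is passed through unchanged.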
| In the description of the format of the environment variables we already
 | |
| mentioned the character set as a factor in the selection of the message
 | |
| catalog.  In fact, only catalogs which contain text written using the
 | |
| character set of the system/program can be used (directly; there will
 | |
| come a solution for this some day).  This means for the user that s/he
 | |
| will always have to take care of this.  If in the collection of the
 | |
| message catalogs there are files for the same language but coded using
 | |
| different character sets the user has to be careful.
 | |
| 
 | |
| 
 | |
| @node Helper programs for gettext
 | |
| @subsection Programs to handle message catalogs for @code{gettext}
 | |
| 
 | |
| @Theglibc{} does not contain the source code for the programs to
 | |
| handle message catalogs for the @code{gettext} functions.  As part of
 | |
| the GNU project the GNU gettext package contains everything the
 | |
| developer needs.  The functionality provided by the tools in this
 | |
| package by far exceeds the abilities of the @code{gencat} program
 | |
| described above for the @code{catgets} functions.
 | |
| 
 | |
| There is a program @code{msgfmt} which is the equivalent program to the
 | |
| @code{gencat} program.  It generates from the human-readable and
 | |
| -editable form of the message catalog a binary file which can be used by
 | |
| the @code{gettext} functions.  But there are several more programs
 | |
| available.
 | |
| 
 | |
| The @code{xgettext} program can be used to automatically extract the
 | |
| translatable messages from a source file.  I.e., the programmer need not
 | |
| take care of the translations and the list of messages which have to be
 | |
| translated.  S/He will simply wrap the translatable string in calls to
 | |
| @code{gettext} et al.@: and the rest will be done by @code{xgettext}.  This
 | |
| program has a lot of options which help to customize the output or
 | |
| help to understand the input better.
 | |
| 
 | |
| Other programs help to manage the development cycle when new messages appear
 | |
| in the source files or when a new translation of the messages appears.
 | |
| Here it should only be noted that using all the tools in GNU gettext it
 | |
| is possible to @emph{completely} automate the handling of message
 | |
| catalogs.  Besides marking the translatable strings in the source code and
 | |
| generating the translations the developers do not have anything to do
 | |
| themselves.
 |