Editing.

author: Ulrich Drepper <drepper@redhat.com> 2001-11-05 08:04:39 +0000
committer: Ulrich Drepper <drepper@redhat.com> 2001-11-05 08:04:39 +0000
commit: 91f07167e37541706554e4117c32aae1bd436cc9 (patch)
tree: 05ece0714b396155a8e923f8f226ec8edafe7757
parent: 50d274e5a66e4baed5fc0ade52650970a1728798 (diff)
download: glibc-91f07167e37541706554e4117c32aae1bd436cc9.tar.xz
glibc-91f07167e37541706554e4117c32aae1bd436cc9.zip
1 files changed, 2892 insertions, 2892 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
index bb9cc64b8d..b7b2f734a8 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -1,2892 +1,2892 @@
-@node Character Set Handling, Locales, String and Array Utilities, Top
-@c %MENU% Support for extended character sets
-@chapter Character Set Handling
-
-@ifnottex
-@macro cal{text}
-\text\
-@end macro
-@end ifnottex
-
-Character sets used in the early days of computing had only six, seven,
-or eight bits for each character: there was never a case where more than
-eight bits (one byte) were used to represent a single character.  The
-limitations of this approach became more apparent as more people
-grappled with non-Roman character sets, where not all the characters
-that make up a language's character set can be represented by @math{2^8}
-choices.  This chapter shows the functionality which was added to the C
-library to support multiple character sets.
-
-@menu
-* Extended Char Intro::              Introduction to Extended Characters.
-* Charset Function Overview::        Overview about Character Handling
-                                      Functions.
-* Restartable multibyte conversion:: Restartable multibyte conversion
-                                      Functions.
-* Non-reentrant Conversion::         Non-reentrant Conversion Function.
-* Generic Charset Conversion::       Generic Charset Conversion.
-@end menu
-
-
-@node Extended Char Intro
-@section Introduction to Extended Characters
-
-A variety of solutions to overcome the differences between
-character sets with a 1:1 relation between bytes and characters and
-character sets with ratios of 2:1 or 4:1 exist. The remainder of this
-section gives a few examples to help understand the design decisions
-made while developing the functionality of the @w{C library}.
-
-@cindex internal representation
-A distinction we have to make right away is between internal and
-external representation.  @dfn{Internal representation} means the
-representation used by a program while keeping the text in memory.
-External representations are used when text is stored or transmitted
-through whatever communication channel.  Examples of external
-representations include files lying in a directory that are going to be
-read and parsed.
-
-Traditionally there has been no difference between the two representations.
-It was equally comfortable and useful to use the same single-byte
-representation internally and externally.  This changes with more and
-larger character sets.
-
-One of the problems to overcome with the internal representation is
-handling text that is externally encoded using different character
-sets.  Assume a program which reads two texts and compares them using
-some metric.  The comparison can be usefully done only if the texts are
-internally kept in a common format.
-
-@cindex wide character
-For such a common format (@math{=} character set) eight bits are certainly
-no longer enough.  So the smallest entity will have to grow: @dfn{wide
-characters} will now be used.  Instead of one byte, two or four will
-be used.  (Three are not good to address in memory and more
-than four bytes seem not to be necessary).
-
-@cindex Unicode
-@cindex ISO 10646
-As shown in some other part of this manual,
-@c !!! Ahem, wide char string functions are not yet covered -- drepper
-there exists a completely new family of functions which can handle texts
-of this kind in memory.  The most commonly used character sets for such
-internal wide character representations are Unicode and @w{ISO 10646}
-(also known as UCS for Universal Character Set). Unicode was originally
-planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
-be a 31-bit large code space. The two standards are practically identical.
-They have the same character repertoire and code table, but Unicode specifies
-added semantics.  At the moment, only characters in the first @code{0x10000}
-code positions (the so-called Basic Multilingual Plane, BMP) have been
-assigned, but the assignment of more specialized characters outside this
-16-bit space is already in progress. A number of encodings have been
-defined for Unicode and @w{ISO 10646} characters:
-@cindex UCS-2
-@cindex UCS-4
-@cindex UTF-8
-@cindex UTF-16
-UCS-2 is a 16-bit word that can only represent characters
-from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
-and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
-ASCII characters are represented by ASCII bytes and non-ASCII characters
-by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
-of UCS-2 in which pairs of certain UCS-2 words can be used to encode
-non-BMP characters up to @code{0x10ffff}.
-
-To represent wide characters the @code{char} type is not suitable.  For
-this reason the @w{ISO C} standard introduces a new type which is
-designed to keep one character of a wide character string.  To maintain
-the similarity there is also a type corresponding to @code{int} for
-those functions which take a single wide character.
-
-@comment stddef.h
-@comment ISO
-@deftp {Data type} wchar_t
-This data type is used as the base type for wide character strings.
-I.e., arrays of objects of this type are the equivalent of @code{char[]}
-for multibyte character strings.  The type is defined in @file{stddef.h}.
-
-The @w{ISO C90} standard, where this type was introduced, does not say
-anything specific about the representation.  It only requires that this
-type is capable of storing all elements of the basic character set.
-Therefore it would be legitimate to define @code{wchar_t} as
-@code{char}.  This might make sense for embedded systems.
-
-But for GNU systems this type is always 32 bits wide.  It is therefore
-capable of representing all UCS-4 values and  therefore covering all of
-@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type and
-thereby follow Unicode very strictly.  This is perfectly fine with the
-standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
-fact a multi-wide-character encoding.  But this contradicts the purpose
-of the @code{wchar_t} type.
-@end deftp
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} wint_t
-@code{wint_t} is a data type used for parameters and variables which
-contain a single wide character.  As the name already suggests it is the
-equivalent to @code{int} when using the normal @code{char} strings.  The
-types @code{wchar_t} and @code{wint_t} often have the same
-representation if their size is 32 bits, but if @code{wchar_t} is
-defined as @code{char} the type @code{wint_t} must be defined as
-@code{int} due to the parameter promotion.
-
-@pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-As for the @code{char} data type, there also exist macros
-specifying the minimum and maximum value representable in an object of
-type @code{wchar_t}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MIN
-The macro @code{WCHAR_MIN} evaluates to the minimum value representable
-by an object of type @code{wchar_t}.
-
-This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MAX
-The macro @code{WCHAR_MAX} evaluates to the maximum value representable
-by an object of type @code{wchar_t}.
-
-This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
-
-Another special wide character value is the equivalent to @code{EOF}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WEOF
-The macro @code{WEOF} evaluates to a constant expression of type
-@code{wint_t} whose value is different from any member of the extended
-character set.
-
-@code{WEOF} need not be the same value as @code{EOF} and unlike
-@code{EOF} it also need @emph{not} be negative.  I.e., sloppy code like
-
-@smallexample
-@{
-  int c;
-  ...
-  while ((c = getc (fp)) >= 0)
-    ...
-@}
-@end smallexample
-
-@noindent
-has to be rewritten to explicitly use @code{WEOF} when wide characters
-are used.
-
-@smallexample
-@{
-  wint_t c;
-  ...
-  while ((c = getwc (fp)) != WEOF)
-    ...
-@}
-@end smallexample
-
-@pindex wchar.h
-This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
-defined in @file{wchar.h}.
-@end deftypevr
-
-
-These internal representations present problems when it comes to storage
-and transmittal.  Because a single wide character consists of more
-than one byte, they are affected by byte-ordering.  I.e., machines with
-different endiannesses would see different values when accessing the same
-data.  This also applies to communication protocols, which are all
-byte-based, so the sender has to decide about splitting the wide
-character in bytes.  A last (but not least important) point is that wide
-characters often require more storage space than a customized byte
-oriented character set.
-
-@cindex multibyte character
-@cindex EBCDIC
-   For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS-2 or UCS-4.
-The external encoding is byte-based and can be chosen appropriately for
-the environment and for the texts to be handled.  There exist a variety
-of different character sets which can be used for this external
-encoding.  They will not be presented exhaustively here; instead, a
-description of the major groups will suffice.  All of
-the ASCII-based character sets fulfill one requirement: they are
-"filesystem safe".  This means that the character @code{'/'} is used in
-the encoding @emph{only} to represent itself.  Things are a bit
-different for character sets like EBCDIC (Extended Binary Coded Decimal
-Interchange Code, a character set family used by IBM) but if the
-operating system does not understand EBCDIC directly the parameters to
-system calls have to be converted first anyhow.
-
-@itemize @bullet
-@item
-The simplest character sets are single-byte character sets.  There can
-be only up to 256 characters (for @w{8 bit} character sets) which is not
-sufficient to cover all languages but might be sufficient to handle a
-specific text.  Handling of @w{8 bit} character sets is simple.  This is
-not true for the other kinds presented later and therefore the
-application one uses might require the use of @w{8 bit} character sets.
-
-@cindex ISO 2022
-@item
-The @w{ISO 2022} standard defines a mechanism for extended character
-sets where one character @emph{can} be represented by more than one
-byte.  This is achieved by associating a state with the text.  Embedded
-in the text can be characters which can be used to change the state.
-Each byte in the text might have a different interpretation in each
-state.  The state might even influence whether a given byte stands for a
-character on its own or whether it has to be combined with some more
-bytes.
-
-@cindex EUC
-@cindex Shift_JIS
-@cindex SJIS
-In most uses of @w{ISO 2022} the defined character sets do not allow
-state changes which cover more than the next character.  This has the
-big advantage that whenever one can identify the beginning of the byte
-sequence of a character one can interpret a text correctly.  Examples of
-character sets using this policy are the various EUC character sets
-(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or Shift_JIS (SJIS, a Japanese encoding).
-
-But there are also character sets using a state which is valid for more
-than one character and has to be changed by another byte sequence.
-Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
-
-@item
-@cindex ISO 6937
-Early attempts to fix 8 bit character sets for other languages using the
-Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
-representing characters like the acute accent do not produce output
-themselves: one has to combine them with other characters to get the
-desired result.  E.g., the byte sequence @code{0xc2 0x61} (non-spacing
-acute accent, followed by lower-case `a') yields the ``small a with
-acute'' character.  To get the acute accent character on its own, one has
-to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
-
-This type of character set is used in some embedded systems such as
-teletex.
-
-@item
-@cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally,
-it is often also sufficient to simply use an encoding different than
-UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
-encoding: UTF-8.  This encoding is able to represent all of @w{ISO
-10646} 31 bits in a byte string of length one to six.
-
-@cindex UTF-7
-There were a few other attempts to encode @w{ISO 10646} such as UTF-7
-but UTF-8 is today the only encoding which should be used.  In fact,
-UTF-8 will hopefully soon be the only external encoding that has to be
-supported.  It proves to be universally usable and the only disadvantage
-is that it favors Roman languages by making the byte string
-representation of other scripts (Cyrillic, Greek, Asian scripts) longer
-than necessary if using a specific character set for these scripts.
-Methods like the Unicode compression scheme can alleviate these
-problems.
-@end itemize
-
-The question remaining is: how to select the character set or encoding
-to use.  The answer: you cannot decide about it yourself, it is decided
-by the developers of the system or the majority of the users.  Since the
-goal is interoperability one has to use whatever the other people one
-works with use.  If there are no constraints the selection is based on
-the requirements the expected circle of users will have.  I.e., if a
-project is expected to only be used in, say, Russia it is fine to use
-KOI8-R or a similar character set.  But if at the same time people from,
-say, Greece are participating one should use a character set which allows
-all people to collaborate.
-
-The most widely useful solution seems to be: go with the most general
-character set, namely @w{ISO 10646}.  Use UTF-8 as the external encoding
-and problems about users not being able to use their own language
-adequately are a thing of the past.
-
-One final comment about the choice of the wide character representation
-is necessary at this point.  We have said above that the natural choice
-is using Unicode or @w{ISO 10646}.  This is not required, but at least
-encouraged, by the @w{ISO C} standard.  The standard defines at least a
-macro @code{__STDC_ISO_10646__} that is only defined on systems where
-the @code{wchar_t} type encodes @w{ISO 10646} characters.  If this
-symbol is not defined one should as much as possible avoid making
-assumptions about the wide character representation.  If the programmer
-uses only the functions provided by the C library to handle wide
-character strings there should not be any compatibility problems with
-other systems.
-
-@node Charset Function Overview
-@section Overview about Character Handling Functions
-
-A Unix @w{C library} contains three different sets of functions in two
-families to handle character set conversion.  The one function family
-is specified in the @w{ISO C} standard and therefore is portable even
-beyond the Unix world.
-
-The most commonly known set of functions, coming from the @w{ISO C90}
-standard, is unfortunately the least useful one.  In fact, these
-functions should be avoided whenever possible, especially when
-developing libraries (as opposed to applications).
-
-The second family of functions got introduced in the early Unix standards
-(XPG2) and is still part of the latest and greatest Unix standard:
-@w{Unix 98}.  It is also the most powerful and useful set of functions.
-But we will start with the functions defined in @w{Amendment 1} to
-@w{ISO C90}.
-
-@node Restartable multibyte conversion
-@section Restartable Multibyte Conversion Functions
-
-The @w{ISO C} standard defines functions to convert strings from a
-multibyte representation to wide character strings.  There are a number
-of peculiarities:
-
-@itemize @bullet
-@item
-The character set assumed for the multibyte encoding is not specified
-as an argument to the functions.  Instead the character set specified by
-the @code{LC_CTYPE} category of the current locale is used; see
-@ref{Locale Categories}.
-
-@item
-The functions handling more than one character at a time require NUL
-terminated strings as the argument.  I.e., converting blocks of text
-does not work unless one can add a NUL byte at an appropriate place.
-The GNU C library contains some extensions to the standard which allow
-specifying a size but basically they also expect terminated strings.
-@end itemize
-
-Despite these limitations the @w{ISO C} functions can very well be used
-in many contexts.  In graphical user interfaces, for instance, it is not
-uncommon to have functions which require text to be displayed in a wide
-character string if it is not simple ASCII.  The text itself might come
-from a file with translations and the user should decide about the
-current locale which determines the translation and therefore also the
-external encoding used.  In such a situation (and many others) the
-functions described here are perfect.  If more freedom while performing
-the conversion is necessary take a look at the @code{iconv} functions
-(@pxref{Generic Charset Conversion}).
-
-@menu
-* Selecting the Conversion::     Selecting the conversion and its properties.
-* Keeping the state::            Representing the state of the conversion.
-* Converting a Character::       Converting Single Characters.
-* Converting Strings::           Converting Multibyte and Wide Character
-                                  Strings.
-* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
-@end menu
-
-@node Selecting the Conversion
-@subsection Selecting the conversion and its properties
-
-We already said above that the currently selected locale for the
-@code{LC_CTYPE} category decides about the conversion which is performed
-by the functions we are about to describe.  Each locale uses its own
-character set (given as an argument to @code{localedef}) and this is the
-one assumed as the external multibyte encoding.  The wide character
-character set always is UCS-4, at least on GNU systems.
-
-A characteristic of each multibyte character set is the maximum number
-of bytes which can be necessary to represent one character.  This
-information is quite important when writing code which uses the
-conversion functions.  The examples below illustrate this.
-The @w{ISO C} standard defines two macros which provide this information.
-
-
-@comment limits.h
-@comment ISO
-@deftypevr Macro int MB_LEN_MAX
-This macro specifies the maximum number of bytes in the multibyte
-sequence for a single character in any of the supported locales.  It is
-a compile-time constant and it is defined in @file{limits.h}.
-@pindex limits.h
-@end deftypevr
-
-@comment stdlib.h
-@comment ISO
-@deftypevr Macro int MB_CUR_MAX
-@code{MB_CUR_MAX} expands into a positive integer expression that is the
-maximum number of bytes in a multibyte character in the current locale.
-The value is never greater than @code{MB_LEN_MAX}.  Unlike
-@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
-fact, in the GNU C library it is not.
-
-@pindex stdlib.h
-@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
-@end deftypevr
-
-Two different macros are necessary since strictly @w{ISO C90} compilers
-do not allow variable length array definitions but still it is desirable
-to avoid dynamic allocation.  This incomplete piece of code shows the
-problem:
-
-@smallexample
-@{
-  char buf[MB_LEN_MAX];
-  ssize_t len = 0;
-
-  while (! feof (fp))
-    @{
-      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
-      /* @r{... process} buf */
-      len -= used;
-    @}
-@}
-@end smallexample
-
-The code in the inner loop is expected to have always enough bytes in
-the array @var{buf} to convert one multibyte character.  The array
-@var{buf} has to be sized statically since many compilers do not allow a
-variable size.  The @code{fread} call makes sure that always
-@code{MB_CUR_MAX} bytes are available in @var{buf}.  Note that it isn't
-a problem if @code{MB_CUR_MAX} is not a compile-time constant.
-
-
-@node Keeping the state
-@subsection Representing the state of the conversion
-
-@cindex stateful
-In the introduction of this chapter it was said that certain character
-sets use a @dfn{stateful} encoding.  I.e., the encoded values depend in
-some way on the previous bytes in the text.
-
-Since the conversion functions allow converting a text in more than one
-step we must have a way to pass this information from one call of the
-functions to another.
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} mbstate_t
-@cindex shift state
-A variable of type @code{mbstate_t} can contain all the information
-about the @dfn{shift state} needed from one call to a conversion
-function to another.
-
-@pindex wchar.h
-This type is defined in @file{wchar.h}.  It got introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-To use objects of this type the programmer has to define such objects
-(normally as local variables on the stack) and pass a pointer to the
-object to the conversion functions.  This way the conversion function
-can update the object if the current multibyte character set is
-stateful.
-
-There is no specific function or initializer to put the state object in
-any specific state.  The rules are that the object should always
-represent the initial state before the first use and this is achieved by
-clearing the whole variable with code such as follows:
-
-@smallexample
-@{
-  mbstate_t state;
-  memset (&state, '\0', sizeof (state));
-  /* @r{from now on @var{state} can be used.}  */
-  ...
-@}
-@end smallexample
-
-When using the conversion functions to generate output it is often
-necessary to test whether the current state corresponds to the initial
-state.  This is necessary, for example, to decide whether or not to emit
-escape sequences to set the state to the initial state at certain
-sequence points.  Communication protocols often require this.
-
-@comment wchar.h
-@comment ISO
-@deftypefun int mbsinit (const mbstate_t *@var{ps})
-This function determines whether the state object pointed to by @var{ps}
-is in the initial state or not.  If @var{ps} is a null pointer or the
-object is in the initial state the return value is nonzero.  Otherwise
-it is zero.
-
-@pindex wchar.h
-This function was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
-
-Code using this function often looks similar to this:
-
-@c Fix the example to explicitly say how to generate the escape sequence
-@c to restore the initial state.
-@smallexample
-@{
-  mbstate_t state;
-  memset (&state, '\0', sizeof (state));
-  /* @r{Use @var{state}.}  */
-  ...
-  if (! mbsinit (&state))
-    @{
-      /* @r{Emit code to return to initial state.}  */
-      const wchar_t empty[] = L"";
-      const wchar_t *srcp = empty;
-      wcsrtombs (outbuf, &srcp, outbuflen, &state);
-    @}
-  ...
-@}
-@end smallexample
-
-The code to emit the escape sequence to get back to the initial state is
-interesting.  The @code{wcsrtombs} function can be used to determine the
-necessary output code (@pxref{Converting Strings}).  Please note that on
-GNU systems it is not necessary to perform this extra action for the
-conversion from multibyte text to wide character text since the wide
-character encoding is not stateful.  But there is nothing mentioned in
-any standard which prohibits making @code{wchar_t} using a stateful
-encoding.
-
-@node Converting a Character
-@subsection Converting Single Characters
-
-The most fundamental of the conversion functions are those dealing with
-single characters.  Please note that this does not always mean single
-bytes.  But since there is very often a subset of the multibyte
-character set which consists of single byte sequences there are
-functions to help with converting bytes.  One very important and often
-applicable scenario is where ASCII is a subpart of the multibyte
-character set.  I.e., all ASCII characters stand for themselves and all
-other characters have at least a first byte which is beyond the range
-@math{0} to @math{127}.
-
-@comment wchar.h
-@comment ISO
-@deftypefun wint_t btowc (int @var{c})
-The @code{btowc} function (``byte to wide character'') converts a valid
-single byte character @var{c} in the initial shift state into the wide
-character equivalent using the conversion rules from the currently
-selected locale of the @code{LC_CTYPE} category.
-
-If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
-character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
-
-Please note the restriction of @var{c} being tested for validity only in
-the initial shift state.  There is no @code{mbstate_t} object used from
-which the state information is taken and the function also does not use
-any static state.
-
-@pindex wchar.h
-This function was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
-
-Despite the limitation that the single byte value always is interpreted
-in the initial state this function is actually useful most of the time.
-Most character sets are either entirely single-byte character sets or they
-are extensions to ASCII.  But then it is possible to write code like this
-(not that this specific example is very useful):
-
-@smallexample
-wchar_t *
-itow (unsigned long int val)
-@{
-  static wchar_t buf[30];
-  wchar_t *wcp = &buf[29];
-  *wcp = L'\0';
-  while (val != 0)
-    @{
-      *--wcp = btowc ('0' + val % 10);
-      val /= 10;
-    @}
-  if (wcp == &buf[29])
-    *--wcp = L'0';
-  return wcp;
-@}
-@end smallexample
-
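The removed text above describes clearing an `mbstate_t` object with `memset`, testing it with `mbsinit`, and converting single bytes with `btowc`. As a minimal sketch of those three pieces working together (not part of the patch), the following C code can be compiled on its own. The helper names `fresh_state_is_initial` and `itow_checked` are hypothetical; `itow_checked` is a variant of the manual's `itow` example that additionally checks the `btowc` result against `WEOF`. It assumes `unsigned long int` has at most 64 bits, so that 30 wide characters are enough.

```c
#include <stdio.h>    /* EOF */
#include <string.h>   /* memset */
#include <wchar.h>    /* wchar_t, wint_t, WEOF, mbstate_t, mbsinit, btowc */

/* Hypothetical helper: clear a state object the way the manual text
   describes and report whether mbsinit sees the initial state.  */
int
fresh_state_is_initial (void)
{
  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  return mbsinit (&state) != 0;
}

/* Hypothetical variant of the manual's itow example.  It converts VAL
   to a decimal wide character string in a static buffer and returns
   NULL if btowc cannot convert a digit in the current locale.
   Assumption: unsigned long int needs at most 20 decimal digits.  */
wchar_t *
itow_checked (unsigned long int val)
{
  static wchar_t buf[30];
  wchar_t *wcp = &buf[29];

  *wcp = L'\0';
  do
    {
      /* btowc interprets the byte in the initial shift state.  */
      wint_t wc = btowc ('0' + (int) (val % 10));
      if (wc == WEOF)
        return NULL;
      *--wcp = (wchar_t) wc;
      val /= 10;
    }
  while (val != 0);

  return wcp;
}
```

Because the result lives in a static buffer, each call overwrites the previous result and the function is not reentrant, the same trade-off the original `itow` example makes. The `do`/`while` form produces `"0"` for a zero argument without the separate check used in the original.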