author: Ulrich Drepper <drepper@redhat.com> 2001-11-05 08:04:39 +0000
committer: Ulrich Drepper <drepper@redhat.com> 2001-11-05 08:04:39 +0000
commit: 91f07167e37541706554e4117c32aae1bd436cc9
tree: 05ece0714b396155a8e923f8f226ec8edafe7757
parent: 50d274e5a66e4baed5fc0ade52650970a1728798
Editing.
-rw-r--r-- manual/charset.texi 5784
1 files changed, 2892 insertions, 2892 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
index bb9cc64b8d..b7b2f734a8 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -1,2892 +1,2892 @@
-@node Character Set Handling, Locales, String and Array Utilities, Top
-@c %MENU% Support for extended character sets
-@chapter Character Set Handling
-
-@ifnottex
-@macro cal{text}
-\text\
-@end macro
-@end ifnottex
-
-Character sets used in the early days of computing had only six, seven,
-or eight bits for each character: there was never a case where more than
-eight bits (one byte) were used to represent a single character. The
-limitations of this approach became more apparent as more people
-grappled with non-Roman character sets, where not all the characters
-that make up a language's character set can be represented by @math{2^8}
-choices. This chapter shows the functionality which was added to the C
-library to support multiple character sets.
-
-@menu
-* Extended Char Intro:: Introduction to Extended Characters.
-* Charset Function Overview:: Overview about Character Handling
- Functions.
-* Restartable multibyte conversion:: Restartable multibyte conversion
- Functions.
-* Non-reentrant Conversion:: Non-reentrant Conversion Function.
-* Generic Charset Conversion:: Generic Charset Conversion.
-@end menu
-
-
-@node Extended Char Intro
-@section Introduction to Extended Characters
-
-A variety of solutions to overcome the differences between
-character sets with a 1:1 relation between bytes and characters and
-character sets with ratios of 2:1 or 4:1 exist. The remainder of this
-section gives a few examples to help understand the design decisions
-made while developing the functionality of the @w{C library}.
-
-@cindex internal representation
-A distinction we have to make right away is between internal and
-external representation. @dfn{Internal representation} means the
-representation used by a program while keeping the text in memory.
-External representations are used when text is stored or transmitted
-through whatever communication channel. Examples of external
-representations include files lying in a directory that are going to be
-read and parsed.
-
-Traditionally there has been no difference between the two representations.
-It was equally comfortable and useful to use the same single-byte
-representation internally and externally. This changes with more and
-larger character sets.
-
-One of the problems to overcome with the internal representation is
-handling text that is externally encoded using different character
-sets. Assume a program which reads two texts and compares them using
-some metric. The comparison can be usefully done only if the texts are
-internally kept in a common format.
-
-@cindex wide character
-For such a common format (@math{=} character set) eight bits are certainly
-no longer enough. So the smallest entity will have to grow: @dfn{wide
-characters} will now be used. Instead of one byte, two or four bytes
-will be used. (Three bytes are awkward to address in memory and more
-than four bytes do not seem to be necessary.)
-
-@cindex Unicode
-@cindex ISO 10646
-As shown in some other part of this manual,
-@c !!! Ahem, wide char string functions are not yet covered -- drepper
-there exists a completely new family of functions which can handle texts
-of this kind in memory. The most commonly used character sets for such
-internal wide character representations are Unicode and @w{ISO 10646}
-(also known as UCS for Universal Character Set). Unicode was originally
-planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
-be a 31-bit large code space. The two standards are practically identical.
-They have the same character repertoire and code table, but Unicode specifies
-added semantics. At the moment, only characters in the first @code{0x10000}
-code positions (the so-called Basic Multilingual Plane, BMP) have been
-assigned, but the assignment of more specialized characters outside this
-16-bit space is already in progress. A number of encodings have been
-defined for Unicode and @w{ISO 10646} characters:
-@cindex UCS-2
-@cindex UCS-4
-@cindex UTF-8
-@cindex UTF-16
-UCS-2 is a 16-bit word that can only represent characters
-from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
-and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
-ASCII characters are represented by ASCII bytes and non-ASCII characters
-by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
-of UCS-2 in which pairs of certain UCS-2 words can be used to encode
-non-BMP characters up to @code{0x10ffff}.
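The UTF-8 scheme described above can be sketched in C. This is a minimal illustration, not code from the C library: the helper name `ucs4_to_utf8` is hypothetical, and it assumes the original 31-bit definition of UTF-8 with sequences of up to six bytes, as the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the UCS-4 value CP as UTF-8 into OUT,
   which must have room for at least 6 bytes.  Returns the number of
   bytes written, or 0 if CP does not fit in 31 bits.  */
size_t
ucs4_to_utf8 (uint32_t cp, unsigned char *out)
{
  size_t len, i;

  assert (out != NULL);

  if (cp < 0x80)
    {
      /* ASCII characters are represented by the ASCII byte itself.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  else if (cp < 0x800)
    len = 2;
  else if (cp < 0x10000)
    len = 3;
  else if (cp < 0x200000)
    len = 4;
  else if (cp < 0x4000000)
    len = 5;
  else if (cp < 0x80000000u)
    len = 6;
  else
    return 0;

  /* Fill the continuation bytes from the end, six payload bits each.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }

  /* Lead byte: LEN high bits set, then a zero bit, then the rest.  */
  out[0] = (unsigned char) ((0xff00 >> len) | cp);
  return len;
}
```

With this sketch, U+00E9 yields the two bytes `0xc3 0xa9`, and every byte of a multi-byte sequence is outside the ASCII range.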
-
-To represent wide characters the @code{char} type is not suitable. For
-this reason the @w{ISO C} standard introduces a new type which is
-designed to keep one character of a wide character string. To maintain
-the similarity there is also a type corresponding to @code{int} for
-those functions which take a single wide character.
-
-@comment stddef.h
-@comment ISO
-@deftp {Data type} wchar_t
-This data type is used as the base type for wide character strings.
-I.e., arrays of objects of this type are the equivalent of @code{char[]}
-for multibyte character strings. The type is defined in @file{stddef.h}.
-
-The @w{ISO C90} standard, where this type was introduced, does not say
-anything specific about the representation. It only requires that this
-type is capable of storing all elements of the basic character set.
-Therefore it would be legitimate to define @code{wchar_t} as
-@code{char}. This might make sense for embedded systems.
-
-But for GNU systems this type is always 32 bits wide. It is therefore
-capable of representing all UCS-4 values and therefore covering all of
-@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type and
-thereby follow Unicode very strictly. This is perfectly fine with the
-standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
-fact a multi-wide-character encoding. But this contradicts the purpose
-of the @code{wchar_t} type.
-@end deftp
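To illustrate why a 16-bit `wchar_t` leads to a multi-wide-character encoding, the following sketch splits a non-BMP code point into a UTF-16 surrogate pair. The helper name `ucs4_to_surrogates` is an assumption made for this example; it is not a C library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a code point in the range
   0x10000..0x10ffff into a UTF-16 surrogate pair.  Returns 1 on
   success and 0 if CP lies outside that range.  */
int
ucs4_to_surrogates (uint32_t cp, uint16_t *high, uint16_t *low)
{
  assert (high != NULL && low != NULL);

  if (cp < 0x10000 || cp > 0x10ffff)
    return 0;

  cp -= 0x10000;                             /* 20 bits remain.  */
  *high = (uint16_t) (0xd800 + (cp >> 10));  /* Upper 10 bits.  */
  *low = (uint16_t) (0xdc00 + (cp & 0x3ff)); /* Lower 10 bits.  */
  return 1;
}
```

One character therefore occupies two 16-bit units, so code that indexes such a string by unit no longer indexes it by character.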
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} wint_t
-@code{wint_t} is a data type used for parameters and variables which
-contain a single wide character. As the name already suggests it is the
-equivalent to @code{int} when using the normal @code{char} strings. The
-types @code{wchar_t} and @code{wint_t} often have the same
-representation if their size is 32 bits, but if @code{wchar_t} is
-defined as @code{char} the type @code{wint_t} must be defined as
-@code{int} due to the parameter promotion.
-
-@pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-As for the @code{char} data type, there are macros
-specifying the minimum and maximum value representable in an object of
-type @code{wchar_t}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MIN
-The macro @code{WCHAR_MIN} evaluates to the minimum value representable
-by an object of type @code{wchar_t}.
-
-This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WCHAR_MAX
-The macro @code{WCHAR_MAX} evaluates to the maximum value representable
-by an object of type @code{wchar_t}.
-
-This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
-@end deftypevr
-
-Another special wide character value is the equivalent to @code{EOF}.
-
-@comment wchar.h
-@comment ISO
-@deftypevr Macro wint_t WEOF
-The macro @code{WEOF} evaluates to a constant expression of type
-@code{wint_t} whose value is different from any member of the extended
-character set.
-
-@code{WEOF} need not be the same value as @code{EOF} and unlike
-@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
-
-@smallexample
-@{
- int c;
- ...
- while ((c = getc (fp)) >= 0)
- ...
-@}
-@end smallexample
-
-@noindent
-has to be rewritten to explicitly use @code{WEOF} when wide characters
-are used.
-
-@smallexample
-@{
- wint_t c;
- ...
- while ((c = getwc (fp)) != WEOF)
- ...
-@}
-@end smallexample
-
-@pindex wchar.h
-This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
-defined in @file{wchar.h}.
-@end deftypevr
-
-
-These internal representations present problems when it comes to storage
-and transmittal. Since a single wide character consists of more
-than one byte, it is affected by byte-ordering. I.e., machines with
-different endiannesses would see different values when accessing the same data.
-This also applies to communication protocols, which are all byte-based,
-so the sender has to decide how to split the wide
-character into bytes. A last (but not least important) point is that wide
-characters often require more storage space than a customized byte
-oriented character set.
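The byte-ordering problem can be avoided by fixing the byte order explicitly when a wide character leaves the program. The following is only a sketch under that assumption; the helper names `ucs4_store_be` and `ucs4_load_be` are hypothetical, and big-endian order is an arbitrary choice for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the 32-bit value WC to OUT in big-endian
   order, independent of the byte order of the host.  */
void
ucs4_store_be (uint32_t wc, unsigned char out[4])
{
  assert (out != NULL);
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Hypothetical helper: read back a value stored by ucs4_store_be.  */
uint32_t
ucs4_load_be (const unsigned char in[4])
{
  assert (in != NULL);
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Note that even a plain ASCII character takes four bytes in this form, which illustrates the storage cost mentioned above.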
-
-@cindex multibyte character
-@cindex EBCDIC
- For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS-2 or UCS-4.
-The external encoding is byte-based and can be chosen appropriately for
-the environment and for the texts to be handled. There exist a variety
-of different character sets which can be used for this external
-encoding. They will not be presented exhaustively
-here; instead, a description of the major groups will suffice. All of
-the ASCII-based character sets fulfill one requirement: they are
-"filesystem safe". This means that the character @code{'/'} is used in
-the encoding @emph{only} to represent itself. Things are a bit
-different for character sets like EBCDIC (Extended Binary Coded Decimal
-Interchange Code, a character set family used by IBM) but if the
-operating system does not understand EBCDIC directly the parameters to
-system calls have to be converted first anyhow.
-
-@itemize @bullet
-@item
-The simplest character sets are single-byte character sets. There can
-be only up to 256 characters (for @w{8 bit} character sets) which is not
-sufficient to cover all languages but might be sufficient to handle a
-specific text. Handling of @w{8 bit} character sets is simple. This is
-not true for the other kinds presented later and therefore the
-application one uses might require the use of @w{8 bit} character sets.
-
-@cindex ISO 2022
-@item
-The @w{ISO 2022} standard defines a mechanism for extended character
-sets where one character @emph{can} be represented by more than one
-byte. This is achieved by associating a state with the text. Embedded
-in the text can be characters which can be used to change the state.
-Each byte in the text might have a different interpretation in each
-state. The state might even influence whether a given byte stands for a
-character on its own or whether it has to be combined with some more
-bytes.
-
-@cindex EUC
-@cindex Shift_JIS
-@cindex SJIS
-In most uses of @w{ISO 2022} the defined character sets do not allow
-state changes which cover more than the next character. This has the
-big advantage that whenever one can identify the beginning of the byte
-sequence of a character one can interpret a text correctly. Examples of
-character sets using this policy are the various EUC character sets
-(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or Shift_JIS (SJIS, a Japanese encoding).
-
-But there are also character sets using a state which is valid for more
-than one character and has to be changed by another byte sequence.
-Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
-
-@item
-@cindex ISO 6937
-Early attempts to fix 8 bit character sets for other languages using the
-Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
-representing characters like the acute accent do not produce output
-themselves: one has to combine them with other characters to get the
-desired result. E.g., one writes the byte sequence @code{0xc2 0x61} (non-spacing
-acute accent, followed by lower-case `a') to get the ``small a with
-acute'' character. To get the acute accent character on its own, one has
-to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
-
-This type of character set is used in some embedded systems such as
-teletex.
-
-@item
-@cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally,
-it is often also sufficient to simply use an encoding different than
-UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
-encoding: UTF-8. This encoding is able to represent all of @w{ISO
-10646} 31 bits in a byte string of length one to six.
-
-@cindex UTF-7
-There were a few other attempts to encode @w{ISO 10646} such as UTF-7
-but UTF-8 is today the only encoding which should be used. In fact,
-UTF-8 will hopefully soon be the only external encoding that has to be
-supported. It proves to be universally usable and the only disadvantage
-is that it favors Roman languages by making the byte string
-representation of other scripts (Cyrillic, Greek, Asian scripts) longer
-than necessary if using a specific character set for these scripts.
-Methods like the Unicode compression scheme can alleviate these
-problems.
-@end itemize
-
-The question remaining is: how to select the character set or encoding
-to use. The answer: you cannot decide about it yourself, it is decided
-by the developers of the system or the majority of the users. Since the
-goal is interoperability one has to use whatever the other people one
-works with use. If there are no constraints the selection is based on
-the requirements the expected circle of users will have. I.e., if a
-project is expected to only be used in, say, Russia it is fine to use
-KOI8-R or a similar character set. But if at the same time people from,
-say, Greece are participating one should use a character set which allows
-all people to collaborate.
-
-The most widely useful solution seems to be: go with the most general
-character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding
-and problems about users not being able to use their own language
-adequately are a thing of the past.
-
-One final comment about the choice of the wide character representation
-is necessary at this point. We have said above that the natural choice
-is using Unicode or @w{ISO 10646}. This is not required, but at least
-encouraged, by the @w{ISO C} standard. The standard defines at least a
-macro @code{__STDC_ISO_10646__} that is only defined on systems where
-the @code{wchar_t} type encodes @w{ISO 10646} characters. If this
-symbol is not defined one should as much as possible avoid making
-assumptions about the wide character representation. If the programmer
-uses only the functions provided by the C library to handle wide
-character strings there should not be any compatibility problems with
-other systems.
-
-@node Charset Function Overview
-@section Overview about Character Handling Functions
-
-A Unix @w{C library} contains three different sets of functions in two
-families to handle character set conversion. One of the function families
-is specified in the @w{ISO C} standard and therefore is portable even
-beyond the Unix world.
-
-The most commonly known set of functions, coming from the @w{ISO C90}
-standard, is unfortunately the least useful one. In fact, these
-functions should be avoided whenever possible, especially when
-developing libraries (as opposed to applications).
-
-The second family of functions got introduced in the early Unix standards
-(XPG2) and is still part of the latest and greatest Unix standard:
-@w{Unix 98}. It is also the most powerful and useful set of functions.
-But we will start with the functions defined in @w{Amendment 1} to
-@w{ISO C90}.
-
-@node Restartable multibyte conversion
-@section Restartable Multibyte Conversion Functions
-
-The @w{ISO C} standard defines functions to convert strings from a
-multibyte representation to wide character strings. There are a number
-of peculiarities:
-
-@itemize @bullet
-@item
-The character set assumed for the multibyte encoding is not specified
-as an argument to the functions. Instead the character set specified by
-the @code{LC_CTYPE} category of the current locale is used; see
-@ref{Locale Categories}.
-
-@item
-The functions handling more than one character at a time require NUL
-terminated strings as the argument. I.e., converting blocks of text
-does not work unless one can add a NUL byte at an appropriate place.
-The GNU C library contains some extensions to the standard which allow
-specifying a size but basically they also expect terminated strings.
-@end itemize
-
-Despite these limitations the @w{ISO C} functions can very well be used
-in many contexts. In graphical user interfaces, for instance, it is not
-uncommon to have functions which require text to be displayed in a wide
-character string if it is not simple ASCII. The text itself might come
-from a file with translations and the user should decide about the
-current locale which determines the translation and therefore also the
-external encoding used. In such a situation (and many others) the
-functions described here are perfect. If more freedom while performing
-the conversion is necessary take a look at the @code{iconv} functions
-(@pxref{Generic Charset Conversion}).
-
-@menu
-* Selecting the Conversion:: Selecting the conversion and its properties.
-* Keeping the state:: Representing the state of the conversion.
-* Converting a Character:: Converting Single Characters.
-* Converting Strings:: Converting Multibyte and Wide Character
- Strings.
-* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
-@end menu
-
-@node Selecting the Conversion
-@subsection Selecting the conversion and its properties
-
-We already said above that the currently selected locale for the
-@code{LC_CTYPE} category decides about the conversion which is performed
-by the functions we are about to describe. Each locale uses its own
-character set (given as an argument to @code{localedef}) and this is the
-one assumed as the external multibyte encoding. The wide character
-character set always is UCS-4, at least on GNU systems.
-
-A characteristic of each multibyte character set is the maximum number
-of bytes which can be necessary to represent one character. This
-information is quite important when writing code which uses the
-conversion functions. The examples below illustrate this.
-The @w{ISO C} standard defines two macros which provide this information.
-
-
-@comment limits.h
-@comment ISO
-@deftypevr Macro int MB_LEN_MAX
-This macro specifies the maximum number of bytes in the multibyte
-sequence for a single character in any of the supported locales. It is
-a compile-time constant and it is defined in @file{limits.h}.
-@pindex limits.h
-@end deftypevr
-
-@comment stdlib.h
-@comment ISO
-@deftypevr Macro int MB_CUR_MAX
-@code{MB_CUR_MAX} expands into a positive integer expression that is the
-maximum number of bytes in a multibyte character in the current locale.
-The value is never greater than @code{MB_LEN_MAX}. Unlike
-@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
-fact, in the GNU C library it is not.
-
-@pindex stdlib.h
-@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
-@end deftypevr
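The relation between the two macros can be checked at run time. The following sketch assumes a hypothetical helper named `mb_cur_max_for`, which is not part of the C library; it selects a locale for the `LC_CTYPE` category and reports the resulting value of `MB_CUR_MAX`.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: select LOCALE_NAME for the LC_CTYPE category
   and return the value MB_CUR_MAX has in it.  Returns 0 if the locale
   cannot be selected.  The previous locale is not restored.  */
size_t
mb_cur_max_for (const char *locale_name)
{
  assert (locale_name != NULL);

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return 0;

  /* MB_CUR_MAX is evaluated here, at run time, for the new locale.  */
  return MB_CUR_MAX;
}
```

Whatever locale is selected, the returned value should not exceed `MB_LEN_MAX`, which is why a buffer of `MB_LEN_MAX` bytes can be declared statically.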
-
-Two different macros are necessary since strictly @w{ISO C90} compilers
-do not allow variable length array definitions but still it is desirable
-to avoid dynamic allocation. This incomplete piece of code shows the
-problem:
-
-@smallexample
-@{
- char buf[MB_LEN_MAX];
- ssize_t len = 0;
-
- while (! feof (fp))
- @{
- fread (&buf[len], 1, MB_CUR_MAX - len, fp);
- /* @r{... process} buf */
- len -= used;
- @}
-@}
-@end smallexample
-
-The code in the inner loop is expected to have always enough bytes in
-the array @var{buf} to convert one multibyte character. The array
-@var{buf} has to be sized statically since many compilers do not allow a
-variable size. The @code{fread} call makes sure that always
-@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't
-a problem if @code{MB_CUR_MAX} is not a compile-time constant.
-
-
-@node Keeping the state
-@subsection Representing the state of the conversion
-
-@cindex stateful
-In the introduction of this chapter it was said that certain character
-sets use a @dfn{stateful} encoding. I.e., the encoded values depend in
-some way on the previous bytes in the text.
-
-Since the conversion functions allow converting a text in more than one
-step we must have a way to pass this information from one call of the
-functions to another.
-
-@comment wchar.h
-@comment ISO
-@deftp {Data type} mbstate_t
-@cindex shift state
-A variable of type @code{mbstate_t} can contain all the information
-about the @dfn{shift state} needed from one call to a conversion
-function to another.
-
-@pindex wchar.h
-This type is defined in @file{wchar.h}. It got introduced in
-@w{Amendment 1} to @w{ISO C90}.
-@end deftp
-
-To use objects of this type the programmer has to define such objects
-(normally as local variables on the stack) and pass a pointer to the
-object to the conversion functions. This way the conversion function
-can update the object if the current multibyte character set is
-stateful.
-
-There is no specific function or initializer to put the state object in
-any specific state. The rules are that the object should always
-represent the initial state before the first use and this is achieved by
-clearing the whole variable with code such as follows:
-
-@smallexample
-@{
- mbstate_t state;
- memset (&state, '\0', sizeof (state));
- /* @r{from now on @var{state} can be used.} */
- ...
-@}
-@end smallexample
-
-When using the conversion functions to generate output it is often
-necessary to test whether the current state corresponds to the initial
-state. This is necessary, for example, to decide whether or not to emit
-escape sequences to set the state to the initial state at certain
-sequence points. Communication protocols often require this.
-
-@comment wchar.h
-@comment ISO
-@deftypefun int mbsinit (const mbstate_t *@var{ps})
-This function determines whether the state object pointed to by @var{ps}
-is in the initial state or not. If @var{ps} is a null pointer or the
-object is in the initial state the return value is nonzero. Otherwise
-it is zero.
-
-@pindex wchar.h
-This function was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
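A minimal sketch of how clearing the object and `mbsinit` fit together follows. The helper name `reset_state` is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: put the state object PS into the initial state
   by clearing it, as described above.  */
void
reset_state (mbstate_t *ps)
{
  assert (ps != NULL);
  memset (ps, '\0', sizeof *ps);
}
```

After `reset_state (&state)` a call to `mbsinit (&state)` is expected to return a nonzero value, and the same holds for `mbsinit (NULL)`.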
-
-Code using this function often looks similar to this:
-
-@c Fix the example to explicitly say how to generate the escape sequence
-@c to restore the initial state.
-@smallexample
-@{
- mbstate_t state;
- memset (&state, '\0', sizeof (state));
- /* @r{Use @var{state}.} */
- ...
- if (! mbsinit (&state))
- @{
- /* @r{Emit code to return to initial state.} */
- const wchar_t empty[] = L"";
- const wchar_t *srcp = empty;
- wcsrtombs (outbuf, &srcp, outbuflen, &state);
- @}
- ...
-@}
-@end smallexample
-
-The code to emit the escape sequence to get back to the initial state is
-interesting. The @code{wcsrtombs} function can be used to determine the
-necessary output code (@pxref{Converting Strings}). Please note that on
-GNU systems it is not necessary to perform this extra action for the
-conversion from multibyte text to wide character text since the wide
-character encoding is not stateful. But there is nothing mentioned in
-any standard which prohibits making @code{wchar_t} using a stateful
-encoding.
-
-@node Converting a Character
-@subsection Converting Single Characters
-
-The most fundamental of the conversion functions are those dealing with
-single characters. Please note that this does not always mean single
-bytes. But since there is very often a subset of the multibyte
-character set which consists of single byte sequences there are
-functions to help with converting bytes. One very important and often
-applicable scenario is where ASCII is a subpart of the multibyte
-character set. I.e., all ASCII characters stand for themselves and all
-other characters have at least a first byte which is beyond the range
-@math{0} to @math{127}.
-
-@comment wchar.h
-@comment ISO
-@deftypefun wint_t btowc (int @var{c})
-The @code{btowc} function (``byte to wide character'') converts a valid
-single byte character @var{c} in the initial shift state into the wide
-character equivalent using the conversion rules from the currently
-selected locale of the @code{LC_CTYPE} category.
-
-If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
-character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
-
-Please note the restriction of @var{c} being tested for validity only in
-the initial shift state. There is no @code{mbstate_t} object used from
-which the state information is taken and the function also does not use
-any static state.
-
-@pindex wchar.h
-This function was introduced in @w{Amendment 1} to @w{ISO C90} and
-is declared in @file{wchar.h}.
-@end deftypefun
-
-Despite the limitation that the single byte value always is interpreted
-in the initial state this function is actually useful most of the time.
-Most character sets are either entirely single-byte character sets or they
-are extensions to ASCII. So it is possible to write code like this
-(not that this specific example is very useful):
-
-@smallexample
-wchar_t *
-itow (unsigned long int val)
-@{
- static wchar_t buf[30];
- wchar_t *wcp = &buf[29];
- *wcp = L'\0';
- while (val != 0)
- @{
- *--wcp = btowc ('0' + val % 10);
- val /= 10;
- @}
- if (wcp == &buf[29])
- *--wcp = L'0';
- return wcp;
-@}
-@end smallexample
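The same idea extends from single digits to whole strings. The sketch below uses a hypothetical helper, `widen_bytes`, which is not a C library function. It assumes every byte of the input is a complete character in the initial shift state, and reports failure as soon as `btowc` returns `WEOF`.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL-terminated byte string SRC to
   a wide character string in DST (capacity DSTLEN elements) using
   btowc.  Returns the number of wide characters written, not counting
   the terminator, or (size_t) -1 if DST is too small or a byte is not
   a valid single-byte character.  */
size_t
widen_bytes (const char *src, wchar_t *dst, size_t dstlen)
{
  size_t n = 0;

  assert (src != NULL && dst != NULL && dstlen > 0);

  while (src[n] != '\0')
    {
      wint_t wc;

      /* Keep one element free for the terminating L'\0'.  */
      if (n + 1 >= dstlen)
        return (size_t) -1;

      wc = btowc ((unsigned char) src[n]);
      if (wc == WEOF)
        return (size_t) -1;

      dst[n] = (wchar_t) wc;
      ++n;
    }

  dst[n] = L'\0';
  return n;
}
```

This only works for text made up of single-byte characters; anything else needs the restartable conversion functions.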
-