| author | Ulrich Drepper <drepper@redhat.com> | 2001-11-05 08:04:39 +0000 |
|---|---|---|
| committer | Ulrich Drepper <drepper@redhat.com> | 2001-11-05 08:04:39 +0000 |
| commit | 91f07167e37541706554e4117c32aae1bd436cc9 | |
| tree | 05ece0714b396155a8e923f8f226ec8edafe7757 | |
| parent | 50d274e5a66e4baed5fc0ade52650970a1728798 | |
Editing.
| -rw-r--r-- | manual/charset.texi | 5784 |
1 files changed, 2892 insertions, 2892 deletions
diff --git a/manual/charset.texi b/manual/charset.texi index bb9cc64b8d..b7b2f734a8 100644 --- a/manual/charset.texi +++ b/manual/charset.texi @@ -1,2892 +1,2892 @@ -@node Character Set Handling, Locales, String and Array Utilities, Top -@c %MENU% Support for extended character sets -@chapter Character Set Handling - -@ifnottex -@macro cal{text} -\text\ -@end macro -@end ifnottex - -Character sets used in the early days of computing had only six, seven, -or eight bits for each character: there was never a case where more than -eight bits (one byte) were used to represent a single character. The -limitations of this approach became more apparent as more people -grappled with non-Roman character sets, where not all the characters -that make up a language's character set can be represented by @math{2^8} -choices. This chapter shows the functionality which was added to the C -library to support multiple character sets. - -@menu -* Extended Char Intro:: Introduction to Extended Characters. -* Charset Function Overview:: Overview about Character Handling - Functions. -* Restartable multibyte conversion:: Restartable multibyte conversion - Functions. -* Non-reentrant Conversion:: Non-reentrant Conversion Function. -* Generic Charset Conversion:: Generic Charset Conversion. -@end menu - - -@node Extended Char Intro -@section Introduction to Extended Characters - -A variety of solutions to overcome the differences between -character sets with a 1:1 relation between bytes and characters and -character sets with ratios of 2:1 or 4:1 exist. The remainder of this -section gives a few examples to help understand the design decisions -made while developing the functionality of the @w{C library}. - -@cindex internal representation -A distinction we have to make right away is between internal and -external representation. @dfn{Internal representation} means the -representation used by a program while keeping the text in memory. 
-External representations are used when text is stored or transmitted -through whatever communication channel. Examples of external -representations include files lying in a directory that are going to be -read and parsed. - -Traditionally there has been no difference between the two representations. -It was equally comfortable and useful to use the same single-byte -representation internally and externally. This changes with more and -larger character sets. - -One of the problems to overcome with the internal representation is -handling text that is externally encoded using different character -sets. Assume a program which reads two texts and compares them using -some metric. The comparison can be usefully done only if the texts are -internally kept in a common format. - -@cindex wide character -For such a common format (@math{=} character set) eight bits are certainly -no longer enough. So the smallest entity will have to grow: @dfn{wide -characters} will now be used. Instead of one byte, two or four will -be used instead. (Three are not good to address in memory and more -than four bytes seem not to be necessary). - -@cindex Unicode -@cindex ISO 10646 -As shown in some other part of this manual, -@c !!! Ahem, wide char string functions are not yet covered -- drepper -there exists a completely new family of functions which can handle texts -of this kind in memory. The most commonly used character sets for such -internal wide character representations are Unicode and @w{ISO 10646} -(also known as UCS for Universal Character Set). Unicode was originally -planned as a 16-bit character set, whereas @w{ISO 10646} was designed to -be a 31-bit large code space. The two standards are practically identical. -They have the same character repertoire and code table, but Unicode specifies -added semantics. 
At the moment, only characters in the first @code{0x10000} -code positions (the so-called Basic Multilingual Plane, BMP) have been -assigned, but the assignment of more specialized characters outside this -16-bit space is already in progress. A number of encodings have been -defined for Unicode and @w{ISO 10646} characters: -@cindex UCS-2 -@cindex UCS-4 -@cindex UTF-8 -@cindex UTF-16 -UCS-2 is a 16-bit word that can only represent characters -from the BMP, UCS-4 is a 32-bit word than can represent any Unicode -and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where -ASCII characters are represented by ASCII bytes and non-ASCII characters -by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension -of UCS-2 in which pairs of certain UCS-2 words can be used to encode -non-BMP characters up to @code{0x10ffff}. - -To represent wide characters the @code{char} type is not suitable. For -this reason the @w{ISO C} standard introduces a new type which is -designed to keep one character of a wide character string. To maintain -the similarity there is also a type corresponding to @code{int} for -those functions which take a single wide character. - -@comment stddef.h -@comment ISO -@deftp {Data type} wchar_t -This data type is used as the base type for wide character strings. -I.e., arrays of objects of this type are the equivalent of @code{char[]} -for multibyte character strings. The type is defined in @file{stddef.h}. - -The @w{ISO C90} standard, where this type was introduced, does not say -anything specific about the representation. It only requires that this -type is capable of storing all elements of the basic character set. -Therefore it would be legitimate to define @code{wchar_t} as -@code{char}. This might make sense for embedded systems. - -But for GNU systems this type is always 32 bits wide. It is therefore -capable of representing all UCS-4 values and therefore covering all of -@w{ISO 10646}. 
Some Unix systems define @code{wchar_t} as a 16-bit type and -thereby follow Unicode very strictly. This is perfectly fine with the -standard but it also means that to represent all characters from Unicode -and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in -fact a multi-wide-character encoding. But this contradicts the purpose -of the @code{wchar_t} type. -@end deftp - -@comment wchar.h -@comment ISO -@deftp {Data type} wint_t -@code{wint_t} is a data type used for parameters and variables which -contain a single wide character. As the name already suggests it is the -equivalent to @code{int} when using the normal @code{char} strings. The -types @code{wchar_t} and @code{wint_t} have often the same -representation if their size if 32 bits wide but if @code{wchar_t} is -defined as @code{char} the type @code{wint_t} must be defined as -@code{int} due to the parameter promotion. - -@pindex wchar.h -This type is defined in @file{wchar.h} and got introduced in -@w{Amendment 1} to @w{ISO C90}. -@end deftp - -As there are for the @code{char} data type there also exist macros -specifying the minimum and maximum value representable in an object of -type @code{wchar_t}. - -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WCHAR_MIN -The macro @code{WCHAR_MIN} evaluates to the minimum value representable -by an object of type @code{wint_t}. - -This macro got introduced in @w{Amendment 1} to @w{ISO C90}. -@end deftypevr - -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WCHAR_MAX -The macro @code{WCHAR_MAX} evaluates to the maximum value representable -by an object of type @code{wint_t}. - -This macro got introduced in @w{Amendment 1} to @w{ISO C90}. -@end deftypevr - -Another special wide character value is the equivalent to @code{EOF}. 
- -@comment wchar.h -@comment ISO -@deftypevr Macro wint_t WEOF -The macro @code{WEOF} evaluates to a constant expression of type -@code{wint_t} whose value is different from any member of the extended -character set. - -@code{WEOF} need not be the same value as @code{EOF} and unlike -@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like - -@smallexample -@{ - int c; - ... - while ((c = getc (fp)) < 0) - ... -@} -@end smallexample - -@noindent -has to be rewritten to explicitly use @code{WEOF} when wide characters -are used. - -@smallexample -@{ - wint_t c; - ... - while ((c = wgetc (fp)) != WEOF) - ... -@} -@end smallexample - -@pindex wchar.h -This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is -defined in @file{wchar.h}. -@end deftypevr - - -These internal representations present problems when it comes to storing -and transmittal, since a single wide character consists of more -than one byte they are effected by byte-ordering. I.e., machines with -different endianesses would see different value accessing the same data. -This also applies for communication protocols which are all byte-based -and therefore the sender has to decide about splitting the wide -character in bytes. A last (but not least important) point is that wide -characters often require more storage space than an customized byte -oriented character set. - -@cindex multibyte character -@cindex EBCDIC - For all the above reasons, an external encoding which is different -from the internal encoding is often used if the latter is UCS-2 or UCS-4. -The external encoding is byte-based and can be chosen appropriately for -the environment and for the texts to be handled. There exist a variety -of different character sets which can be used for this external -encoding. Information which will not be exhaustively presented -here--instead, a description of the major groups will suffice. All of -the ASCII-based character sets fulfill one requirement: they are -"filesystem safe". 
This means that the character @code{'/'} is used in -the encoding @emph{only} to represent itself. Things are a bit -different for character sets like EBCDIC (Extended Binary Coded Decimal -Interchange Code, a character set family used by IBM) but if the -operation system does not understand EBCDIC directly the parameters to -system calls have to be converted first anyhow. - -@itemize @bullet -@item -The simplest character sets are single-byte character sets. There can -be only up to 256 characters (for @w{8 bit} character sets) which is not -sufficient to cover all languages but might be sufficient to handle a -specific text. Handling of @w{8 bit} character sets is simple. This is -not true for the other kinds presented later and therefore the -application one uses might require the use of @w{8 bit} character sets. - -@cindex ISO 2022 -@item -The @w{ISO 2022} standard defines a mechanism for extended character -sets where one character @emph{can} be represented by more than one -byte. This is achieved by associating a state with the text. Embedded -in the text can be characters which can be used to change the state. -Each byte in the text might have a different interpretation in each -state. The state might even influence whether a given byte stands for a -character on its own or whether it has to be combined with some more -bytes. - -@cindex EUC -@cindex Shift_JIS -@cindex SJIS -In most uses of @w{ISO 2022} the defined character sets do not allow -state changes which cover more than the next character. This has the -big advantage that whenever one can identify the beginning of the byte -sequence of a character one can interpret a text correctly. Examples of -character sets using this policy are the various EUC character sets -(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) -or Shift_JIS (SJIS, a Japanese encoding). 
- -But there are also character sets using a state which is valid for more -than one character and has to be changed by another byte sequence. -Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN. - -@item -@cindex ISO 6937 -Early attempts to fix 8 bit character sets for other languages using the -Roman alphabet lead to character sets like @w{ISO 6937}. Here bytes -representing characters like the acute accent do not produce output -themselves: one has to combine them with other characters to get the -desired result. E.g., the byte sequence @code{0xc2 0x61} (non-spacing -acute accent, following by lower-case `a') to get the ``small a with -acute'' character. To get the acute accent character on its own, one has -to write @code{0xc2 0x20} (the non-spacing acute followed by a space). - -This type of character set is used in some embedded systems such as -teletex. - -@item -@cindex UTF-8 -Instead of converting the Unicode or @w{ISO 10646} text used internally, -it is often also sufficient to simply use an encoding different than -UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an -encoding: UTF-8. This encoding is able to represent all of @w{ISO -10464} 31 bits in a byte string of length one to six. - -@cindex UTF-7 -There were a few other attempts to encode @w{ISO 10646} such as UTF-7 -but UTF-8 is today the only encoding which should be used. In fact, -UTF-8 will hopefully soon be the only external encoding that has to be -supported. It proves to be universally usable and the only disadvantage -is that it favors Roman languages by making the byte string -representation of other scripts (Cyrillic, Greek, Asian scripts) longer -than necessary if using a specific character set for these scripts. -Methods like the Unicode compression scheme can alleviate these -problems. -@end itemize - -The question remaining is: how to select the character set or encoding -to use. 
The answer: you cannot decide about it yourself, it is decided -by the developers of the system or the majority of the users. Since the -goal is interoperability one has to use whatever the other people one -works with use. If there are no constraints the selection is based on -the requirements the expected circle of users will have. I.e., if a -project is expected to only be used in, say, Russia it is fine to use -KOI8-R or a similar character set. But if at the same time people from, -say, Greece are participating one should use a character set which allows -all people to collaborate. - -The most widely useful solution seems to be: go with the most general -character set, namely @w{ISO 10646}. Use UTF-8 as the external encoding -and problems about users not being able to use their own language -adequately are a thing of the past. - -One final comment about the choice of the wide character representation -is necessary at this point. We have said above that the natural choice -is using Unicode or @w{ISO 10646}. This is not required, but at least -encouraged, by the @w{ISO C} standard. The standard defines at least a -macro @code{__STDC_ISO_10646__} that is only defined on systems where -the @code{wchar_t} type encodes @w{ISO 10646} characters. If this -symbol is not defined one should as much as possible avoid making -assumption about the wide character representation. If the programmer -uses only the functions provided by the C library to handle wide -character strings there should not be any compatibility problems with -other systems. - -@node Charset Function Overview -@section Overview about Character Handling Functions - -A Unix @w{C library} contains three different sets of functions in two -families to handle character set conversion. The one function family -is specified in the @w{ISO C} standard and therefore is portable even -beyond the Unix world. 
- -The most commonly known set of functions, coming from the @w{ISO C90} -standard, is unfortunately the least useful one. In fact, these -functions should be avoided whenever possible, especially when -developing libraries (as opposed to applications). - -The second family of functions got introduced in the early Unix standards -(XPG2) and is still part of the latest and greatest Unix standard: -@w{Unix 98}. It is also the most powerful and useful set of functions. -But we will start with the functions defined in @w{Amendment 1} to -@w{ISO C90}. - -@node Restartable multibyte conversion -@section Restartable Multibyte Conversion Functions - -The @w{ISO C} standard defines functions to convert strings from a -multibyte representation to wide character strings. There are a number -of peculiarities: - -@itemize @bullet -@item -The character set assumed for the multibyte encoding is not specified -as an argument to the functions. Instead the character set specified by -the @code{LC_CTYPE} category of the current locale is used; see -@ref{Locale Categories}. - -@item -The functions handling more than one character at a time require NUL -terminated strings as the argument. I.e., converting blocks of text -does not work unless one can add a NUL byte at an appropriate place. -The GNU C library contains some extensions the standard which allow -specifying a size but basically they also expect terminated strings. -@end itemize - -Despite these limitations the @w{ISO C} functions can very well be used -in many contexts. In graphical user interfaces, for instance, it is not -uncommon to have functions which require text to be displayed in a wide -character string if it is not simple ASCII. The text itself might come -from a file with translations and the user should decide about the -current locale which determines the translation and therefore also the -external encoding used. In such a situation (and many others) the -functions described here are perfect. 
If more freedom while performing -the conversion is necessary take a look at the @code{iconv} functions -(@pxref{Generic Charset Conversion}). - -@menu -* Selecting the Conversion:: Selecting the conversion and its properties. -* Keeping the state:: Representing the state of the conversion. -* Converting a Character:: Converting Single Characters. -* Converting Strings:: Converting Multibyte and Wide Character - Strings. -* Multibyte Conversion Example:: A Complete Multibyte Conversion Example. -@end menu - -@node Selecting the Conversion -@subsection Selecting the conversion and its properties - -We already said above that the currently selected locale for the -@code{LC_CTYPE} category decides about the conversion which is performed -by the functions we are about to describe. Each locale uses its own -character set (given as an argument to @code{localedef}) and this is the -one assumed as the external multibyte encoding. The wide character -character set always is UCS-4, at least on GNU systems. - -A characteristic of each multibyte character set is the maximum number -of bytes which can be necessary to represent one character. This -information is quite important when writing code which uses the -conversion functions. In the examples below we will see some examples. -The @w{ISO C} standard defines two macros which provide this information. - - -@comment limits.h -@comment ISO -@deftypevr Macro int MB_LEN_MAX -This macro specifies the maximum number of bytes in the multibyte -sequence for a single character in any of the supported locales. It is -a compile-time constant and it is defined in @file{limits.h}. -@pindex limits.h -@end deftypevr - -@comment stdlib.h -@comment ISO -@deftypevr Macro int MB_CUR_MAX -@code{MB_CUR_MAX} expands into a positive integer expression that is the -maximum number of bytes in a multibyte character in the current locale. -The value is never greater than @code{MB_LEN_MAX}. 
Unlike -@code{MB_LEN_MAX} this macro need not be a compile-time constant and in -fact, in the GNU C library it is not. - -@pindex stdlib.h -@code{MB_CUR_MAX} is defined in @file{stdlib.h}. -@end deftypevr - -Two different macros are necessary since strictly @w{ISO C90} compilers -do not allow variable length array definitions but still it is desirable -to avoid dynamic allocation. This incomplete piece of code shows the -problem: - -@smallexample -@{ - char buf[MB_LEN_MAX]; - ssize_t len = 0; - - while (! feof (fp)) - @{ - fread (&buf[len], 1, MB_CUR_MAX - len, fp); - /* @r{... process} buf */ - len -= used; - @} -@} -@end smallexample - -The code in the inner loop is expected to have always enough bytes in -the array @var{buf} to convert one multibyte character. The array -@var{buf} has to be sized statically since many compilers do not allow a -variable size. The @code{fread} call makes sure that always -@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it isn't -a problem if @code{MB_CUR_MAX} is not a compile-time constant. - - -@node Keeping the state -@subsection Representing the state of the conversion - -@cindex stateful -In the introduction of this chapter it was said that certain character -sets use a @dfn{stateful} encoding. I.e., the encoded values depend in -some way on the previous bytes in the text. - -Since the conversion functions allow converting a text in more than one -step we must have a way to pass this information from one call of the -functions to another. - -@comment wchar.h -@comment ISO -@deftp {Data type} mbstate_t -@cindex shift state -A variable of type @code{mbstate_t} can contain all the information -about the @dfn{shift state} needed from one call to a conversion -function to another. - -@pindex wchar.h -This type is defined in @file{wchar.h}. It got introduced in -@w{Amendment 1} to @w{ISO C90}. 
-@end deftp - -To use objects of this type the programmer has to define such objects -(normally as local variables on the stack) and pass a pointer to the -object to the conversion functions. This way the conversion function -can update the object if the current multibyte character set is -stateful. - -There is no specific function or initializer to put the state object in -any specific state. The rules are that the object should always -represent the initial state before the first use and this is achieved by -clearing the whole variable with code such as follows: - -@smallexample -@{ - mbstate_t state; - memset (&state, '\0', sizeof (state)); - /* @r{from now on @var{state} can be used.} */ - ... -@} -@end smallexample - -When using the conversion functions to generate output it is often -necessary to test whether the current state corresponds to the initial -state. This is necessary, for example, to decide whether or not to emit -escape sequences to set the state to the initial state at certain -sequence points. Communication protocols often require this. - -@comment wchar.h -@comment ISO -@deftypefun int mbsinit (const mbstate_t *@var{ps}) -This function determines whether the state object pointed to by @var{ps} -is in the initial state or not. If @var{ps} is a null pointer or the -object is in the initial state the return value is nonzero. Otherwise -it is zero. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. -@end deftypefun - -Code using this function often looks similar to this: - -@c Fix the example to explicitly say how to generate the escape sequence -@c to restore the initial state. -@smallexample -@{ - mbstate_t state; - memset (&state, '\0', sizeof (state)); - /* @r{Use @var{state}.} */ - ... - if (! 
mbsinit (&state)) - @{ - /* @r{Emit code to return to initial state.} */ - const wchar_t empty[] = L""; - const wchar_t *srcp = empty; - wcsrtombs (outbuf, &srcp, outbuflen, &state); - @} - ... -@} -@end smallexample - -The code to emit the escape sequence to get back to the initial state is -interesting. The @code{wcsrtombs} function can be used to determine the -necessary output code (@pxref{Converting Strings}). Please note that on -GNU systems it is not necessary to perform this extra action for the -conversion from multibyte text to wide character text since the wide -character encoding is not stateful. But there is nothing mentioned in -any standard which prohibits making @code{wchar_t} using a stateful -encoding. - -@node Converting a Character -@subsection Converting Single Characters - -The most fundamental of the conversion functions are those dealing with -single characters. Please note that this does not always mean single -bytes. But since there is very often a subset of the multibyte -character set which consists of single byte sequences there are -functions to help with converting bytes. One very important and often -applicable scenario is where ASCII is a subpart of the multibyte -character set. I.e., all ASCII characters stand for itself and all -other characters have at least a first byte which is beyond the range -@math{0} to @math{127}. - -@comment wchar.h -@comment ISO -@deftypefun wint_t btowc (int @var{c}) -The @code{btowc} function (``byte to wide character'') converts a valid -single byte character @var{c} in the initial shift state into the wide -character equivalent using the conversion rules from the currently -selected locale of the @code{LC_CTYPE} category. - -If @code{(unsigned char) @var{c}} is no valid single byte multibyte -character or if @var{c} is @code{EOF} the function returns @code{WEOF}. - -Please note the restriction of @var{c} being tested for validity only in -the initial shift state. 
There is no @code{mbstate_t} object used from -which the state information is taken and the function also does not use -any static state. - -@pindex wchar.h -This function was introduced in @w{Amendment 1} to @w{ISO C90} and -is declared in @file{wchar.h}. -@end deftypefun - -Despite the limitation that the single byte value always is interpreted -in the initial state this function is actually useful most of the time. -Most characters are either entirely single-byte character sets or they -are extension to ASCII. But then it is possible to write code like this -(not that this specific example is very useful): - -@smallexample -wchar_t * -itow (unsigned long int val) -@{ - static wchar_t buf[30]; - wchar_t *wcp = &buf[29]; - *wcp = L'\0'; - while (val != 0) - @{ - *--wcp = btowc ('0' + val % 10); - val /= 10; - @} - if (wcp == &buf[29]) - *--wcp = L'0'; - return wcp; -@} -@end smallexample - |
