| author | Ulrich Drepper <drepper@redhat.com> | 2001-11-07 07:21:22 +0000 |
|---|---|---|
| committer | Ulrich Drepper <drepper@redhat.com> | 2001-11-07 07:21:22 +0000 |
| commit | bd3916e8fb78089adb44cdc7ec8b737e11a9a0d6 (patch) | |
| tree | ea51f9dcbb51c285c70495e60a6a9288bbbb8c49 | |
| parent | 7982ecfe469a67ce5b249da9d6b24a8d6103fc6f (diff) | |
Update.
2001-11-07 Kaoru Fukui <k_fukui@highway.ne.jp>
* manual/charset.texi: Fix typo @w[ISO 6937] to @w{ISO 6937}.
Also fix typo @code {mbsinit} to @code{mbsinit}.
| -rw-r--r-- | ChangeLog | 5 |
| -rw-r--r-- | manual/charset.texi | 587 |
2 files changed, 298 insertions, 294 deletions
@@ -1,3 +1,8 @@
+2001-11-07 Kaoru Fukui <k_fukui@highway.ne.jp>
+
+ * manual/charset.texi: Fix typo @w[ISO 6937] to @w{ISO 6937}.
+ Also fix typo @code {mbsinit} to @code{mbsinit}.
+
 2001-11-06 Ulrich Drepper <drepper@redhat.com>

  * elf/dl-profile.c: Replace state variable with simple flag named
diff --git a/manual/charset.texi b/manual/charset.texi
index bae2910236..4fb58d1cac 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -102,8 +102,8 @@ those functions that take a single wide character.
 @comment ISO
 @deftp {Data type} wchar_t
 This data type is used as the base type for wide character strings.
-In other words, arrays of objects of this type are the equivalent of
-@code{char[]} for multibyte character strings. The type is defined in
+In other words, arrays of objects of this type are the equivalent of
+@code{char[]} for multibyte character strings. The type is defined in
 @file{stddef.h}.

 The @w{ISO C90} standard, where @code{wchar_t} was introduced, does not
@@ -171,7 +171,7 @@ The macro @code{WEOF} evaluates to a constant expression of type
 character set.

 @code{WEOF} need not be the same value as @code{EOF} and unlike
-@code{EOF} it also need @emph{not} be negative. In other words, sloppy
+@code{EOF} it also need @emph{not} be negative. In other words, sloppy
 code like

 @smallexample
@@ -214,29 +214,28 @@ than a customized byte-oriented character set.

 @cindex multibyte character
 @cindex EBCDIC
- For all the above reasons, an external encoding that is different
-from the internal encoding is often used if the latter is UCS-2 or UCS-4.
+For all the above reasons, an external encoding that is different from
+the internal encoding is often used if the latter is UCS-2 or UCS-4.
 The external encoding is byte-based and can be chosen appropriately for
 the environment and for the texts to be handled.
 A variety of different character sets can be used for this external
 encoding (information that will not be exhaustively presented
 here--instead, a description of the major groups will suffice). All of
 the ASCII-based character sets
-[_bkoz_: do you mean Roman character sets? If not, what do you mean
-here?] fulfill one requirement: they are "filesystem safe." This means
-that the character @code{'/'} is used in the encoding @emph{only} to
+fulfill one requirement: they are "filesystem safe." This means that
+the character @code{'/'} is used in the encoding @emph{only} to
 represent itself. Things are a bit different for character sets like
 EBCDIC (Extended Binary Coded Decimal Interchange Code, a character set
 family used by IBM), but if the operation system does not understand
-EBCDIC directly the parameters-to-system calls have to be converted first
-anyhow.
+EBCDIC directly the parameters-to-system calls have to be converted
+first anyhow.

 @itemize @bullet
-@item
-The simplest character sets are single-byte character sets. There can
-be only up to 256 characters (for @w{8 bit} character sets), which is
-not sufficient to cover all languages but might be sufficient to handle
-a specific text. Handling of a @w{8 bit} character sets is simple. This
-is not true for other kinds presented later, and therefore, the
+@item
+The simplest character sets are single-byte character sets. There can
+be only up to 256 characters (for @w{8 bit} character sets), which is
+not sufficient to cover all languages but might be sufficient to handle
+a specific text. Handling of a @w{8 bit} character sets is simple. This
+is not true for other kinds presented later, and therefore, the
 application one uses might require the use of @w{8 bit} character sets.

 @cindex ISO 2022
@@ -277,7 +276,7 @@ a with acute'' character. To get the acute accent character on its
 own, one has to write @code{0xc2 0x20} (the non-spacing acute followed by
 a space).
-Character sets like @w[ISO 6937] are used in some embedded systems such
+Character sets like @w{ISO 6937} are used in some embedded systems such
 as teletex.

 @item
@@ -330,13 +329,13 @@ be no compatibility problems with other systems.
 @node Charset Function Overview
 @section Overview about Character Handling Functions

-A Unix @w{C library} contains three different sets of functions in two
-families to handle character set conversion. One of the function families
-(the most commonly used) is specified in the @w{ISO C90} standard and,
-therefore, is portable even beyond the Unix world. Unfortunately this
-family is the least useful one. These functions should be avoided
-whenever possible, especially when developing libraries (as opposed to
-applications).
+A Unix @w{C library} contains three different sets of functions in two
+families to handle character set conversion. One of the function families
+(the most commonly used) is specified in the @w{ISO C90} standard and,
+therefore, is portable even beyond the Unix world. Unfortunately this
+family is the least useful one. These functions should be avoided
+whenever possible, especially when developing libraries (as opposed to
+applications).

 The second family of functions got introduced in the early Unix standards
 (XPG2) and is still part of the latest and greatest Unix standard:
@@ -361,7 +360,7 @@ the @code{LC_CTYPE} category of the current locale is used; see
 @item
 The functions handling more than one character at a time require NUL
 terminated strings as the argument (i.e., converting blocks of text
-does not work unless one can add a NUL byte at an appropriate place).
+does not work unless one can add a NUL byte at an appropriate place).
 The GNU C library contains some extensions to the standard that allow
 specifying a size, but basically they also expect terminated strings.
 @end itemize
@@ -369,7 +368,7 @@ specifying a size, but basically they also expect terminated strings.
 Despite these limitations the @w{ISO C} functions can be used in many
 contexts. In graphical user interfaces, for instance, it is not
 uncommon to have functions that require text to be displayed in a wide
-character string if the text is not simple ASCII. The text itself might
+character string if the text is not simple ASCII. The text itself might
 come from a file with translations and the user should decide about the
 current locale, which determines the translation and therefore also the
 external encoding used. In such a situation (and many others) the
@@ -418,7 +417,7 @@ a compile-time constant and is defined in @file{limits.h}.
 @code{MB_CUR_MAX} expands into a positive integer expression that is the
 maximum number of bytes in a multibyte character in the current locale.
 The value is never greater than @code{MB_LEN_MAX}. Unlike
-@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
+@code{MB_LEN_MAX} this macro need not be a compile-time constant, and in
 the GNU C library it is not.

 @pindex stdlib.h
@@ -447,7 +446,7 @@ problem:
 The code in the inner loop is expected to have always enough bytes in
 the array @var{buf} to convert one multibyte character. The array
 @var{buf} has to be sized statically since many compilers do not allow a
-variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
+variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
 bytes are always available in @var{buf}. Note that it isn't
 a problem if @code{MB_CUR_MAX} is not a compile-time constant.

@@ -457,7 +456,7 @@ a problem if @code{MB_CUR_MAX} is not a compile-time constant.
 @cindex stateful

 In the introduction of this chapter it was said that certain character
-sets use a @dfn{stateful} encoding. That is, the encoded values depend
+sets use a @dfn{stateful} encoding. That is, the encoded values depend
 in some way on the previous bytes in the text.
 Since the conversion functions allow converting a text in more than one
@@ -477,8 +476,8 @@ function to another.
 @w{Amendment 1} to @w{ISO C90}.
 @end deftp

-To use objects of type @code{mbstate_t} the programmer has to define such
-objects (normally as local variables on the stack) and pass a pointer to
+To use objects of type @code{mbstate_t} the programmer has to define such
+objects (normally as local variables on the stack) and pass a pointer to
 the object to the conversion functions. This way the conversion function
 can update the object if the current multibyte character set is stateful.

@@ -505,17 +504,17 @@ sequence points. Communication protocols often require this.
 @comment wchar.h
 @comment ISO
 @deftypefun int mbsinit (const mbstate_t *@var{ps})
-The @code {mbsinit} function determines whether the state object pointed
-to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
-the object is in the initial state the return value is nonzero. Otherwise
+The @code{mbsinit} function determines whether the state object pointed
+to by @var{ps} is in the initial state. If @var{ps} is a null pointer or
+the object is in the initial state the return value is nonzero. Otherwise
 it is zero.

 @pindex wchar.h
-@code {mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
+@code{mbsinit} was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun

-Code using @code {mbsinit} often looks similar to this:
+Code using @code{mbsinit} often looks similar to this:

 @c Fix the example to explicitly say how to generate the escape sequence
 @c to restore the initial state.
@@ -552,9 +551,9 @@ The most fundamental of the conversion functions are those dealing with
 single characters. Please note that this does not always mean single
 bytes. But since there is very often a subset of the multibyte
 character set that consists of single byte sequences, there are
-functions to help with converting bytes. Frequently, ASCII is a subpart
-of the multibyte character set. In such a scenario, each ASCII character
-stands for itself, and all other characters have at least a first byte
+functions to help with converting bytes. Frequently, ASCII is a subpart
+of the multibyte character set. In such a scenario, each ASCII character
+stands for itself, and all other characters have at least a first byte
 that is beyond the range @math{0} to @math{127}.

 @comment wchar.h
@@ -574,7 +573,7 @@ which the state information is taken, and the function also does not use
 any static state.

 @pindex wchar.h
-The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
+The @code{btowc} function was introduced in @w{Amendment 1} to @w{ISO C90}
 and is declared in @file{wchar.h}.
 @end deftypefun

@@ -661,7 +660,7 @@ If the first @var{n} bytes of the multibyte string possibly form a valid
 multibyte character but there are more than @var{n} bytes needed to
 complete it, the return value of the function is @code{(size_t) -2} and
 no value is stored. Please note that this can happen even if @var{n}
-has a value greater than or equal to @code{MB_CUR_MAX} since the input
+has a value greater than or equal to @code{MB_CUR_MAX} since the input
 might contain redundant shift sequences.

 If the first @code{n} bytes of the multibyte string cannot possibly form
@@ -707,23 +706,23 @@ mbstouwcs (const char *s)

 The use of @code{mbrtowc} should be clear. A single wide character is
 stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
-in the variable @var{nbytes}. If the conversion is successful, the
-uppercase variant of the wide character is stored in the @var{result}
-array and the pointer to the input string and the number of available
+in the variable @var{nbytes}. If the conversion is successful, the
+uppercase variant of the wide character is stored in the @var{result}
+array and the pointer to the input string and the number of available
 bytes is adjusted.
-The only non-obvious thing about @code{mbrtowc} might be the way memory
-is allocated for the result. The above code uses the fact that there
+The only non-obvious thing about @code{mbrtowc} might be the way memory
+is allocated for the result. The above code uses the fact that there
 can never be more wide characters in the converted results than there are
-bytes in the multibyte input string. This method yields a pessimistic
-guess about the size of the result, and if many wide character strings
-have to be constructed this way or if the strings are long, the extra
-memory required to be allocated because the input string contains
-multibyte characters might be significant. The allocated memory block can
-be resized to the correct size before returning it, but a better solution
-might be to allocate just the right amount of space for the result right
-away. Unfortunately there is no function to compute the length of the wide
-character string directly from the multibyte string. There is, however, a
+bytes in the multibyte input string. This method yields a pessimistic
+guess about the size of the result, and if many wide character strings
+have to be constructed this way or if the strings are long, the extra
+memory required to be allocated because the input string contains
+multibyte characters might be significant. The allocated memory block can
+be resized to the correct size before returning it, but a better solution
+might be to allocate just the right amount of space for the result right
+away. Unfortunately there is no function to compute the length of the wide
+character string directly from the multibyte string. There is, however, a
 function that does part of the work.

 @comment wchar.h
@@ -739,8 +738,8 @@ multibyte character, the number of bytes belonging to this multibyte
 character byte sequence is returned.

 If the the first @var{n} bytes possibly form a valid multibyte
-character but the character is incomplete, the return value is
-@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
+character but the character is incomplete, the return value is
+@code{(size_t) -2}. Otherwise the multibyte character sequence is invalid
 and the return value is @code{(size_t) -1}.

 The multibyte sequence is interpreted in the state represented by the
@@ -752,7 +751,7 @@ object local to @code{mbrlen} is used.
 is declared in @file{wchar.h}.
 @end deftypefun

-The attentive reader now will note that @code{mbrlen} can be implemented
+The attentive reader now will note that @code{mbrlen} can be implemented
 as

 @smallexample
@@ -787,10 +786,10 @@ mbslen (const char *s)
 This function simply calls @code{mbrlen} for each multibyte character
 in the string and counts the number of function calls. Please note that
 we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
-call. This is acceptable since a) this value is larger then the length of
-the longest multibyte character sequence and b) we know that the string
-@var{s} ends with a NUL byte, which cannot be part of any other multibyte
-character sequence but the one representing the NUL wide character.
+call. This is acceptable since a) this value is larger then the length of
+the longest multibyte character sequence and b) we know that the string
+@var{s} ends with a NUL byte, which cannot be part of any other multibyte
+character sequence but the one representing the NUL wide character.
 Therefore, the @code{mbrlen} function will never read invalid memory.

 Now that this function is available (just to make this clear, this
@@ -803,10 +802,10 @@ wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
 @end smallexample

 Please note that the @code{mbslen} function is quite inefficient. The
-implementation of @code{mbstouwcs} with @code{mbslen} would have to
-perform the conversion of the multibyte character input string twice, and
-this conversion might be quite expensive. So it is necessary to think
-about the consequences of using the easier but imprecise method before
+implementation of @code{mbstouwcs} with @code{mbslen} would have to
+perform the conversion of the multibyte character input string twice, and
+this conversion might be quite expensive. So it is necessary to think
+about the consequences of using the easier but imprecise method before
 doing the work twice.

 @comment wchar.h
@@ -831,15 +830,15 @@ writes into an internal buffer, which is guaranteed to be large enough.

 If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if
 necessary, a shift sequence to get the state @var{ps} into the initial
-state followed by a single NUL byte, which is stored in the string
+state followed by a single NUL byte, which is stored in the string
 @var{s}.

-Otherwise a byte sequence (possibly including shift sequences) is written
-into the string @var{s}. This only happens if @var{wc} is a valid wide
-character (i.e., it has a multibyte representation in the character set
-selected by locale of the @code{LC_CTYPE} category). If @var{wc} is no
-valid wide character, nothing is stored in the strings @var{s},
-@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
+Otherwise a byte sequence (possibly including shift sequences) is written
+into the string @var{s}. This only happens if @var{wc} is a valid wide
+character (i.e., it has a multibyte representation in the character set
+selected by locale of the @code{LC_CTYPE} category). If @var{wc} is no
+valid wide character, nothing is stored in the strings @var{s},
+@code{errno} is set to @code{EILSEQ}, the conversion state in @var{ps}
 is undefined and the return value is @code{(size_t) -1}.

 If no error occurred the function returns the number of bytes stored in
@@ -907,8 +906,8 @@ abort if there are not at least @code{MB_CUR_LEN} bytes available. This
 is not always optimal but we have no other choice.
 We might have less than @code{MB_CUR_LEN} bytes available but the next
 multibyte character might also be only one byte long. At the time the
 @code{wcrtomb} call
-returns it is too late to decide whether the buffer was large enough. If
-this solution is unsuitable, there is a very slow but more accurate
+returns it is too late to decide whether the buffer was large enough. If
+this solution is unsuitable, there is a very slow but more accurate
 solution.

 @smallexample
@@ -929,15 +928,15 @@ solution.
 ...
 @end smallexample

-Here we perform the conversion that might overflow the buffer so that
-we are afterwards in the position to make an exact decision about the
-buffer size. Please note the @code{NULL} argument for the destination
-buffer in the new @code{wcrtomb} call; since we are not interested in the
-converted text at this point, this is a nice way to express this. The
-most unusual thing about this piece of code certainly is the duplication
-of the conversion state object, but if a change of the state is necessary
-to emit the next multibyte character, we want to have the same shift state
-change performed in the real conversion. Therefore, we have to preserve
+Here we perform the conversion that might overflow the buffer so that
+we are afterwards in the position to make an exact decision about the
+buffer size. Please note the @code{NULL} argument for the destination
+buffer in the new @code{wcrtomb} call; since we are not interested in the
+converted text at this point, this is a nice way to express this. The
+most unusual thing about this piece of code certainly is the duplication
+of the conversion state object, but if a change of the state is necessary
+to emit the next multibyte character, we want to have the same shift state
+change performed in the real conversion. Therefore, we have to preserve
 the initial shift state information.

 There are certainly many more and even better solutions to this problem.
@@ -962,7 +961,7 @@ string at @code{*@var{src}} into an equivalent wide character string,
 including the NUL wide character at the end. The conversion is started
 using the state information from the object pointed to by @var{ps} or
 from an internal object of @code{mbsrtowcs} if @var{ps} is a null
-pointer. Before returning, the state object is updated to match the state
+pointer. Before returning, the state object is updated to match the state
 after the last converted character. The state is the initial state if the
 terminating NUL byte is reached and converted.

@@ -986,7 +985,7 @@ returns @code{(size_t) -1}.

 In all other cases the function returns the number of wide characters
 converted during this call. If @var{dst} is not null, @code{mbsrtowcs}
-stores in the pointer pointed to by @var{src} either a null pointer (if
+stores in the pointer pointed to by @var{src} either a null pointer (if
 the NUL byte in the input string was reached) or the address of the byte
 following the last converted multibyte character.

@@ -995,8 +994,8 @@ following the last converted multibyte character.
 declared in @file{wchar.h}.
 @end deftypefun

-The definition of the @code{mbsrtowcs} function has one important
-limitation. The requirement that @var{dst} has to be a NUL-terminated
+The definition of the @code{mbsrtowcs} function has one important
+limitation. The requirement that @var{dst} has to be a NUL-terminated
 string provides problems if one wants to convert buffers with text. A
 buffer is normally no collection of NUL-terminated strings but instead a
 continuous collection of lines, separated by newline characters. Now
@@ -1006,10 +1005,10 @@ into the unmodified text buffer. This means, either one inserts the NUL
 byte at the appropriate place for the time of the @code{mbsrtowcs}
 function call (which is not doable for a read-only buffer or in a
 multi-threaded application) or one copies the line in an extra buffer
-where it can be terminated by a NUL byte. Note that it is not in general
-possible to limit the number of characters to convert by setting the
-parameter @var{len} to any specific value. Since it is not known how
-many bytes each multibyte character sequence is in length, one can only
+where it can be terminated by a NUL byte. Note that it is not in general
+possible to limit the number of characters to convert by setting the
+parameter @var{len} to any specific value. Since it is not known how
+many bytes each multibyte character sequence is in length, one can only
 guess.

 @cindex stateful
@@ -1026,7 +1025,7 @@ accessible to the user since the conversion stops after the NUL byte
 (which resets the state). Most stateful character sets in use today
 require that the shift state after a newline be the initial state--but
 this is not a strict guarantee. Therefore, simply NUL-terminating a
-piece of a running text is not always an adequate solution and,
+piece of a running text is not always an adequate solution and,
 therefore, should never be used in generally used code.

 The generic conversion interface (@pxref{Generic Charset Conversion})
@@ -1042,14 +1041,14 @@ length and passing this length to the function.
 @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
 The @code{wcsrtombs} function (``wide character string restartable to
 multibyte string'') converts the NUL-terminated wide character string at
-@code{*@var{src}} into an equivalent multibyte character string and
+@code{*@var{src}} into an equivalent multibyte character string and
 stores the result in the array pointed to by @var{dst}. The NUL wide
 character is also converted. The conversion starts in the state
 described in the object pointed to by @var{ps} or by a state object
 locally to @code{wcsrtombs} in case @var{ps} is a null pointer. If
 @var{dst} is a null pointer, the conversion is performed as usual but the
 result is not available.
 If all characters of the input string were
-successfully converted and if @var{dst} is not a null pointer, the
+successfully converted and if @var{dst} is not a null pointer, the
 pointer pointed to by @var{src} gets assigned a null pointer.

 If one of the wide characters in the input string has no valid multibyte
@@ -1063,23 +1062,23 @@ pointer and the next converted character would require more than
 assigned a value pointing to the wide character right after the last one
 successfully converted.

-Except in the case of an encoding error the return value of the
-@code{wcsrtombs} function is the number of bytes in all the multibyte
-character sequences stored in @var{dst}. Before returning the state in
-the object pointed to by @var{ps} (or the internal object in case
-@var{ps} is a null pointer) is updated to reflect the state after the
-last conversion. The state is the initial shift state in case the
+Except in the case of an encoding error the return value of the
+@code{wcsrtombs} function is the number of bytes in all the multibyte
+character sequences stored in @var{dst}. Before returning the state in
+the object pointed to by @var{ps} (or the internal object in case
+@var{ps} is a null pointer) is updated to reflect the state after the
+last conversion. The state is the initial shift state in case the
 terminating NUL wide character was converted.

 @pindex wchar.h
-The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
+The @code{wcsrtombs} function was introduced in @w{Amendment 1} to
 @w{ISO C90} and is declared in @file{wchar.h}.
 @end deftypefun

 The restriction mentioned above for the @code{mbsrtowcs} function applies
 here also. There is no possibility of directly controlling the number of
-input characters. One has to place the NUL wide character at the correct
-place or control the consumed input indirectly via the available output
+input characters. One has to place the NUL wide character at the correct
+place or control the consumed input indirectly via the available output
 array size (the @var{len} parameter).

 @comment wchar.h
@@ -1090,9 +1089,9 @@ function. All the parameters are the same except for @var{nmc}, which is
 new. The return value is the same as for @code{mbsrtowcs}.

 This new parameter specifies how many bytes at most can be used from the
-multibyte character string. In other words, the multibyte character
-string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte
-is found within the @var{nmc} first bytes of the string, the conversion
+multibyte character string. In other words, the multibyte character
+string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte
+is found within the @var{nmc} first bytes of the string, the conversion
 stops here.

 This function is a GNU extension. It is meant to work around the
@@ -1147,8 +1146,8 @@ No more than @var{nwc} wide characters from the input string
 wide character in the first @var{nwc} characters, the conversion stops at
 this place.

-The @code{wcsnrtombs} function is a GNU extension and just like
-@code{mbsnrtowcs} helps in situations where no NUL-terminated input
+The @code{wcsnrtombs} function is a GNU extension and just like
+@code{mbsnrtowcs} helps in situations where no NUL-terminated input
 strings are available.
 @end deftypefun

@@ -1247,25 +1246,25 @@ file_mbsrtowcs (int input, int output)
 @section Non-reentrant Conversion Function

 The functions described in the previous chapter are defined in
-@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
-also contained functions for character set conversion. The reason that
-these original functions are not described first is that they are almost
+@w{Amendment 1} to @w{ISO C90}, but the original @w{ISO C90} standard
+also contained functions for chara |
