From 390955cbdeb674bead490fc3f74a8a0893ea83cf Mon Sep 17 00:00:00 2001 From: Ulrich Drepper Date: Mon, 11 Jan 1999 20:13:43 +0000 Subject: Update. 1999-01-11 Ulrich Drepper * ctype/Versions [GLIBC_2.0]: Export __ctype32_b. * include/wctype.h: Declare __iswctype. * stdio-common/vfscanf.c (__vfscanf): Use __iswspace instead of iswspace. * wctype/Makefile (routines): Add wcextra_l. * wctype/wcextra.c (iswblank): Implement function here and don't use __iswctype. (__iswblank_l): Move definition to... * wctype/wcextra_l.c: ...here. New file. * wctype/wcfuncs.c: Really implement functions and don't call __iswctype or __towctrans. * wctype/wctype.h: Change isw* and tow* macros. Don't call __iswctype or __towctrans. Instead optimize constant argument case. * iconv/gconv.h: Fix typos. * iconv/skeleton.c: Fix typos. Optimize init function a bit. Correctly emit escape sequence to return to initial state in conversion function. * iconvdata/iso-2022-jp.c (gconv_init): Correctly initialize max_needed_to element. * manual/mbyte.texi: Removed. This is now described in charset.texi. * manual/charset.texi: New file. * manual/Makefile (chapters): Replace mbyte by charset. * manual/ctype.texi: Document wide character functions. * manual/intro.texi: Fix reference to mbyte chapter. * manual/lang.texi: Likewise. * manual/locale.texi: Likewise. * manual/stdio.texi: Likewise. * manual/string.texi: Fix @node line for new charset chapter. * manual/libc.texinfo (UPDATED): Updated. Also update copyright years. * manual/memory.texi (savestring): Optimize code to give a good example. * manual/filesys.texi: Fix wording. Patches by Jim Meyering. * nscd/nscd_getgr_r.c: Include stdint.h to get uintptr_t definition. * nscd/nscd_getpw_r.c: Likewise. * nscd/nscd_gethst_r.c: Likewise. * stdlib/stdtold_l.c: Always include xlocale.h. 1999-01-11 Geoffrey Keating * stdlib/fpioconst.h (LDBL_MAX_10_EXP_LOG): Define to be same as DBL_MAX_10_EXP_LOG if there is no long double. 
(_fpioconst_pow10): Always use size as LDBL_MAX_10_EXP_LOG to match printf_fp.c. 1999-01-10 Andreas Jaeger * timezone/Makefile ($(testdata)/GB): Changed to ... ($(testdata)/Europe/London): ... for tst-timezone test. ($(objpfx)tst-timezone.out): Change GB to Europe/London. * timezone/tst-timezone.c (main): Enable DST switching test, change GB to Europe/London. 1999-01-10 Philip Blundell * socket/Makefile (headers): Remove bits/sockunion.h. 1999-01-09 Philip Blundell * socket/sys/socket.h: Don't include . * sysdeps/generic/bits/sockunion.h: Deleted. * sysdeps/unix/sysv/linux/bits/sockunion.h: Likewise. 1999-01-08 H.J. Lu * io/fts.c (fts_close): Don't access memory after having it freed. --- ChangeLog | 76 + ctype/Versions | 3 +- iconv/gconv.h | 6 +- iconv/skeleton.c | 44 +- iconvdata/iso-2022-jp.c | 6 +- include/wctype.h | 6 + io/fts.c | 12 +- manual/Makefile | 4 +- manual/chapters.texi | 3 +- manual/charset.texi | 2846 ++++++++++++++++++++++++++++++ manual/ctype.texi | 521 +++++- manual/filesys.texi | 4 +- manual/intro.texi | 2 +- manual/lang.texi | 2 +- manual/libc.texinfo | 4 +- manual/locale.texi | 6 +- manual/memory.texi | 3 +- manual/stdio.texi | 8 +- manual/string.texi | 2 +- manual/texis | 2 +- manual/top-menu.texi | 70 +- nscd/nscd_getgr_r.c | 3 +- nscd/nscd_gethst_r.c | 3 +- nscd/nscd_getpw_r.c | 3 +- socket/Makefile | 5 +- socket/sys/socket.h | 5 +- stdio-common/vfscanf.c | 4 +- stdlib/fpioconst.h | 12 +- stdlib/strtold_l.c | 4 +- sysdeps/generic/bits/sockunion.h | 40 - sysdeps/unix/sysv/linux/bits/sockunion.h | 48 - timezone/Makefile | 7 +- timezone/tst-timezone.c | 12 +- wctype/Makefile | 4 +- wctype/wcextra.c | 18 +- wctype/wcextra_l.c | 43 + wctype/wcfuncs.c | 50 +- wctype/wctype.h | 108 +- 38 files changed, 3736 insertions(+), 263 deletions(-) create mode 100644 manual/charset.texi delete mode 100644 sysdeps/generic/bits/sockunion.h delete mode 100644 sysdeps/unix/sysv/linux/bits/sockunion.h create mode 100644 wctype/wcextra_l.c diff --git a/ChangeLog 
b/ChangeLog index 159bd65c51..0515d68376 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,79 @@ +1999-01-11 Ulrich Drepper + + * ctype/Versions [GLIBC_2.0]: Export __ctype32_b. + * include/wctype.h: Declare __iswctype. + * stdio-common/vfscanf.c (__vfscanf): Use __iswspace instead of + iswspace. + * wctype/Makefile (routines): Add wcextra_l. + * wctype/wcextra.c (iswblank): Implement function here and don't use + __iswctype. + (__iswblank_l): Move definition to... + * wctype/wcextra_l.c: ...here. New file. + * wctype/wcfuncs.c: Really implement functions and don't call + __iswctype or __towctrans. + * wctype/wctype.h: Change isw* and tow* macros. Don't call + __iswctype or __towctrans. Instead optimize constant argument case. + + * iconv/gconv.h: Fix typos. + + * iconv/skeleton.c: Fix typos. Optimize init function a bit. + Correctly emit escape sequence to return to initial state in + conversion function. + + * iconvdata/iso-2022-jp.c (gconv_init): Correctly initialize + max_needed_to element. + + * manual/mbyte.texi: Removed. This is now described in charset.texi. + * manual/charset.texi: New file. + * manual/Makefile (chapters): Replace mbyte by charset. + * manual/ctype.texi: Document wide character functions. + * manual/intro.texi: Fix reference to mbyte chapter. + * manual/lang.texi: Likewise. + * manual/locale.texi: Likewise. + * manual/stdio.texi: Likewise. + * manual/string.texi: Fix @node line for new charset chapter. + * manual/libc.texinfo (UPDATED): Updated. Also update copyright years. + * manual/memory.texi (savestring): Optimize code to give a good + example. + + * manual/filesys.texi: Fix wording. Patches by Jim Meyering. + + * nscd/nscd_getgr_r.c: Include stdint.h to get uintptr_t definition. + * nscd/nscd_getpw_r.c: Likewise. + * nscd/nscd_gethst_r.c: Likewise. + + * stdlib/stdtold_l.c: Always include xlocale.h. 
+ +1999-01-11 Geoffrey Keating + + * stdlib/fpioconst.h (LDBL_MAX_10_EXP_LOG): Define to be same as + DBL_MAX_10_EXP_LOG if there is no long double. + (_fpioconst_pow10): Always use size as LDBL_MAX_10_EXP_LOG to match + printf_fp.c. + +1999-01-10 Andreas Jaeger + + * timezone/Makefile ($(testdata)/GB): Changed to ... + ($(testdata)/Europe/London): ... for tst-timezone test. + ($(objpfx)tst-timezone.out): Change GB to Europe/London. + + * timezone/tst-timezone.c (main): Enable DST switching test, + change GB to Europe/London. + +1999-01-10 Philip Blundell + + * socket/Makefile (headers): Remove bits/sockunion.h. + +1999-01-09 Philip Blundell + + * socket/sys/socket.h: Don't include . + * sysdeps/generic/bits/sockunion.h: Deleted. + * sysdeps/unix/sysv/linux/bits/sockunion.h: Likewise. + +1999-01-08 H.J. Lu + + * io/fts.c (fts_close): Don't access memory after having it freed. + 1998-01-08 Andreas Schwab * manual/Makefile (stamp-summary): Remove space after -t option diff --git a/ctype/Versions b/ctype/Versions index 56647bd784..6110f848c8 100644 --- a/ctype/Versions +++ b/ctype/Versions @@ -1,7 +1,8 @@ libc { GLIBC_2.0 { # global variables - __ctype_b; __ctype_tolower; __ctype_toupper; _tolower; _toupper; + __ctype_b; __ctype32_b; __ctype_tolower; __ctype_toupper; + _tolower; _toupper; # i* isalnum; isalpha; isascii; isblank; iscntrl; isdigit; isgraph; islower; diff --git a/iconv/gconv.h b/iconv/gconv.h index 3f787c5e1c..66c34aa928 100644 --- a/iconv/gconv.h +++ b/iconv/gconv.h @@ -1,4 +1,4 @@ -/* Copyright (C) 1997, 1998 Free Software Foundation, Inc. +/* Copyright (C) 1997, 1998, 1999 Free Software Foundation, Inc. This file is part of the GNU C Library. 
The GNU C Library is free software; you can redistribute it and/or @@ -69,7 +69,7 @@ typedef void (*gconv_end_fct) __PMT ((struct gconv_step *)); struct gconv_step { struct gconv_loaded_object *shlib_handle; - const char *modname; + __const char *modname; int counter; @@ -104,7 +104,7 @@ struct gconv_step_data int is_last; /* Counter for number of invocations of the module function for this - desriptor. */ + descriptor. */ int invocation_counter; /* Flag whether this is an internal use of the module (in the mb*towc* diff --git a/iconv/skeleton.c b/iconv/skeleton.c index 4ed16d6e68..c124eb1e07 100644 --- a/iconv/skeleton.c +++ b/iconv/skeleton.c @@ -1,5 +1,5 @@ /* Skeleton for a conversion module. - Copyright (C) 1998 Free Software Foundation, Inc. + Copyright (C) 1998, 1999 Free Software Foundation, Inc. This file is part of the GNU C Library. Contributed by Ulrich Drepper , 1998. @@ -119,7 +119,7 @@ static int to_object; character set we we can define RESET_INPUT_BUFFER is necessary. */ #if !defined RESET_INPUT_BUFFER && !defined SAVE_RESET_STATE # if MIN_NEEDED_FROM == MAX_NEEDED_FROM && MIN_NEEDED_TO == MAX_NEEDED_TO -/* We have to used these `if's here since the compiler cannot know that +/* We have to use these `if's here since the compiler cannot know that (outbuf - outerr) is always divisible by MIN_NEEDED_TO. */ # define RESET_INPUT_BUFFER \ if (MIN_NEEDED_FROM % MIN_NEEDED_TO == 0) \ @@ -144,26 +144,25 @@ gconv_init (struct gconv_step *step) { /* Determine which direction. 
*/ if (__strcasecmp (step->from_name, CHARSET_NAME) == 0) - step->data = &from_object; - else if (__strcasecmp (step->to_name, CHARSET_NAME) == 0) - step->data = &to_object; - else - return GCONV_NOCONV; - - if (step->data == &from_object) { + step->data = &from_object; + step->min_needed_from = MIN_NEEDED_FROM; step->max_needed_from = MAX_NEEDED_FROM; step->min_needed_to = MIN_NEEDED_TO; step->max_needed_to = MAX_NEEDED_TO; } - else + else if (__strcasecmp (step->to_name, CHARSET_NAME) == 0) { + step->data = &to_object; + step->min_needed_from = MIN_NEEDED_TO; step->max_needed_from = MAX_NEEDED_TO; step->min_needed_to = MIN_NEEDED_FROM; step->max_needed_to = MAX_NEEDED_FROM; } + else + return GCONV_NOCONV; #ifdef RESET_STATE step->stateful = 1; @@ -210,22 +209,17 @@ FUNCTION_NAME (struct gconv_step *step, struct gconv_step_data *data, dropped. */ if (do_flush) { - /* Call the steps down the chain if there are any. */ - if (data->is_last) - status = GCONV_OK; - else - { -#ifdef EMIT_SHIFT_TO_INIT - status = GCONV_OK; + status = GCONV_OK; - EMIT_SHIFT_TO_INIT; - - if (status == GCONV_OK) +#ifdef EMIT_SHIFT_TO_INIT + /* Emit the escape sequence to reset the state. */ + EMIT_SHIFT_TO_INIT; #endif - /* Give the modules below the same chance. */ - status = DL_CALL_FCT (fct, (next_step, next_data, NULL, NULL, - written, 1)); - } + /* Call the steps down the chain if there are any but only if we + successfully emitted the escape sequence. */ + if (status == GCONV_OK && ! data->is_last) + status = DL_CALL_FCT (fct, (next_step, next_data, NULL, NULL, + written, 1)); } else { @@ -271,7 +265,7 @@ FUNCTION_NAME (struct gconv_step *step, struct gconv_step_data *data, data->statep, step->data, &converted EXTRA_LOOP_ARGS); - /* If this is the last step leave the loop, there is nothgin + /* If this is the last step leave the loop, there is nothing we can do. 
*/ if (data->is_last) { diff --git a/iconvdata/iso-2022-jp.c b/iconvdata/iso-2022-jp.c index 36465ccd45..a7ec09b32d 100644 --- a/iconvdata/iso-2022-jp.c +++ b/iconvdata/iso-2022-jp.c @@ -1,5 +1,5 @@ /* Conversion module for ISO-2022-JP. - Copyright (C) 1998 Free Software Foundation, Inc. + Copyright (C) 1998, 1999 Free Software Foundation, Inc. This file is part of the GNU C Library. Contributed by Ulrich Drepper , 1998. @@ -149,14 +149,14 @@ gconv_init (struct gconv_step *step) step->min_needed_from = MIN_NEEDED_FROM; step->max_needed_from = MAX_NEEDED_FROM; step->min_needed_to = MIN_NEEDED_TO; - step->max_needed_to = MIN_NEEDED_TO; + step->max_needed_to = MAX_NEEDED_TO; } else { step->min_needed_from = MIN_NEEDED_TO; step->max_needed_from = MAX_NEEDED_TO; step->min_needed_to = MIN_NEEDED_FROM; - step->max_needed_to = MIN_NEEDED_FROM + 2; + step->max_needed_to = MAX_NEEDED_FROM + 2; } /* Yes, this is a stateful encoding. */ diff --git a/include/wctype.h b/include/wctype.h index c76f50c866..f93ec64abc 100644 --- a/include/wctype.h +++ b/include/wctype.h @@ -1 +1,7 @@ +#ifndef _WCTYPE_H + #include + +extern int __iswspace __P ((wint_t __wc)); + +#endif diff --git a/io/fts.c b/io/fts.c index 4ce6527441..cf52d9e299 100644 --- a/io/fts.c +++ b/io/fts.c @@ -231,6 +231,7 @@ fts_close(sp) { register FTSENT *freep, *p; int saved_errno; + int retval = 0; /* * This still works if we haven't read anything -- the dummy structure @@ -259,15 +260,16 @@ fts_close(sp) (void)__close(sp->fts_rfd); } - /* Free up the stream pointer. */ - free(sp); - /* Set errno and return. */ if (!ISSET(FTS_NOCHDIR) && saved_errno) { __set_errno (saved_errno); - return (-1); + retval = -1; } - return (0); + + /* Free up the stream pointer. 
*/ + free (sp); + + return retval; } /* diff --git a/manual/Makefile b/manual/Makefile index e0dad4792c..8eb4d5b69e 100644 --- a/manual/Makefile +++ b/manual/Makefile @@ -49,7 +49,7 @@ endif mkinstalldirs = $(..)scripts/mkinstalldirs chapters = $(addsuffix .texi, \ - intro errno memory ctype string mbyte locale \ + intro errno memory ctype string charset locale \ message search pattern io stdio llio filesys \ pipe socket terminal math arith time setjmp \ signal startup process job nss users sysinfo conf) @@ -74,7 +74,7 @@ libc.dvi: texinfo.tex # Generate the summary from the Texinfo source files for each chapter. summary.texi: stamp-summary ; stamp-summary: summary.awk $(filter-out summary.texi, $(texis)) - $(AWK) -f $^ | sort -t'^L' -df +0 -1 | tr '\014' '\012' > summary-tmp + $(AWK) -f $^ | sort -t' ' -df +0 -1 | tr '\014' '\012' > summary-tmp $(move-if-change) summary-tmp summary.texi touch $@ diff --git a/manual/chapters.texi b/manual/chapters.texi index a5a8a57903..bf7c4c01e0 100644 --- a/manual/chapters.texi +++ b/manual/chapters.texi @@ -3,7 +3,7 @@ @include memory.texi @include ctype.texi @include string.texi -@include mbyte.texi +@include charset.texi @include locale.texi @include message.texi @include search.texi @@ -27,6 +27,7 @@ @include users.texi @include sysinfo.texi @include conf.texi +@include ../crypt/crypt.texi @include ../linuxthreads/linuxthreads.texi @include lang.texi @include header.texi diff --git a/manual/charset.texi b/manual/charset.texi new file mode 100644 index 0000000000..6179128e3c --- /dev/null +++ b/manual/charset.texi @@ -0,0 +1,2846 @@ +@node Character Set Handling, Locales, String and Array Utilities, Top +@c %MENU% Support for extended character sets +@chapter Character Set Handling + +@ifnottex +@macro cal{text} +\text\ +@end macro +@end ifnottex + +Character sets used in the early days of computers had only six, seven, +or eight bits for each character. 
In no case more bits than would fit +into one byte which nowadays is almost exclusively @w{8 bits} wide. +This of course leads to several problems once not all characters needed +at one time can be represented by the up to 256 available characters. +This chapter shows the functionality which was added to the C library to +overcome this problem. + +@menu +* Extended Char Intro:: Introduction to Extended Characters. +* Charset Function Overview:: Overview about Character Handling + Functions. +* Restartable multibyte conversion:: Restartable multibyte conversion + Functions. +* Non-reentrant Conversion:: Non-reentrant Conversion Function. +* Generic Charset Conversion:: Generic Charset Conversion. +@end menu + + +@node Extended Char Intro +@section Introduction to Extended Characters + +To overcome the limitations of character sets with a 1:1 relation +between bytes and characters people came up with a variety of solutions. +The remainder of this section gives a few examples to help understanding +the design decision made while developing the functionality of the @w{C +library} to support them. + +@cindex internal representation +A distinction we have to make right away is between internal and +external representation. @dfn{Internal representation} means the +representation used by a program while keeping the text in memory. +External representations are used when text is stored or transmitted +through whatever communication channel. + +Traditionally there was no difference between the two representations. +It was equally comfortable and useful to use the same one-byte +representation internally and externally. This changes with more and +larger character sets. + +One of the problems to overcome with the internal representation is +handling text which were externally encoded using different character +sets. Assume a program which reads two texts and compares them using +some metric. 
The comparison can be usefully done only if the texts are +internally kept in a common format. + +@cindex wide character +For such a common format (@math{=} character set) eight bits are certainly +not enough anymore. So the smallest entity will have to grow: @dfn{wide +characters} will be used. Here instead of one byte one uses two or four +(three are not good to address in memory and more than four bytes seem +not to be necessary). + +@cindex Unicode +@cindex ISO 10646 +As shown in some other part of this manual +@c !!! Ahem, wide char string functions are not yet covered -- drepper +there exists a completely new family of functions which can handle texts +of this kind in memory. The most commonly used character sets for such +internal wide character representations are Unicode and @w{ISO 10646}. +The former is a subset of the latter and used when wide characters are +chosen to be 2 bytes (@math{= 16} bits) wide. The standard names of the +@cindex UCS2 +@cindex UCS4 +encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4 +(@math{= 32} bits). + +To represent wide characters the @code{char} type is certainly not +suitable. For this reason the @w{ISO C} standard introduces a new type +which is designed to keep one character of a wide character string. To +maintain the similarity there is also a type corresponding to @code{int} +for those functions which take a single wide character. + +@comment stddef.h +@comment ISO +@deftp {Data type} wchar_t +This data type is used as the base type for wide character strings. +I.e., arrays of objects of this type are the equivalent of @code{char[]} +for multibyte character strings. The type is defined in @file{stddef.h}. + +The @w{ISO C89} standard, where this type was introduced, does not say +anything specific about the representation. It only requires that this +type is capable of storing all elements of the basic character set. +Therefore it would be legitimate to define @code{wchar_t} as +@code{char}.
This might make sense for embedded systems. + +But for GNU systems this type is always 32 bits wide. It is therefore +capable of representing all UCS4 values, therefore covering all of @w{ISO +10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and +thereby follow Unicode very strictly. This is perfectly fine with the +standard but it also means that to represent all characters from Unicode +and @w{ISO 10646} one has to use surrogate characters, which is in fact a +multi-wide-character encoding. But this contradicts the purpose of the +@code{wchar_t} type. +@end deftp + +@comment wchar.h +@comment ISO +@deftp {Data type} wint_t +@code{wint_t} is a data type used for parameters and variables which +contain a single wide character. As the name already suggests it is the +equivalent to @code{int} when using the normal @code{char} strings. The +types @code{wchar_t} and @code{wint_t} often have the same +representation if their size is 32 bits but if @code{wchar_t} is +defined as @code{char} the type @code{wint_t} must be defined as +@code{int} due to the parameter promotion. + +@pindex wchar.h +This type is defined in @file{wchar.h} and got introduced in the second +amendment to @w{ISO C89}. +@end deftp + +As for the @code{char} data type, there also exist macros +specifying the minimum and maximum value representable in an object of +type @code{wchar_t}. + +@comment wchar.h +@comment ISO +@deftypevr Macro wint_t WCHAR_MIN +The macro @code{WCHAR_MIN} evaluates to the minimum value representable +by an object of type @code{wchar_t}. + +This macro got introduced in the second amendment to @w{ISO C89}. +@end deftypevr + +@comment wchar.h +@comment ISO +@deftypevr Macro wint_t WCHAR_MAX +The macro @code{WCHAR_MAX} evaluates to the maximum value representable +by an object of type @code{wchar_t}. + +This macro got introduced in the second amendment to @w{ISO C89}. +@end deftypevr + +Another special wide character value is the equivalent to @code{EOF}.
+ +@comment wchar.h +@comment ISO +@deftypevr Macro wint_t WEOF +The macro @code{WEOF} evaluates to a constant expression of type +@code{wint_t} whose value is different from any member of the extended +character set. + +@code{WEOF} need not be the same value as @code{EOF} and unlike +@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like + +@smallexample +@{ + int c; + ... + while ((c = getc (fp)) >= 0) + ... +@} +@end smallexample + +@noindent +has to be rewritten to explicitly use @code{WEOF} when wide characters +are used. + +@smallexample +@{ + wint_t c; + ... + while ((c = getwc (fp)) != WEOF) + ... +@} +@end smallexample + +@pindex wchar.h +This macro was introduced in the second amendment to @w{ISO C89} and is +defined in @file{wchar.h}. +@end deftypevr + + +These internal representations present problems when it comes to storing +and transmitting them. Since a single wide character consists of more +than one byte they are affected by byte-ordering. I.e., machines with +different endiannesses would see different values when accessing the same data. +This also applies to communication protocols which are all byte-based +and therefore the sender has to decide about splitting the wide +character into bytes. A last but not least important point is that wide +characters often require more storage space than a customized byte-oriented +character set. + +@cindex multibyte character +This is why most of the time an external encoding which is different +from the internal encoding is used if the latter is UCS2 or UCS4. The +external encoding is byte-based and can be chosen appropriately for the +environment and for the texts to be handled. There exists a variety of +different character sets which can be used, too many to be +handled completely here. We restrict ourselves here to a description of +the major groups. All of the ASCII-based character sets fulfill one +requirement: they are ``filesystem safe''.
This means that the +character @code{'/'} is used in the encoding @emph{only} to represent +itself. Things are a bit different for character sets like EBCDIC but if the +operating system does not understand EBCDIC directly the parameters to +system calls have to be converted first anyhow. + +@itemize @bullet +@item +The simplest character sets are one-byte character sets. There can be +only up to 256 characters (for @w{8 bit} character sets) which is not +sufficient to cover all languages but might be sufficient to handle a +specific text. Another reason to choose this is because of constraints +from interaction with other programs. + +@cindex ISO 2022 +@item +The @w{ISO 2022} standard defines a mechanism for extended character +sets where one character @emph{can} be represented by more than one +byte. This is achieved by associating a state with the text. Embedded +in the text can be characters which can be used to change the state. +Each byte in the text might have a different interpretation in each +state. The state might even influence whether a given byte stands for a +character on its own or whether it has to be combined with some more +bytes. + +@cindex EUC +@cindex SJIS +In most uses of @w{ISO 2022} the defined character sets do not allow +state changes which cover more than the next character. This has the +big advantage that whenever one can identify the beginning of the byte +sequence of a character one can interpret a text correctly. Examples of +character sets using this policy are the various EUC character sets +(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN) +or SJIS (Shift JIS, a Japanese encoding). + +But there are also character sets using a state which is valid for more +than one character and has to be changed by another byte sequence. +Examples of this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+ +@item +@cindex ISO 6937 +Early attempts to fix 8 bit character sets for other languages using the +Roman alphabet led to character sets like @w{ISO 6937}. Here bytes +representing characters like the acute accent do not produce output on +their own. One has to combine them with other characters. E.g., one writes the +byte sequence @code{0xc2 0x61} (non-spacing acute accent, followed by +lower-case `a') to get the ``small a with acute'' character. To get the +acute accent character on its own one has to write @code{0xc2 0x20} (the +non-spacing acute followed by a space). + +This type of character set is quite frequently used in embedded +systems such as video text. + +@item +@cindex UTF-8 +Instead of converting the Unicode or @w{ISO 10646} text used internally +it is often also sufficient to simply use an encoding different from +UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an +encoding: UTF-8. This encoding is able to represent all of @w{ISO +10646} 31 bits in a byte string of length one to six. + +@cindex UTF-7 +There were a few other attempts to encode @w{ISO 10646} such as UTF-7 +but UTF-8 is today the only encoding which should be used. In fact, +UTF-8 will hopefully soon be the only external encoding which has to be +supported. It proves to be universally usable and the only disadvantage +is that it favors Latin languages very much by making the byte string +representation of other scripts (Cyrillic, Greek, Asian scripts) longer +than necessary if using a specific character set for these scripts. But +with methods like the Unicode compression scheme one can overcome these +problems and the ever growing memory and storage capacities do the rest. +@end itemize + +The question remaining now is: how to select the character set or +encoding to use. The answer is mostly: you cannot decide about it +yourself, it is decided by the developers of the system or the majority +of the users.
Since the goal is interoperability one has to use +whatever the other people one works with use. If there are no +constraints the selection is based on the requirements the expected +circle of users will have. I.e., if a project is expected to only be +used in, say, Russia it is fine to use KOI8-R or a similar character +set. But if at the same time people from, say, Greece are participating +one should use a character set which allows all people to collaborate. + +General advice here could be: go with the most general character set, +namely @w{ISO 10646}. Use UTF-8 as the external encoding and problems +about users not being able to use their own language adequately are a +thing of the past. + +One final comment about the choice of the wide character representation +is necessary at this point. We have said above that the natural choice +is using Unicode or @w{ISO 10646}. This is not specified in any +standard, though. The @w{ISO C} standard does not specify anything +specific about the @code{wchar_t} type. There might be systems where +the developers decided differently. Therefore one should as much as +possible avoid making assumptions about the wide character representation +although GNU systems will always work as described above. If the +programmer uses only the functions provided by the C library to handle +wide character strings there should not be any compatibility problems +with other systems. + +@node Charset Function Overview +@section Overview about Character Handling Functions + +A Unix @w{C library} contains three different sets of functions in two +families to handle character set conversion. One function family +is specified in the @w{ISO C} standard and therefore is portable even +beyond the Unix world. + +The most commonly known set of functions, coming from the @w{ISO C89} +standard, is unfortunately the least useful one.
In fact, these +functions should be avoided whenever possible, especially when +developing libraries (as opposed to applications). + +The second family of functions got introduced in the early Unix standards +(XPG2) and is still part of the latest and greatest Unix standard: +@w{Unix 98}. It is also the most powerful and useful set of functions. +But we will start with the functions defined in the second amendment to +@w{ISO C89}. + +@node Restartable multibyte conversion +@section Restartable Multibyte Conversion Functions + +The @w{ISO C} standard defines functions to convert strings from a +multibyte representation to wide character strings. There are a number +of peculiarities: + +@itemize @bullet +@item +The character set assumed for the multibyte encoding is not specified +as an argument to the functions. Instead the character set specified by +the @code{LC_CTYPE} category of the current locale is used; see +@ref{Locale Categories}. + +@item +The functions handling more than one character at a time require NUL +terminated strings as the argument. I.e., converting blocks of text +does not work unless one can add a NUL byte at an appropriate place. +The GNU C library contains some extensions to the standard which allow +specifying a size but basically they also expect terminated strings. +@end itemize + +Despite these limitations the @w{ISO C} functions can very well be used +in many contexts. In graphical user interfaces, for instance, it is not +uncommon to have functions which require text to be displayed in a wide +character string if it is not simple ASCII. The text itself might come +from a file with translations and of course the user should decide about +the current locale which determines the translation and therefore also +the external encoding used. In such a situation (and many others) the +functions described here are perfect.
If more freedom while performing +the conversion is necessary take a look at the @code{iconv} functions +(@pxref{Generic Charset Conversion}) + +@menu +* Selecting the Conversion:: Selecting the conversion and its properties. +* Keeping the state:: Representing the state of the conversion. +* Converting a Character:: Converting Single Characters. +* Converting Strings:: Converting Multibyte and Wide Character + Strings. +* Multibyte Conversion Example:: A Complete Multibyte Conversion Example. +@end menu + +@node Selecting the Conversion +@subsection Selecting the conversion and its properties + +We already said above that the currently selected locale for the +@code{LC_CTYPE} category decides about the conversion which is performed +by the functions we are about to describe. Each locale uses its own +character set (given as an argument to @code{localedef}) and this is the +one assumed as the external multibyte encoding. The wide character +character set always is UCS4. So we can see here already where the +limitations of these conversion functions are. + +A characteristic of each multibyte character set is the maximum number +of bytes which can be necessary to represent one character. This +information is quite important when writing code which uses the +conversion functions. In the examples below we will see some examples. +The @w{ISO C} standard defines two macros which provide this information. + + +@comment limits.h +@comment ISO +@deftypevr Macro int MB_LEN_MAX +This macro specifies the maximum number of bytes in the multibyte +sequence for a single character in any of the supported locales. It is +a compile-time constant and it is defined in @file{limits.h}. +@pindex limits.h +@end deftypevr + +@comment stdlib.h +@comment ISO +@deftypevr Macro int MB_CUR_MAX +@code{MB_CUR_MAX} expands into a positive integer expression that is the +maximum number of bytes in a multibyte character in the current locale. +The value is never greater than @code{MB_LEN_MAX}. 
Unlike +@code{MB_LEN_MAX} this macro need not be a compile-time constant and in +fact, in the GNU C library it is not. + +@pindex stdlib.h +@code{MB_CUR_MAX} is defined in @file{stdlib.h}. +@end deftypevr + +Two different macros are necessary since strict @w{ISO C89} compilers +do not allow variable length array definitions but still it is desirable +to avoid dynamic allocation. This incomplete piece of code shows the +problem: + +@smallexample +@{ + char buf[MB_LEN_MAX]; + ssize_t len = 0; + + while (! feof (fp)) + @{ + fread (&buf[len], 1, MB_CUR_MAX - len, fp); + /* @r{... process} buf */ + len -= used; + @} +@} +@end smallexample + +The code in the inner loop is expected to have always enough bytes in +the array @var{buf} to convert one multibyte character. The array +@var{buf} has to be sized statically since many compilers do not allow a +variable size. The @code{fread} call makes sure that always +@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it is no +problem if @code{MB_CUR_MAX} is not a compile-time constant. + + +@node Keeping the state +@subsection Representing the state of the conversion + +@cindex stateful +In the introduction of this chapter it was said that certain character +sets use a @dfn{stateful} encoding. I.e., the encoded values depend in +some way on the previous bytes in the text. + +Since the conversion functions allow converting a text in more than one +step we must have a way to pass this information from one call of the +functions to another. + +@comment wchar.h +@comment ISO +@deftp {Data type} mbstate_t +@cindex shift state +A variable of type @code{mbstate_t} can contain all the information +about the @dfn{shift state} needed from one call to a conversion +function to another. + +@pindex wchar.h +This type is defined in @file{wchar.h}. It got introduced in the second +amendment to @w{ISO C89}.
+@end deftp
+
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions. This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rules are that the object should always
+represent the initial state before the first use and this is achieved by
+clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.} */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state. This is necessary, for example, to decide whether or not to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by @var{ps}
+is in the initial state or not. If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.} */
+  ...
+  if (! mbsinit (&state))
+    @{
+      /* @r{Emit code to return to initial state.} */
+      fputs ("@r{whatever needed}", fp);
+    @}
+  ...
+@}
+@end smallexample
+
+@node Converting a Character
+@subsection Converting Single Characters
+
+The most fundamental of the conversion functions are those dealing with
+single characters. Please note that this does not always mean single
+bytes. But since there is very often a subset of the multibyte
+character set which consists of single byte sequences there are
+functions to help with converting bytes. One very important and often
+applicable scenario is where ASCII is a subpart of the multibyte
+character set. I.e., all ASCII characters stand for themselves and all
+other characters have at least a first byte which is beyond the range
+@math{0} to @math{127}.
+
+@comment wchar.h
+@comment ISO
+@deftypefun wint_t btowc (int @var{c})
+The @code{btowc} function (``byte to wide character'') converts a valid
+single byte character in the initial shift state into the wide character
+equivalent using the conversion rules from the currently selected locale
+of the @code{LC_CTYPE} category.
+
+If @code{(unsigned char) @var{c}} is no valid single byte multibyte
+character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
+
+Please note the restriction of @var{c} being tested for validity only in
+the initial shift state. There is no @code{mbstate_t} object used from
+which the state information is taken and the function also does not use
+any static state.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Despite the limitation that the single byte value always is interpreted
+in the initial state this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or
+they are extensions to ASCII.
But then it is possible to write code like this
+(not that this specific example is useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+  static wchar_t buf[30];
+  wchar_t *wcp = &buf[29];
+  *wcp = L'\0';
+  while (val != 0)
+    @{
+      *--wcp = btowc ('0' + val % 10);
+      val /= 10;
+    @}
+  if (wcp == &buf[29])
+    *--wcp = btowc ('0');
+  return wcp;
+@}
+@end smallexample
+
+Why is it necessary to use such a complicated implementation and not
+simply cast @code{'0' + val % 10} to a wide character? The answer
+is that there is no guarantee that the compiler knows about the wide
+character set used at runtime. Even if on some systems the wide
+character equivalent of a single-byte character is obtained by simply
+casting it to @code{wchar_t}, there is no guarantee that this
+is the case everywhere.
+
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character. If the multibyte representation for
+this character in the initial state is exactly one byte long the return
+value of this function is this character. Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+There are more general functions to convert a single character from
+multibyte representation to wide characters and vice versa. These
+functions pose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
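
The single-byte functions @code{btowc} and @code{wctob} described above are inverses of each other for every byte that forms a complete character in the initial shift state. The following is a minimal sketch of that round trip, not part of the library: the helper names @code{byte_round_trips} and @code{check_digits} are hypothetical, and the assertion in @code{check_digits} assumes a locale whose character set contains the ASCII digits as single bytes (true for the default @code{"C"} locale).

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: return 1 if the byte C is a complete single-byte
   character in the initial shift state of the current LC_CTYPE locale
   and wctob maps its wide equivalent back to the same byte.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    /* C does not form a valid character on its own.  */
    return 0;
  return wctob (wc) == (int) c;
}

/* Hypothetical check of the property the itow example relies on:
   every ASCII digit is a single-byte character.  This is an assumption
   about the locale, not a guarantee of the standard.  */
void
check_digits (void)
{
  for (int c = '0'; c <= '9'; ++c)
    assert (byte_round_trips ((unsigned char) c));
}
```

A caller could use @code{byte_round_trips} to decide whether a byte may be handled with the cheap single-byte functions or needs the general functions described next.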
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
+@cindex stateful
+The @code{mbrtowc} function (``multibyte restartable to wide
+character'') converts the next multibyte character in the string pointed
+to by @var{s} into a wide character and stores it in the wide character
+string pointed to by @var{pwc}. The conversion is performed according
+to the locale currently selected for the @code{LC_CTYPE} category. If
+the character set for the locale is stateful the multibyte string is
+interpreted in the state represented by the object pointed to by
+@var{ps}. If @var{ps} is a null pointer a static, internal state
+variable used only by the @code{mbrtowc} function is used.
+
+If the next multibyte character corresponds to the NUL wide character
+the return value of the function is @math{0} and the state object is
+afterwards in the initial state. If the next @var{n} or fewer bytes
+form a correct multibyte character the return value is the number of
+bytes starting from @var{s} which form the multibyte character. The
+conversion state is updated according to the bytes consumed in the
+conversion. In both cases the wide character (either the @code{L'\0'}
+or the one found in the conversion) is stored in the string pointed to
+by @var{pwc} if @var{pwc} is not null.
+
+If the first @var{n} bytes of the multibyte string possibly form a valid
+multibyte character but there are more than @var{n} bytes needed to
+complete it the return value of the function is @code{(size_t) -2} and
+no value is stored. Please note that this can happen even if @var{n}
+has a value greater than or equal to @code{MB_CUR_MAX} since the input
+might contain redundant shift sequences.
+
+If the first @var{n} bytes of the multibyte string cannot possibly
+form a valid multibyte character also no value is stored, the global
+variable @code{errno} is set to the value @code{EILSEQ} and the function
+returns @code{(size_t) -1}. The conversion state is afterwards undefined.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Using this function is straightforward. A function which copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  size_t len = strlen (s);
+  wchar_t *result = malloc ((len + 1) * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* Invalid input string. */
+        return NULL;
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  /* Terminate the result string. */
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear. A single wide character is
+stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored
+in the variable @var{nbytes}. In case the conversion was successful
+the uppercase variant of the wide character is stored in the
+@var{result} array and the pointer to the input string and the number of
+available bytes are adjusted.
+
+The only non-obvious thing about the function might be the way memory is
+allocated for the result. The above code uses the fact that there can
+never be more wide characters in the converted results than there are
+bytes in the multibyte input string.
This method yields a
+pessimistic guess about the size of the result and if many wide
+character strings have to be constructed this way or the strings are
+long, the extra memory required to store the wide character strings
+might be significant. It would of course be possible to resize the
+allocated memory block to the correct size before returning it. A
+better solution might be to allocate just the right amount of space for
+the result right away. Unfortunately there is no function to compute
+the length of the wide character string directly from the multibyte
+string. But there is a function which does part of the work.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
+The @code{mbrlen} function (``multibyte restartable length'') computes
+the number of at most @var{n} bytes starting at @var{s} which form the
+next valid and complete multibyte character.
+
+If the next multibyte character corresponds to the NUL wide character
+the return value is @math{0}. If the next @var{n} bytes form a valid
+multibyte character the number of bytes belonging to this multibyte
+character byte sequence is returned.
+
+If the first @var{n} bytes possibly form a valid multibyte
+character but it is incomplete the return value is @code{(size_t) -2}.
+Otherwise the multibyte character sequence is invalid and the return
+value is @code{(size_t) -1}.
+
+The multibyte sequence is interpreted in the state represented by the
+object pointed to by @var{ps}. If @var{ps} is a null pointer a state
+object local to @code{mbrlen} is used.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+The attentive reader now will of course note that @code{mbrlen} can be
+implemented as
+
+@smallexample
+mbrtowc (NULL, s, n, ps != NULL ?
ps : &internal)
+@end smallexample
+
+This is true and in fact is mentioned in the official specification.
+Now, how can this function be used to determine the length of the wide
+character string created from a multibyte character string? It is not
+directly usable but we can define a function @code{mbslen} using it:
+
+@smallexample
+size_t
+mbslen (const char *s)
+@{
+  mbstate_t state;
+  size_t result = 0;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* @r{Something is wrong.} */
+        return (size_t) -1;
+      s += nbytes;
+      ++result;
+    @}
+  return result;
+@}
+@end smallexample
+
+This function simply calls @code{mbrlen} for each multibyte character
+in the string and counts the number of function calls. Please note that
+we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
+call. This is OK since a) this value is at least as large as the length
+of the longest multibyte character sequence and b) because we know that
+the string @var{s} ends with a NUL byte which cannot be part of any other
+multibyte character sequence but the one representing the NUL wide
+character. Therefore the @code{mbrlen} function will never read invalid
+memory.
+
+Now that this function is available (just to make this clear, this
+function is @emph{not} part of the GNU C library) we can compute the
+number of bytes required to store the wide character string converted
+from the multibyte character string @var{s} using
+
+@smallexample
+wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
+@end smallexample
+
+Please note that the @code{mbslen} function is quite inefficient. The
+implementation of @code{mbstouwcs} implemented using @code{mbslen} would
+have to perform the conversion of the multibyte character input string
+twice and this conversion might be quite expensive. So it is necessary
+to think about the consequences of using the easier but imprecise method
+before doing the work twice.
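
To make the trade-off concrete, here is a minimal sketch of the exact-size variant: one pass with @code{mbrlen} counts the characters, a second pass with @code{mbrtowc} converts them. The names @code{count_mbchars} and @code{mbs_to_wcs_exact} are hypothetical and not part of the GNU C library; the sketch assumes a NUL-terminated input and leaves ownership of the result to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical counterpart of the mbslen function from the text: count
   the multibyte characters in S.  Returns (size_t) -1 if the input is
   invalid or incomplete.  */
static size_t
count_mbchars (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;

  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}

/* Hypothetical helper: convert S into a newly allocated wide character
   string of exactly the required size.  The caller frees the result.
   Returns NULL on invalid input or allocation failure.  */
wchar_t *
mbs_to_wcs_exact (const char *s)
{
  mbstate_t state;
  wchar_t *result;
  size_t nchars;
  size_t i;

  assert (s != NULL);
  /* First pass: count only.  */
  nchars = count_mbchars (s);
  if (nchars == (size_t) -1)
    return NULL;
  result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second pass: convert, starting again from the initial state.  */
  memset (&state, '\0', sizeof (state));
  for (i = 0; i < nchars; ++i)
    {
      size_t nbytes = mbrtowc (&result[i], s, MB_LEN_MAX, &state);
      if (nbytes >= (size_t) -2)
        {
          free (result);
          return NULL;
        }
      s += nbytes;
    }
  result[nchars] = L'\0';
  return result;
}
```

The sketch pays for the exact allocation with a second scan of the input, which is the cost the paragraph above warns about.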
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
+The @code{wcrtomb} function (``wide character restartable to
+multibyte'') converts a single wide character into a multibyte string
+corresponding to that wide character.
+
+If @var{s} is a null pointer the function resets the state stored in the
+object pointed to by @var{ps} to the initial state. This can also be
+achieved by a call like this:
+
+@smallexample
+wcrtomb (temp_buf, L'\0', ps)
+@end smallexample
+
+@noindent
+since when @var{s} is a null pointer @code{wcrtomb} performs as if it
+writes into an internal buffer which is guaranteed to be large enough.
+
+If @var{wc} is the NUL wide character @code{wcrtomb} emits, if
+necessary, a shift sequence to get the state @var{ps} into the initial
+state followed by a single NUL byte, which is stored in the string
+@var{s}.
+
+Otherwise a byte sequence (possibly including shift sequences) is
+written into the string @var{s}. This of course only happens if
+@var{wc} is a valid wide character, i.e., it has a multibyte
+representation in the character set selected by the locale of the
+@code{LC_CTYPE} category. If @var{wc} is no valid wide character
+nothing is stored in the string @var{s}, @code{errno} is set to
+@code{EILSEQ}, the conversion state in @var{ps} is undefined and the
+return value is @code{(size_t) -1}.
+
+If no error occurred the function returns the number of bytes stored in
+the string @var{s}. This includes all bytes representing shift
+sequences.
+
+One word about the interface of the function: there is no parameter
+specifying the length of the array @var{s}. Instead the function
+assumes that there are at least @code{MB_CUR_MAX} bytes available since
+this is the maximum length of any byte sequence representing a single
+character. So the caller has to make sure that there is enough space
+available, otherwise buffer overruns can occur.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Using this function is as easy as using @code{mbrtowc}. The following
+example appends a wide character string to a multibyte character string.
+Again, the code is not really useful, it is simply here to demonstrate
+the use and some problems.
+
+@smallexample
+char *
+mbscatwc (char *s, size_t len, const wchar_t *ws)
+@{
+  mbstate_t state;
+  char *wp = strchr (s, '\0');
+  len -= wp - s;
+  memset (&state, '\0', sizeof (state));
+  do
+    @{
+      size_t nbytes;
+      if (len < MB_CUR_MAX)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.} */
+          errno = E2BIG;
+          return NULL;
+        @}
+      nbytes = wcrtomb (wp, *ws, &state);
+      if (nbytes == (size_t) -1)
+        /* @r{Error in the conversion.} */
+        return NULL;
+      len -= nbytes;
+      wp += nbytes;
+    @}
+  while (*ws++ != L'\0');
+  return s;
+@}
+@end smallexample
+
+First the function has to find the end of the string currently in the
+array @var{s}. The @code{strchr} call does this very efficiently since a
+requirement for multibyte character representations is that the NUL byte
+never is used except to represent itself (and in this context, the end
+of the string).
+
+After initializing the state object the loop is entered where the first
+task is to make sure there is enough room in the array @var{s}. We
+abort if there are not at least @code{MB_CUR_MAX} bytes available. This
+is not always optimal but we have no other choice. We might have less
+than @code{MB_CUR_MAX} bytes available but the next multibyte character
+might also be only one byte long. At the time the @code{wcrtomb} call
+returns it is too late to decide whether the buffer was large enough or
+not. If this solution is really unsuitable there is a very slow but
+more accurate solution.
+
+@smallexample
+  ...
+  if (len < MB_CUR_MAX)
+    @{
+      mbstate_t temp_state;
+      char temp_buf[MB_LEN_MAX];
+      memcpy (&temp_state, &state, sizeof (state));
+      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
+        @{
+          /* @r{We cannot guarantee that the next}
+             @r{character fits into the buffer, so}
+             @r{return an error.} */
+          errno = E2BIG;
+          return NULL;
+        @}
+    @}
+  ...
+@end smallexample
+
+Here we do perform the conversion which might overflow the buffer so
+that we are afterwards in the position to make an exact decision about
+the buffer size. Please note the temporary buffer used as the
+destination in the new @code{wcrtomb} call; a null pointer cannot be
+used here since @code{wcrtomb} would then ignore @var{wc} and only reset
+the state. The most unusual thing about this piece of code certainly is
+the duplication of the conversion state object. But think about it: if a
+change of the state is necessary to emit the next multibyte character we
+want to have the same shift state change performed in the real
+conversion. Therefore we have to preserve the initial shift state
+information.
+
+There are certainly many more and even better solutions to this problem.
+This example is only meant for educational purposes.
+
+@node Converting Strings
+@subsection Converting Multibyte and Wide Character Strings
+
+The functions described in the previous section only convert a single
+character at a time. Most operations to be performed in real-world
+programs include strings and therefore the @w{ISO C} standard also
+defines conversions on entire strings. The defined set of functions is
+quite limited, though. Therefore the GNU C library contains a few
+extensions which are necessary in some important situations.
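
As a rough preview of what the string functions buy, the following sketch converts a whole NUL-terminated multibyte string with two calls to @code{mbsrtowcs} (described below): the first call, with a null destination, only counts the wide characters; the second performs the conversion into a buffer of exactly that size. The function name @code{mbs_dup_wide} is hypothetical, and the sketch assumes the input is a complete string that starts in the initial shift state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character copy of
   the NUL-terminated multibyte string S, or NULL on invalid input or
   allocation failure.  The caller frees the result.  */
wchar_t *
mbs_dup_wide (const char *s)
{
  mbstate_t state;
  const char *src = s;
  wchar_t *result;
  size_t n;

  assert (s != NULL);

  /* First call: with a null destination nothing is stored and the
     length argument is ignored; the return value is the number of wide
     characters, not counting the terminating NUL.  */
  memset (&state, '\0', sizeof (state));
  n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    return NULL;

  result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second call: convert for real, again from the initial state.  The
     room for n + 1 elements lets the terminating NUL be stored too.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

Like the single-character variant, this scans the input twice; whether that is acceptable depends on how expensive the conversion is for the character set in use.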
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsrtowcs} function (``multibyte string restartable to wide
+character string'') converts a NUL terminated multibyte character
+string at @code{*@var{src}} into an equivalent wide character string,
+including the NUL wide character at the end. The conversion is started
+using the state information from the object pointed to by @var{ps} or
+from an internal object of @code{mbsrtowcs} if @var{ps} is a null
+pointer. Before returning, the state object is updated to match the
+state after the last converted character. The state is the initial
+state if the terminating NUL byte is reached and converted.
+
+If @var{dst} is not a null pointer the result is stored in the array
+pointed to by @var{dst}, otherwise the conversion result is not
+available since it is stored in an internal buffer.
+
+If @var{len} wide characters are stored in the array @var{dst} before
+reaching the end of the input string the conversion stops and @var{len}
+is returned. If @var{dst} is a null pointer @var{len} is never checked.
+
+Another reason for a premature return from the function call is if the
+input string contains an invalid multibyte sequence. In this case the
+global variable @code{errno} is set to @code{EILSEQ} and the function
+returns @code{(size_t) -1}.
+
+@c XXX The ISO C9x draft seems to have a problem here. It says that PS
+@c is not updated if DST is NULL. This is not said straight forward and
+@c none of the other functions is described like this. It would make sense
+@c to define the function this way but I don't think it is meant like this.
+
+In all other cases the function returns the number of wide characters
+converted during this call.
If @var{dst} is not null @code{mbsrtowcs}
+stores in the pointer pointed to by @var{src} a null pointer (if the NUL
+byte in the input string was reached) or the address of the byte
+following the last converted multibyte character.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+The definition of this function has one limitation which has to be
+understood. The requirement that the input at @code{*@var{src}} has to
+be a NUL terminated string causes problems if one wants to convert
+buffers with text. A
+buffer is normally no collection of NUL terminated strings but instead a
+continuous collection of lines, separated by newline characters. Now
+assume a function to convert one line from a buffer is needed. Since
+the line is not NUL terminated the source pointer cannot directly point
+into the unmodified text buffer. This means, either one inserts the NUL
+byte at the appropriate place for the time of the @code{mbsrtowcs}
+function call (which is not doable for a read-only buffer or in a
+multi-threaded application) or one copies the line in an extra buffer
+where it can be terminated by a NUL byte. Note that it is not in
+general possible to limit the number of characters to convert by setting
+the parameter @var{len} to any specific value. Since it is not known
+how many bytes each multibyte character sequence is in length one can
+only guess.
+
+@cindex stateful
+There is still a problem with the method of NUL-terminating a line right
+after the newline character which could lead to very strange results.
+As said in the description of the @code{mbsrtowcs} function above the
+conversion state is guaranteed to be in the initial shift state after
+processing the NUL byte at the end of the input string. But this NUL
+byte is not really part of the text.
I.e., the conversion state after
+the newline in the original text could be something different than the
+initial shift state and therefore the first character of the next line
+is encoded using this state. But the state in question is never
+accessible to the user since the conversion stops after the NUL byte.
+Fortunately most stateful character sets in use today require that the
+shift state after a newline is the initial state but this is no
+guarantee. Therefore simply NUL terminating a piece of a running text
+is not always the adequate solution.
+
+The generic conversion
+@comment XXX reference to iconv
+interface does not have this limitation (it simply works on buffers, not
+strings) but there is another way. The GNU C library contains a set of
+functions which take additional parameters specifying the maximal number
+of bytes which are consumed from the input string. This way the problem
+of the above example could be solved by determining the line length and
+passing this length to the function.
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsrtombs} function (``wide character string restartable to
+multibyte string'') converts the NUL terminated wide character string at
+@code{*@var{src}} into an equivalent multibyte character string and
+stores the result in the array pointed to by @var{dst}. The NUL wide
+character is also converted. The conversion starts in the state
+described in the object pointed to by @var{ps} or by a state object
+local to @code{wcsrtombs} in case @var{ps} is a null pointer. If
+@var{dst} is a null pointer the conversion is performed as usual but the
+result is not available. If all characters of the input string were
+successfully converted and if @var{dst} is not a null pointer the
+pointer pointed to by @var{src} gets assigned a null pointer.
+
+If one of the wide characters in the input string has no valid multibyte
+character equivalent the conversion stops early, sets the global
+variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
+
+Another reason for a premature stop is if @var{dst} is not a null
+pointer and the next converted character would require more than
+@var{len} bytes in total in the array @var{dst}. In this case (and if
+@var{dst} is not a null pointer) the pointer pointed to by @var{src} is
+assigned a value pointing to the wide character right after the last one
+successfully converted.
+
+Except in the case of an encoding error the return value of the function
+is the number of bytes in all the multibyte character sequences stored
+in @var{dst}. Before returning the state in the object pointed to by
+@var{ps} (or the internal object in case @var{ps} is a null pointer) is
+updated to reflect the state after the last conversion. The state is
+the initial shift state in case the terminating NUL wide character was
+converted.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+The restriction mentioned above for the @code{mbsrtowcs} function applies
+also here. There is no possibility to directly control the number of
+input characters. One has to place the NUL wide character at the
+correct place or control the consumed input indirectly via the available
+output array size (the @var{len} parameter).
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
+function. All the parameters are the same except for @var{nmc} which is
+new. The return value is the same as for @code{mbsrtowcs}.
+
+This new parameter specifies how many bytes at most can be used from the
+multibyte character string. I.e., the multibyte character string
+@code{*@var{src}} need not be NUL terminated. But if a NUL byte is
+found within the @var{nmc} first bytes of the string the conversion
+stops here.
+
+This function is a GNU extension. It is meant to work around the
+problems mentioned above. Now it is possible to convert a buffer with
+multibyte character text piece for piece without having to care about
+inserting NUL bytes and the effect of NUL bytes on the conversion state.
+@end deftypefun
+
+A function to convert a multibyte string into a wide character string
+and display it could be written like this (this is not a really useful
+example):
+
+@smallexample
+void
+showmbs (const char *src, FILE *fp)
+@{
+  mbstate_t state;
+  int cnt = 0;
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t linebuf[100];
+      const char *endp = strchr (src, '\n');
+      size_t n;
+
+      /* @r{Exit if there is no more line.} */
+      if (endp == NULL)
+        break;
+
+      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
+      linebuf[n] = L'\0';
+      fprintf (fp, "line %d: \"%S\"\n", ++cnt, linebuf);
+      /* @r{Continue after the newline; an overlong line is truncated.} */
+      src = endp + 1;
+    @}
+@}
+@end smallexample
+
+There is no more problem with the state after a call to
+@code{mbsnrtowcs}. Since we don't insert characters in the strings
+which were not in there right from the beginning and we use @var{state}
+only for the conversion of the given buffer there is no problem with
+mixing the state up.
+
+@comment wchar.h
+@comment GNU
+@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
+The @code{wcsnrtombs} function implements the conversion from wide
+character strings to multibyte character strings. It is similar to
+@code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra
+parameter which specifies the length of the input string.
+
+No more than @var{nwc} wide characters from the input string
+@code{*@var{src}} are converted. If the input string contains a NUL
+wide character in the first @var{nwc} characters the conversion stops at
+this place.
+
+This function is a GNU extension and just like @code{mbsnrtowcs} it
+helps in situations where no NUL terminated input strings are available.
+@end deftypefun
+
+
+@node Multibyte Conversion Example
+@subsection A Complete Multibyte Conversion Example
+
+The example programs given in the last sections are only brief and do
+not cont