Diffstat (limited to 'manual')
-rw-r--r--   manual/Makefile          4
-rw-r--r--   manual/chapters.texi     3
-rw-r--r--   manual/charset.texi   2846
-rw-r--r--   manual/ctype.texi      521
-rw-r--r--   manual/filesys.texi      4
-rw-r--r--   manual/intro.texi        2
-rw-r--r--   manual/lang.texi         2
-rw-r--r--   manual/libc.texinfo      4
-rw-r--r--   manual/locale.texi       6
-rw-r--r--   manual/memory.texi       3
-rw-r--r--   manual/stdio.texi        8
-rw-r--r--   manual/string.texi       2
-rw-r--r--   manual/texis             2
-rw-r--r--   manual/top-menu.texi    70
14 files changed, 3432 insertions, 45 deletions
diff --git a/manual/Makefile b/manual/Makefile
index e0dad4792c..8eb4d5b69e 100644
--- a/manual/Makefile
+++ b/manual/Makefile
@@ -49,7 +49,7 @@ endif
 mkinstalldirs = $(..)scripts/mkinstalldirs
 
 chapters = $(addsuffix .texi, \
-	    intro errno memory ctype string mbyte locale \
+	    intro errno memory ctype string charset locale \
 	    message search pattern io stdio llio filesys \
 	    pipe socket terminal math arith time setjmp \
 	    signal startup process job nss users sysinfo conf)
@@ -74,7 +74,7 @@ libc.dvi: texinfo.tex
 
 # Generate the summary from the Texinfo source files for each chapter.
 summary.texi: stamp-summary ;
 stamp-summary: summary.awk $(filter-out summary.texi, $(texis))
-	$(AWK) -f $^ | sort -t'^L' -df +0 -1 | tr '\014' '\012' > summary-tmp
+	$(AWK) -f $^ | sort -t'' -df +0 -1 | tr '\014' '\012' > summary-tmp
 	$(move-if-change) summary-tmp summary.texi
 	touch $@
diff --git a/manual/chapters.texi b/manual/chapters.texi
index a5a8a57903..bf7c4c01e0 100644
--- a/manual/chapters.texi
+++ b/manual/chapters.texi
@@ -3,7 +3,7 @@
 @include memory.texi
 @include ctype.texi
 @include string.texi
-@include mbyte.texi
+@include charset.texi
 @include locale.texi
 @include message.texi
 @include search.texi
@@ -27,6 +27,7 @@
 @include users.texi
 @include sysinfo.texi
 @include conf.texi
+@include ../crypt/crypt.texi
 @include ../linuxthreads/linuxthreads.texi
 @include lang.texi
 @include header.texi
diff --git a/manual/charset.texi b/manual/charset.texi
new file mode 100644
index 0000000000..6179128e3c
--- /dev/null
+++ b/manual/charset.texi
@@ -0,0 +1,2846 @@
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character: in no case more bits than fit into
+one byte, which nowadays is almost exclusively @w{8 bits} wide.  This of
+course leads to problems once the characters needed at one time cannot
+all be represented by the at most 256 available values.  This chapter
+describes the functionality which was added to the @w{C library} to
+overcome this problem.
+
+@menu
+* Extended Char Intro::              Introduction to Extended Characters.
+* Charset Function Overview::        Overview about Character Handling
+                                      Functions.
+* Restartable multibyte conversion:: Restartable Multibyte Conversion
+                                      Functions.
+* Non-reentrant Conversion::         Non-reentrant Conversion Function.
+* Generic Charset Conversion::       Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+To overcome the limitations of character sets with a 1:1 relation
+between bytes and characters, people came up with a variety of
+solutions.  The remainder of this section gives a few examples to help
+in understanding the design decisions made while developing the
+functionality of the @w{C library} to support them.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation.  @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through any kind of communication channel.
+
+Traditionally there was no difference between the two representations.
+It was equally comfortable and useful to use the same one-byte
+representation internally and externally.
+This changes with more and
+larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling texts which were externally encoded using different character
+sets.  Assume a program which reads two texts and compares them using
+some metric.  The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are
+certainly not enough anymore.  So the smallest entity will have to grow:
+@dfn{wide characters} will be used.  Instead of one byte, two or four
+bytes are used per character (three bytes are awkward to address in
+memory and more than four bytes seem not to be necessary).
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+there exists a completely new family of functions which can handle texts
+of this kind in memory.  The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}.
+The former is a subset of the latter and is used when wide characters
+are chosen to be 2 bytes (@math{= 16} bits) wide.
+@cindex UCS2
+@cindex UCS4
+The standard names of the encodings used in these cases are UCS2
+(@math{= 16} bits) and UCS4 (@math{= 32} bits).
+
+To represent wide characters the @code{char} type is certainly not
+suitable.  For this reason the @w{ISO C} standard introduces a new type
+which is designed to hold one character of a wide character string.  To
+maintain the similarity there is also a type corresponding to @code{int}
+for those functions which take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings.  The type is defined in @file{stddef.h}.
+
+The @w{ISO C89} standard, where this type was introduced, does not say
+anything specific about the representation.  It only requires that this
+type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as
+@code{char}.  This might make sense for embedded systems.
+
+But for GNU systems this type is always 32 bits wide.  It is therefore
+capable of representing all UCS4 values and therefore covers all of
+@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16 bit type
+and thereby follow Unicode very strictly.  This is perfectly fine with
+the standard, but it also means that to represent all characters from
+Unicode and @w{ISO 10646} one has to use surrogate characters, which is
+in fact a multi-wide-character encoding.  But this contradicts the
+purpose of the @code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables which
+contain a single wide character.  As the name suggests it is the
+equivalent of @code{int} for the normal @code{char} strings.  The types
+@code{wchar_t} and @code{wint_t} often have the same representation if
+their size is 32 bits, but if @code{wchar_t} is defined as @code{char}
+the type @code{wint_t} must be defined as @code{int} because of the
+parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+@w{Amendment 1} to @w{ISO C89}.
+@end deftp
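+
+The following example may help to make the relation between these types
+concrete.  It is only a sketch, and it assumes a GNU system where
+@code{wchar_t} is 32 bits wide and wide characters are UCS4 values; on
+other systems the sizes and the values printed may differ.
+
+@smallexample
+#include <stdio.h>
+#include <wchar.h>
+
+int
+main (void)
+@{
+  /* @r{A wide character string; every element is one} wchar_t@r{.} */
+  const wchar_t *ws = L"abc";
+  const wchar_t *p;
+
+  /* @r{On GNU systems} wchar_t @r{is 32 bits wide, enough for UCS4.} */
+  printf ("sizeof (wchar_t) = %lu\n",
+          (unsigned long int) sizeof (wchar_t));
+
+  /* wint_t @r{can hold every} wchar_t @r{value and additionally} WEOF@r{.} */
+  for (p = ws; *p != L'\0'; ++p)
+    @{
+      wint_t wc = (wint_t) *p;
+      printf ("U+%04lX\n", (unsigned long int) wc);
+    @}
+  return 0;
+@}
+@end smallexample
+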
+As for the @code{char} data type, there also exist macros specifying the
+minimum and maximum value representable in an object of type
+@code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wint_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C89}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wint_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C89}.
+@end deftypevr
+
+Another special wide character value is the equivalent of @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative.  I.e., sloppy code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) < 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to explicitly use @code{WEOF} when wide characters
+are used:
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in @w{Amendment 1} to @w{ISO C89} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to
+storing and transmitting them.  Since a single wide character consists
+of more than one byte, it is affected by byte ordering.  I.e., machines
+with different endianness would see different values when accessing the
+same data.  This also applies to communication protocols, which are all
+byte-based and therefore require the sender to decide how to split the
+wide character into bytes.  Last but not least, wide characters often
+require more storage space than a customized byte-oriented character
+set.
+
+@cindex multibyte character
+This is why most of the time an external encoding which is different
+from the internal encoding is used if the latter is UCS2 or UCS4.  The
+external encoding is byte-based and can be chosen appropriately for the
+environment and for the texts to be handled.  There exists a variety of
+character sets which can be used, too many to be covered completely
+here.  We restrict ourselves to a description of the major groups.  All
+of the ASCII-based character sets fulfill one requirement: they are
+``filesystem safe''.  This means that the character @code{'/'} is used
+in the encoding @emph{only} to represent itself.  Things are a bit
+different for character sets like EBCDIC, but if the operating system
+does not understand EBCDIC directly the parameters to system calls have
+to be converted first anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are one-byte character sets.  There can be
+only up to 256 characters (for @w{8 bit} character sets), which is not
+sufficient to cover all languages but might be sufficient to handle a
+specific text.  Another reason to choose a one-byte character set can be
+constraints from interaction with other programs.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte.
+This is achieved by associating a state with the text.  Characters
+which change the state can be embedded in the text.  Each byte in the
+text might have a different interpretation in each state.  The state
+might even influence whether a given byte stands for a character on its
+own or whether it has to be combined with some more bytes.
+
+@cindex EUC
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character.  This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly.  Examples
+of character sets using this policy are the various EUC character sets
+(used by Sun's operating systems: EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or SJIS (Shift JIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples of this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using
+the Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
+representing characters like the acute accent do not produce output on
+their own; one has to combine them with other characters.  E.g., one
+writes the byte sequence @code{0xc2 0x61} (non-spacing acute accent,
+followed by lower-case `a') to get the ``small a with acute''
+character.  To get the acute accent character on its own one has to
+write @code{0xc2 0x20} (the non-spacing acute followed by a space).
+
+This type of character set is quite frequently used in embedded systems
+such as videotext.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally
+it is often also sufficient to simply use an encoding different from
+UCS2/UCS4.  The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8.  This encoding is able to represent all 31 bits of
+@w{ISO 10646} in a byte string of length one to six (a sketch of the
+encoding scheme follows below).
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646}, such as UTF-7,
+but UTF-8 is today the only encoding which should be used.  In fact,
+UTF-8 will hopefully soon be the only external encoding which has to be
+supported.  It proves to be universally usable and its only disadvantage
+is that it favors Latin-based scripts by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary compared with a character set specific to these scripts.
+But with methods like the Unicode compression scheme one can overcome
+these problems, and the ever growing memory and storage capacities do
+the rest.
+@end itemize
+
+The question remaining now is: how does one select the character set or
+encoding to use?  The answer is mostly: you cannot decide about it
+yourself, it is decided by the developers of the system or the majority
+of the users.  Since the goal is interoperability one has to use
+whatever the other people one works with use.  If there are no
+constraints, the selection is based on the requirements the expected
+circle of users will have.  I.e., if a project is expected to be used
+only in, say, Russia it is fine to use KOI8-R or a similar character
+set.  But if at the same time people from, say, Greece are
+participating one should use a character set which allows all people to
+collaborate.
+
+General advice here could be: go with the most general character set,
+namely @w{ISO 10646}.  Use UTF-8 as the external encoding, and problems
+with users not being able to use their own language adequately are a
+thing of the past.
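+
+To make the UTF-8 scheme mentioned above more concrete, here is a
+minimal sketch of how a 31-bit value can be encoded by hand.  The
+function @code{encode_utf8} is not a library interface but a
+hypothetical helper used only for illustration; real programs would
+normally rely on the conversion functions described in the rest of this
+chapter.
+
+@smallexample
+/* @r{Encode the 31-bit value} wc @r{as UTF-8 into} buf @r{(at least
+   six bytes) and return the number of bytes used.} */
+static int
+encode_utf8 (unsigned int wc, unsigned char *buf)
+@{
+  static const unsigned int limits[] =
+    @{ 0x80, 0x800, 0x10000, 0x200000, 0x4000000 @};
+  static const unsigned char first[] =
+    @{ 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc @};
+  int nbytes = 1;
+  int i;
+
+  while (nbytes < 6 && wc >= limits[nbytes - 1])
+    ++nbytes;
+
+  for (i = nbytes - 1; i > 0; --i)
+    @{
+      /* @r{Trailing bytes have the form 10xxxxxx.} */
+      buf[i] = 0x80 | (wc & 0x3f);
+      wc >>= 6;
+    @}
+  /* @r{The leading byte encodes the total length of the sequence.} */
+  buf[0] = first[nbytes - 1] | wc;
+  return nbytes;
+@}
+@end smallexample
+
+For example, U+00E4 (lower-case a with diaeresis) becomes the two-byte
+sequence @code{0xc3 0xa4}, while ASCII characters keep their one-byte
+representation.
+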
+One final comment about the choice of the wide character representation
+is necessary at this point.  We have said above that the natural choice
+is using Unicode or @w{ISO 10646}.  This is not specified in any
+standard, though.  The @w{ISO C} standard does not say anything specific
+about the @code{wchar_t} type.  There might be systems where the
+developers decided differently.  Therefore one should avoid making
+assumptions about the wide character representation as much as
+possible, although GNU systems will always work as described above.  If
+the programmer uses only the functions provided by the C library to
+handle wide character strings there should not be any compatibility
+problems with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families for handling character set conversion.  One function family is
+specified in the @w{ISO C} standard and therefore is portable even
+beyond the Unix world.
+
+The most commonly known set of functions, coming from the @w{ISO C89}
+standard, is unfortunately the least useful one.  In fact, these
+functions should be avoided whenever possible, especially when
+developing libraries (as opposed to applications).
+
+The second family of functions was introduced in the early Unix
+standards (XPG2) and is still part of the latest and greatest Unix
+standard: @w{Unix 98}.  It is also the most powerful and useful set of
+functions.  But we will start with the functions defined in
+@w{Amendment 1} to @w{ISO C89}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings.  There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions.  Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument.  I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard which allow
+specifying a size, but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can be used very well
+in many contexts.  In graphical user interfaces, for instance, it is not
+uncommon to have functions which require text to be passed as a wide
+character string if it is not simple ASCII.  The text itself might come
+from a file with translations, and of course the user should decide
+about the current locale, which determines the translation and
+therefore also the external encoding used.  In such a situation (and
+many others) the functions described here are perfect.  If more freedom
+while performing the conversion is necessary, take a look at the
+@code{iconv} functions (@pxref{Generic Charset Conversion}).
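+
+The following fragment illustrates the first point above: the multibyte
+character set is taken from the current locale rather than passed as an
+argument.  It is only a sketch; @code{mbsrtowcs} and the initialization
+of the @code{mbstate_t} object are described later in this chapter, and
+the locale name used here is merely an example which need not be
+available on every system.
+
+@smallexample
+#include <locale.h>
+#include <stdio.h>
+#include <string.h>
+#include <wchar.h>
+
+int
+main (void)
+@{
+  /* @r{In an ISO-8859-1 locale the bytes} \374 @r{and} \337 @r{stand
+     for u-umlaut and sharp s.} */
+  const char *mbs = "Gr\374\337e";
+  wchar_t buf[64];
+  mbstate_t state;
+  size_t n;
+
+  /* @r{Select the external encoding via the locale; the name is just
+     an example.} */
+  if (setlocale (LC_CTYPE, "de_DE.ISO-8859-1") == NULL)
+    @{
+      fputs ("locale not available\n", stderr);
+      return 1;
+    @}
+
+  memset (&state, '\0', sizeof (state));
+  n = mbsrtowcs (buf, &mbs, sizeof (buf) / sizeof (buf[0]), &state);
+  if (n == (size_t) -1)
+    perror ("mbsrtowcs");
+  else
+    printf ("converted %lu wide characters\n", (unsigned long int) n);
+  return 0;
+@}
+@end smallexample
+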
+@menu
+* Selecting the Conversion::     Selecting the conversion and its
+                                  properties.
+* Keeping the state::            Representing the state of the conversion.
+* Converting a Character::       Converting Single Characters.
+* Converting Strings::           Converting Multibyte and Wide Character
+                                  Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides which conversion is performed by the
+functions we are about to describe.  Each locale uses its own character
+set (given as an argument to @code{localedef}) and this is the one
+assumed as the external multibyte encoding.  The wide character set is
+always UCS4.  Here we can already see where the limitations of these
+conversion functions lie.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes which can be necessary to represent one character.  This
+information is quite important when writing code which uses the
+conversion functions, as the examples below show.  The @w{ISO C}
+standard defines two macros which provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+This macro specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales.  It is
+a compile-time constant and it is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}.  Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant and,
+in fact, in the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C89} compilers
+do not allow variable length array definitions, but it is still
+desirable to avoid dynamic allocation.  This incomplete piece of code
+shows the problem:
+
+@smallexample
+@{
+  char buf[MB_LEN_MAX];
+  ssize_t len = 0;
+
+  while (! feof (fp))
+    @{
+      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+      /* @r{... process} buf@r{, setting} used @r{to the number of
+         bytes consumed.} */
+      len -= used;
+    @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to always have enough bytes in
+the array @var{buf} to convert one multibyte character.  The array
+@var{buf} has to be sized statically since many compilers do not allow
+a variable size.  The @code{fread} call makes sure that
+@code{MB_CUR_MAX} bytes are always available in @var{buf}.  Note that
+it is no problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding.  I.e., the encoded values depend in
+some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h}.  It was introduced in
+@w{Amendment 1} to @w{ISO C89}.
+@end deftp
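+
+Putting the @code{MB_CUR_MAX} handling shown above together with an
+@code{mbstate_t} object, a more complete version of the reading loop
+might look like the following sketch.  It is only an illustration and
+makes a few assumptions: the input comes from the stream @var{fp}, any
+conversion error simply terminates the loop, and the function
+@code{mbrtowc} used here is described later in this chapter.
+
+@smallexample
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <wchar.h>
+
+void
+process_file (FILE *fp)
+@{
+  char buf[MB_LEN_MAX];
+  size_t len = 0;
+  mbstate_t state;
+
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t wc;
+      size_t used;
+
+      /* @r{Refill the buffer so that up to} MB_CUR_MAX @r{bytes are
+         available.} */
+      len += fread (buf + len, 1, MB_CUR_MAX - len, fp);
+      if (len == 0)
+        break;
+
+      used = mbrtowc (&wc, buf, len, &state);
+      if (used == (size_t) -1 || used == (size_t) -2)
+        break;          /* @r{Invalid or incomplete sequence.} */
+      if (used == 0)
+        used = 1;       /* @r{A NUL byte was converted.} */
+
+      /* @r{... do something with} wc @r{...} */
+
+      memmove (buf, buf + used, len - used);
+      len -= used;
+    @}
+@}
+@end smallexample
+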
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions.  This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state.  The rule is that the object should always
+represent the initial state before the first use, and this is achieved
+by clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.} */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state.  This is necessary, for example, to decide whether or not to
+emit escape sequences to set the state to the initial state at certain
+sequence points.  Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by
+@var{ps} is in the initial state.  If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero.  Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C89} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.} */
+  ...
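+
+  /* @r{What follows is only a sketch of how such code typically
+     continues; it is an assumption for illustration, not prescribed
+     by the library.  Before finishing the output, test whether the
+     stream is back in the initial shift state and, if not, write the
+     closing shift sequence.  The function} wcrtomb @r{used here is
+     described later in this chapter; it stores the bytes needed to
+     return to the initial state, followed by a NUL byte.  The stream}
+     fp @r{is assumed to be the output stream being written.} */
+  if (! mbsinit (&state))
+    @{
+      char shift[MB_LEN_MAX];
+      size_t n = wcrtomb (shift, L'\0', &state);
+      if (n != (size_t) -1)
+        fwrite (shift, 1, n - 1, fp);   /* @r{omit the trailing NUL} */
+    @}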