Diffstat (limited to 'manual')
-rw-r--r--  manual/Makefile       |    4
-rw-r--r--  manual/chapters.texi  |    3
-rw-r--r--  manual/charset.texi   | 2846
-rw-r--r--  manual/ctype.texi     |  521
-rw-r--r--  manual/filesys.texi   |    4
-rw-r--r--  manual/intro.texi     |    2
-rw-r--r--  manual/lang.texi      |    2
-rw-r--r--  manual/libc.texinfo   |    4
-rw-r--r--  manual/locale.texi    |    6
-rw-r--r--  manual/memory.texi    |    3
-rw-r--r--  manual/stdio.texi     |    8
-rw-r--r--  manual/string.texi    |    2
-rw-r--r--  manual/texis          |    2
-rw-r--r--  manual/top-menu.texi  |   70
14 files changed, 3432 insertions, 45 deletions
diff --git a/manual/Makefile b/manual/Makefile
index e0dad4792c..8eb4d5b69e 100644
--- a/manual/Makefile
+++ b/manual/Makefile
@@ -49,7 +49,7 @@ endif
mkinstalldirs = $(..)scripts/mkinstalldirs
chapters = $(addsuffix .texi, \
- intro errno memory ctype string mbyte locale \
+ intro errno memory ctype string charset locale \
message search pattern io stdio llio filesys \
pipe socket terminal math arith time setjmp \
signal startup process job nss users sysinfo conf)
@@ -74,7 +74,7 @@ libc.dvi: texinfo.tex
# Generate the summary from the Texinfo source files for each chapter.
summary.texi: stamp-summary ;
stamp-summary: summary.awk $(filter-out summary.texi, $(texis))
- $(AWK) -f $^ | sort -t'^L' -df +0 -1 | tr '\014' '\012' > summary-tmp
+ $(AWK) -f $^ | sort -t' ' -df +0 -1 | tr '\014' '\012' > summary-tmp
$(move-if-change) summary-tmp summary.texi
touch $@
diff --git a/manual/chapters.texi b/manual/chapters.texi
index a5a8a57903..bf7c4c01e0 100644
--- a/manual/chapters.texi
+++ b/manual/chapters.texi
@@ -3,7 +3,7 @@
@include memory.texi
@include ctype.texi
@include string.texi
-@include mbyte.texi
+@include charset.texi
@include locale.texi
@include message.texi
@include search.texi
@@ -27,6 +27,7 @@
@include users.texi
@include sysinfo.texi
@include conf.texi
+@include ../crypt/crypt.texi
@include ../linuxthreads/linuxthreads.texi
@include lang.texi
@include header.texi
diff --git a/manual/charset.texi b/manual/charset.texi
new file mode 100644
index 0000000000..6179128e3c
--- /dev/null
+++ b/manual/charset.texi
@@ -0,0 +1,2846 @@
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computers had only six, seven,
+or eight bits for each character: never more bits than would fit into
+one byte, which nowadays is almost exclusively @w{8 bits} wide. This
+of course leads to problems once not all the characters needed at one
+time can be represented by the at most 256 available values. This
+chapter shows the functionality which was added to the @w{C library} to
+overcome this problem.
+
+@menu
+* Extended Char Intro:: Introduction to Extended Characters.
+* Charset Function Overview:: Overview of Character Handling
+ Functions.
+* Restartable multibyte conversion:: Restartable multibyte conversion
+ Functions.
+* Non-reentrant Conversion:: Non-reentrant Conversion Function.
+* Generic Charset Conversion:: Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+To overcome the limitations of character sets with a 1:1 relation
+between bytes and characters people came up with a variety of solutions.
+The remainder of this section gives a few examples to help understand
+the design decisions made while developing the functionality of the
+@w{C library} to support them.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation. @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through whatever communication channel is used.
+
+Traditionally there was no difference between the two representations.
+It was equally comfortable and useful to use the same one-byte
+representation internally and externally. This changes with more and
+larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling texts which were externally encoded using different character
+sets. Assume a program which reads two texts and compares them using
+some metric. The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are certainly
+not enough anymore. So the smallest entity will have to grow: @dfn{wide
+characters} will be used. Here, instead of one byte per character, two
+or four bytes are used (three bytes are awkward to address in memory
+and more than four bytes seem not to be necessary).
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+there exists a completely new family of functions which can handle
+texts of this kind in memory. The most commonly used character sets for
+such internal wide character representations are Unicode and
+@w{ISO 10646}. The former is a subset of the latter and is used when
+wide characters are chosen to be 2 bytes (@math{= 16} bits) wide. The
+standard names of the
+@cindex UCS2
+@cindex UCS4
+encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
+(@math{= 32} bits).
+
+To represent wide characters the @code{char} type is certainly not
+suitable. For this reason the @w{ISO C} standard introduces a new type
+which is designed to hold one character of a wide character string. To
+maintain the similarity there is also a type corresponding to @code{int}
+for those functions which take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings. The type is defined in @file{stddef.h}.
+
+The @w{ISO C89} standard, where this type was introduced, does not say
+anything specific about the representation. It only requires that this
+type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as
+@code{char}. This might make sense for embedded systems.
+
+But for GNU systems this type is always 32 bits wide. It is therefore
+capable of representing all UCS4 values and therefore covers all of
+@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16 bit type
+and thereby follow Unicode very strictly. This is perfectly fine with
+the standard but it also means that to represent all characters from
+Unicode and @w{ISO 10646} one has to use surrogate characters, which is
+in fact a multi-wide-character encoding. But this contradicts the
+purpose of the @code{wchar_t} type.
+@end deftp
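+
+Purely as a hedged illustration of how @code{wchar_t} is used, the
+following sketch declares a wide character string and counts its
+elements by hand; the helper name @code{wide_length} is made up for
+this example (the library itself provides an equivalent function,
+@code{wcslen}):
+
+@smallexample
+#include <stddef.h>
+
+/* @r{A wide character string literal; each element is a} wchar_t@r{,}
+   @r{on GNU systems a 32 bit UCS4 value.} */
+const wchar_t *greeting = L"hello";
+
+size_t
+wide_length (const wchar_t *ws)
+@{
+  size_t len = 0;
+  while (ws[len] != L'\0')
+    ++len;
+  return len;
+@}
+@end smallexample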
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables which
+contain a single wide character. As the name already suggests it is the
+equivalent of @code{int} when using normal @code{char} strings. The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if their size is 32 bits, but if @code{wchar_t} is
+defined as @code{char} the type @code{wint_t} must be defined as
+@code{int} due to parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in the second
+amendment to @w{ISO C89}.
+@end deftp
+
+As for the @code{char} data type, there also exist macros
+specifying the minimum and maximum value representable in an object of
+type @code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in the second amendment to @w{ISO C89}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in the second amendment to @w{ISO C89}.
+@end deftypevr
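+
+As a small, hedged illustration of how these limits might be used, the
+following sketch checks whether an arbitrary UCS4 value fits into an
+object of type @code{wchar_t} at all; on GNU systems, where
+@code{wchar_t} is 32 bits wide, this is always the case. The helper
+name is made up for this example:
+
+@smallexample
+#include <wchar.h>
+
+/* @r{Nonzero if the UCS4 value} ucs @r{can be stored in a} wchar_t@r{.} */
+int
+fits_in_wchar (unsigned long int ucs)
+@{
+  return ucs <= WCHAR_MAX;
+@}
+@end smallexample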
+
+Another special wide character value is the equivalent of @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
+
+@smallexample
+@{
+ int c;
+ ...
+ while ((c = getc (fp)) < 0)
+ ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to explicitly use @code{WEOF} when wide characters
+are used.
+
+@smallexample
+@{
+ wint_t c;
+ ...
+ while ((c = getwc (fp)) != WEOF)
+ ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in the second amendment to @w{ISO C89} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storing
+and transmitting them. Since a single wide character consists of more
+than one byte they are affected by byte ordering. I.e., machines with
+different endianness would see different values when accessing the same
+data. This also applies to communication protocols, which are all
+byte-based and therefore require the sender to decide about splitting
+the wide character into bytes. A last but not least important point is
+that wide characters often require more storage space than a customized
+byte-oriented character set.
+
+@cindex multibyte character
+This is why most of the time an external encoding which is different
+from the internal encoding is used if the latter is UCS2 or UCS4. The
+external encoding is byte-based and can be chosen appropriately for the
+environment and for the texts to be handled. There is a variety of
+different character sets which can be used, too many to be described
+completely here. We restrict ourselves to a description of the major
+groups. All of the ASCII-based character sets fulfill one requirement:
+they are ``filesystem safe''. This means that the character
+@code{'/'} is used in the encoding @emph{only} to represent itself.
+Things are a bit different for character sets like EBCDIC but if the
+operating system does not understand EBCDIC directly the parameters to
+system calls have to be converted first anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are one-byte character sets. There can be
+only up to 256 characters (for @w{8 bit} character sets) which is not
+sufficient to cover all languages but might be sufficient to handle a
+specific text. Another reason to choose such a character set can be
+constraints imposed by interaction with other programs.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte. This is achieved by associating a state with the text. Embedded
+in the text can be characters which can be used to change the state.
+Each byte in the text might have a different interpretation in each
+state. The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character. This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly. Examples of
+character sets using this policy are the various EUC character sets
+(used by Sun's operating systems: EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or SJIS (Shift JIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
+representing characters like the acute accent do not produce output on
+their own. One has to combine them with other characters. E.g., one
+writes the byte sequence @code{0xc2 0x61} (non-spacing acute accent,
+followed by lower-case `a') to get the ``small a with acute'' character.
+To get the acute accent character on its own one has to write
+@code{0xc2 0x20} (the non-spacing acute followed by a space).
+
+This type of character set is quite frequently used in embedded
+systems such as video text.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally
+it is often also sufficient to simply use an encoding different from
+UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8. This encoding is able to represent all of the @w{ISO
+10646} 31 bits in a byte string of length one to six (see the sketch
+after this list).
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7
+but UTF-8 is today the only encoding which should be used. In fact,
+UTF-8 will hopefully soon be the only external encoding which has to be
+supported. It proves to be universally usable and the only disadvantage
+is that it favors Latin languages very much by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary if using a specific character set for these scripts. But
+with methods like the Unicode compression scheme one can overcome these
+problems and the ever growing memory and storage capacities do the rest.
+@end itemize
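+
+To make the shape of the UTF-8 encoding just mentioned a bit more
+concrete, here is a hedged sketch of how a single 31 bit value could be
+encoded by hand. It only illustrates the bit layout described in the
+standards; it is not the implementation used in the @w{C library}, and
+the function name is made up:
+
+@smallexample
+#include <stddef.h>
+
+/* @r{Encode the 31 bit value} wc @r{into} buf @r{and return the number}
+   @r{of bytes used (1 to 6).} */
+size_t
+encode_utf8 (unsigned long int wc, unsigned char *buf)
+@{
+  size_t nbytes, cnt;
+
+  if (wc < 0x80)
+    @{
+      /* @r{ASCII values are represented by themselves.} */
+      buf[0] = (unsigned char) wc;
+      return 1;
+    @}
+
+  /* @r{Determine how many bytes are needed.} */
+  if (wc < 0x800)
+    nbytes = 2;
+  else if (wc < 0x10000)
+    nbytes = 3;
+  else if (wc < 0x200000)
+    nbytes = 4;
+  else if (wc < 0x4000000)
+    nbytes = 5;
+  else
+    nbytes = 6;
+
+  /* @r{Fill the continuation bytes from the end, six bits each.} */
+  for (cnt = nbytes - 1; cnt > 0; --cnt)
+    @{
+      buf[cnt] = 0x80 | (wc & 0x3f);
+      wc >>= 6;
+    @}
+
+  /* @r{The first byte encodes the length in its leading one bits.} */
+  buf[0] = (unsigned char) (0xff << (8 - nbytes)) | (unsigned char) wc;
+  return nbytes;
+@}
+@end smallexample
+
+For the euro sign, code position @code{0x20ac}, this yields the three
+bytes @code{0xe2 0x82 0xac}.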
+
+The question remaining now is: how to select the character set or
+encoding to use. The answer is mostly: you cannot decide about it
+yourself, it is decided by the developers of the system or the majority
+of the users. Since the goal is interoperability one has to use
+whatever the other people one works with use. If there are no
+constraints the selection is based on the requirements the expected
+circle of users will have. I.e., if a project is expected to only be
+used in, say, Russia it is fine to use KOI8-R or a similar character
+set. But if at the same time people from, say, Greece are participating
+one should use a character set which allows all people to collaborate.
+
+The general advice here is: go with the most general character set,
+namely @w{ISO 10646}. Use UTF-8 as the external encoding and problems
+about users not being able to use their own language adequately are a
+thing of the past.
+
+One final comment about the choice of the wide character representation
+is necessary at this point. We have said above that the natural choice
+is using Unicode or @w{ISO 10646}. This is not specified in any
+standard, though. The @w{ISO C} standard does not specify anything
+specific about the @code{wchar_t} type. There might be systems where
+the developers decided differently. Therefore one should as much as
+possible avoid making assumptions about the wide character representation
+although GNU systems will always work as described above. If the
+programmer uses only the functions provided by the C library to handle
+wide character strings there should not be any compatibility problems
+with other systems.
+
+@node Charset Function Overview
+@section Overview of Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families to handle character set conversion. One function family
+is specified in the @w{ISO C} standard and is therefore portable even
+beyond the Unix world.
+
+The most commonly known set of functions, coming from the @w{ISO C89}
+standard, is unfortunately the least useful one. In fact, these
+functions should be avoided whenever possible, especially when
+developing libraries (as opposed to applications).
+
+The second family of functions was introduced in the early Unix standards
+(XPG2) and is still part of the latest and greatest Unix standard:
+@w{Unix 98}. It is also the most powerful and useful set of functions.
+But we will start with the functions defined in the second amendment to
+@w{ISO C89}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings. There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions. Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument. I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard which allow
+specifying a size, but basically they also expect terminated strings
+(both points are illustrated by the sketch after this list).
+@end itemize
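+
+As a hedged illustration of both points, the following sketch selects a
+locale (the name used is only an example and need not be available on
+every system) and then converts one NUL terminated multibyte string
+into a wide character string with @code{mbsrtowcs}, a function
+described later in this chapter:
+
+@smallexample
+#include <locale.h>
+#include <string.h>
+#include <wchar.h>
+
+int
+main (void)
+@{
+  const char *mb = "some multibyte text";
+  const char *src = mb;
+  wchar_t buf[64];
+  mbstate_t state;
+
+  /* @r{The conversion is controlled by the} LC_CTYPE @r{category of}
+     @r{the selected locale.} */
+  setlocale (LC_ALL, "en_US.UTF-8");
+
+  memset (&state, '\0', sizeof (state));
+
+  /* @r{The input must be a NUL terminated string.} */
+  if (mbsrtowcs (buf, &src, sizeof (buf) / sizeof (buf[0]), &state)
+      == (size_t) -1)
+    /* @r{An invalid multibyte sequence was found.} */
+    return 1;
+
+  return 0;
+@}
+@end smallexample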
+
+Despite these limitations the @w{ISO C} functions can very well be used
+in many contexts. In graphical user interfaces, for instance, it is not
+uncommon to have functions which require text to be displayed in a wide
+character string if it is not simple ASCII. The text itself might come
+from a file with translations and of course the user should decide about
+the current locale which determines the translation and therefore also
+the external encoding used. In such a situation (and many others) the
+functions described here are perfect. If more freedom while performing
+the conversion is necessary take a look at the @code{iconv} functions
+(@pxref{Generic Charset Conversion}).
+
+@menu
+* Selecting the Conversion:: Selecting the conversion and its properties.
+* Keeping the state:: Representing the state of the conversion.
+* Converting a Character:: Converting Single Characters.
+* Converting Strings:: Converting Multibyte and Wide Character
+ Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides about the conversion which is performed
+by the functions we are about to describe. Each locale uses its own
+character set (given as an argument to @code{localedef}) and this is the
+one assumed as the external multibyte encoding. The wide character set
+is always UCS4. So we can see here already where the limitations of
+these conversion functions are.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes which can be necessary to represent one character. This
+information is quite important when writing code which uses the
+conversion functions. We will see this in the examples below.
+The @w{ISO C} standard defines two macros which provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+This macro specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales. It is
+a compile-time constant and it is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}. Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
+fact, in the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C89} compilers
+do not allow variable length array definitions, but still it is desirable
+to avoid dynamic allocation. This incomplete piece of code shows the
+problem:
+
+@smallexample
+@{
+ char buf[MB_LEN_MAX];
+ ssize_t len = 0;
+
+ while (! feof (fp))
+ @{
+ fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+ /* @r{... process} buf@r{, setting} used @r{to the number of bytes consumed} */
+ len -= used;
+ @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to always have enough bytes in
+the array @var{buf} to convert one multibyte character. The array
+@var{buf} has to be sized statically since many compilers do not allow a
+variable size. The @code{fread} call makes sure that @code{MB_CUR_MAX}
+bytes are always available in @var{buf}. Note that it isn't a
+problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding. I.e., the encoded values depend in
+some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h}. It was introduced in the second
+amendment to @w{ISO C89}.
+@end deftp
+
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions. This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rule is that the object should always
+represent the initial state before the first use, and this is achieved
+by clearing the whole variable with code such as the following:
+
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{from now on @var{state} can be used.} */
+ ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state. This is necessary, for example, to decide whether or not to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by @var{ps}
+is in the initial state or not. If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in the second amendment to @w{ISO C89} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+ mbstate_t state;
+ memset (&state, '\0', sizeof (state));
+ /* @r{Use @var{state}.} */
+ ...<