Diffstat (limited to 'manual')
-rw-r--r--   manual/Makefile          4
-rw-r--r--   manual/chapters.texi     3
-rw-r--r--   manual/charset.texi   2846
-rw-r--r--   manual/ctype.texi      521
-rw-r--r--   manual/filesys.texi      4
-rw-r--r--   manual/intro.texi        2
-rw-r--r--   manual/lang.texi         2
-rw-r--r--   manual/libc.texinfo      4
-rw-r--r--   manual/locale.texi       6
-rw-r--r--   manual/memory.texi       3
-rw-r--r--   manual/stdio.texi        8
-rw-r--r--   manual/string.texi       2
-rw-r--r--   manual/texis             2
-rw-r--r--   manual/top-menu.texi    70
14 files changed, 3432 insertions, 45 deletions
diff --git a/manual/Makefile b/manual/Makefile
index e0dad4792c..8eb4d5b69e 100644
--- a/manual/Makefile
+++ b/manual/Makefile
@@ -49,7 +49,7 @@ endif
 mkinstalldirs = $(..)scripts/mkinstalldirs
 
 chapters = $(addsuffix .texi, \
-	    intro errno memory ctype string mbyte locale \
+	    intro errno memory ctype string charset locale \
 	    message search pattern io stdio llio filesys \
 	    pipe socket terminal math arith time setjmp \
 	    signal startup process job nss users sysinfo conf)
@@ -74,7 +74,7 @@ libc.dvi: texinfo.tex
 
 # Generate the summary from the Texinfo source files for each chapter.
 summary.texi: stamp-summary ;
 stamp-summary: summary.awk $(filter-out summary.texi, $(texis))
-	$(AWK) -f $^ | sort -t'^L' -df +0 -1 | tr '\014' '\012' > summary-tmp
+	$(AWK) -f $^ | sort -t'' -df +0 -1 | tr '\014' '\012' > summary-tmp
 	$(move-if-change) summary-tmp summary.texi
 	touch $@
diff --git a/manual/chapters.texi b/manual/chapters.texi
index a5a8a57903..bf7c4c01e0 100644
--- a/manual/chapters.texi
+++ b/manual/chapters.texi
@@ -3,7 +3,7 @@
 @include memory.texi
 @include ctype.texi
 @include string.texi
-@include mbyte.texi
+@include charset.texi
 @include locale.texi
 @include message.texi
 @include search.texi
@@ -27,6 +27,7 @@
 @include users.texi
 @include sysinfo.texi
 @include conf.texi
+@include ../crypt/crypt.texi
 @include ../linuxthreads/linuxthreads.texi
 @include lang.texi
 @include header.texi
diff --git a/manual/charset.texi b/manual/charset.texi
new file mode 100644
index 0000000000..6179128e3c
--- /dev/null
+++ b/manual/charset.texi
@@ -0,0 +1,2846 @@
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computing had only six, seven,
+or eight bits for each character: in no case more bits than fit into
+one byte, which nowadays is almost exclusively @w{8 bits} wide.  This of
+course leads to problems once the characters needed at one time cannot
+all be represented by the at most 256 available values.  This chapter
+describes the functionality which was added to the @w{C library} to
+overcome this problem.
+
+@menu
+* Extended Char Intro::              Introduction to Extended Characters.
+* Charset Function Overview::        Overview about Character Handling
+                                      Functions.
+* Restartable multibyte conversion:: Restartable Multibyte Conversion
+                                      Functions.
+* Non-reentrant Conversion::         Non-reentrant Conversion Function.
+* Generic Charset Conversion::       Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+To overcome the limitations of character sets with a 1:1 relation
+between bytes and characters, people came up with a variety of
+solutions.  The remainder of this section gives a few examples to help
+in understanding the design decisions made while developing the
+functionality of the @w{C library} to support them.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation.  @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through any kind of communication channel.
+
+Traditionally there was no difference between the two representations.
+It was equally comfortable and useful to use the same one-byte
+representation internally and externally.
+This changes with more and
+larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling texts which were externally encoded using different character
+sets.  Assume a program which reads two texts and compares them using
+some metric.  The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are
+certainly not enough anymore.  So the smallest entity will have to grow:
+@dfn{wide characters} will be used.  Instead of one byte, two or four
+bytes are used per character (three bytes are awkward to address in
+memory and more than four bytes seem not to be necessary).
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+there exists a completely new family of functions which can handle texts
+of this kind in memory.  The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}.
+The former is a subset of the latter and is used when wide characters
+are chosen to be 2 bytes (@math{= 16} bits) wide.
+@cindex UCS2
+@cindex UCS4
+The standard names of the encodings used in these cases are UCS2
+(@math{= 16} bits) and UCS4 (@math{= 32} bits).
+
+To represent wide characters the @code{char} type is certainly not
+suitable.  For this reason the @w{ISO C} standard introduces a new type
+which is designed to hold one character of a wide character string.  To
+maintain the similarity there is also a type corresponding to @code{int}
+for those functions which take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings.  The type is defined in @file{stddef.h}.
+
+The @w{ISO C89} standard, where this type was introduced, does not say
+anything specific about the representation.  It only requires that this
+type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as
+@code{char}.  This might make sense for embedded systems.
+
+But for GNU systems this type is always 32 bits wide.  It is therefore
+capable of representing all UCS4 values and therefore covers all of
+@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16 bit type
+and thereby follow Unicode very strictly.  This is perfectly fine with
+the standard, but it also means that to represent all characters from
+Unicode and @w{ISO 10646} one has to use surrogate characters, which is
+in fact a multi-wide-character encoding.  But this contradicts the
+purpose of the @code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables which
+contain a single wide character.  As the name suggests it is the
+equivalent of @code{int} for the normal @code{char} strings.  The types
+@code{wchar_t} and @code{wint_t} often have the same representation if
+their size is 32 bits, but if @code{wchar_t} is defined as @code{char}
+the type @code{wint_t} must be defined as @code{int} because of the
+parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+@w{Amendment 1} to @w{ISO C89}.
+@end deftp
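+
+The following example may help to make the relation between these types
+concrete.  It is only a sketch, and it assumes a GNU system where
+@code{wchar_t} is 32 bits wide and wide characters are UCS4 values; on
+other systems the sizes and the values printed may differ.
+
+@smallexample
+#include <stdio.h>
+#include <wchar.h>
+
+int
+main (void)
+@{
+  /* @r{A wide character string; every element is one} wchar_t@r{.} */
+  const wchar_t *ws = L"abc";
+  const wchar_t *p;
+
+  /* @r{On GNU systems} wchar_t @r{is 32 bits wide, enough for UCS4.} */
+  printf ("sizeof (wchar_t) = %lu\n",
+          (unsigned long int) sizeof (wchar_t));
+
+  /* wint_t @r{can hold every} wchar_t @r{value and additionally} WEOF@r{.} */
+  for (p = ws; *p != L'\0'; ++p)
+    @{
+      wint_t wc = (wint_t) *p;
+      printf ("U+%04lX\n", (unsigned long int) wc);
+    @}
+  return 0;
+@}
+@end smallexample
+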
+As for the @code{char} data type, there also exist macros specifying the
+minimum and maximum value representable in an object of type
+@code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wint_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C89}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wint_t}.
+
+This macro was introduced in @w{Amendment 1} to @w{ISO C89}.
+@end deftypevr
+
+Another special wide character value is the equivalent of @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative.  I.e., sloppy code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) < 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to explicitly use @code{WEOF} when wide characters
+are used:
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in @w{Amendment 1} to @w{ISO C89} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to
+storing and transmitting them.  Since a single wide character consists
+of more than one byte, it is affected by byte ordering.  I.e., machines
+with different endianness would see different values when accessing the
+same data.  This also applies to communication protocols, which are all
+byte-based and therefore require the sender to decide how to split the
+wide character into bytes.  Last but not least, wide characters often
+require more storage space than a customized byte-oriented character
+set.
+
+@cindex multibyte character
+This is why most of the time an external encoding which is different
+from the internal encoding is used if the latter is UCS2 or UCS4.  The
+external encoding is byte-based and can be chosen appropriately for the
+environment and for the texts to be handled.  There exists a variety of
+character sets which can be used, too many to be covered completely
+here.  We restrict ourselves to a description of the major groups.  All
+of the ASCII-based character sets fulfill one requirement: they are
+``filesystem safe''.  This means that the character @code{'/'} is used
+in the encoding @emph{only} to represent itself.  Things are a bit
+different for character sets like EBCDIC, but if the operating system
+does not understand EBCDIC directly the parameters to system calls have
+to be converted first anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are one-byte character sets.  There can be
+only up to 256 characters (for @w{8 bit} character sets), which is not
+sufficient to cover all languages but might be sufficient to handle a
+specific text.  Another reason to choose a one-byte character set can be
+constraints from interaction with other programs.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte.
+This is achieved by associating a state with the text.  Characters
+which change the state can be embedded in the text.  Each byte in the
+text might have a different interpretation in each state.  The state
+might even influence whether a given byte stands for a character on its
+own or whether it has to be combined with some more bytes.
+
+@cindex EUC
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character.  This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly.  Examples
+of character sets using this policy are the various EUC character sets
+(used by Sun's operating systems: EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or SJIS (Shift JIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples of this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using
+the Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
+representing characters like the acute accent do not produce output on
+their own; one has to combine them with other characters.  E.g., one
+writes the byte sequence @code{0xc2 0x61} (non-spacing acute accent,
+followed by lower-case `a') to get the ``small a with acute''
+character.  To get the acute accent character on its own one has to
+write @code{0xc2 0x20} (the non-spacing acute followed by a space).
+
+This type of character set is quite frequently used in embedded systems
+such as videotext.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally
+it is often also sufficient to simply use an encoding different from
+UCS2/UCS4.  The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8.  This encoding is able to represent all 31 bits of
+@w{ISO 10646} in a byte string of length one to six (a sketch of the
+encoding scheme follows below).
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646}, such as UTF-7,
+but UTF-8 is today the only encoding which should be used.  In fact,
+UTF-8 will hopefully soon be the only external encoding which has to be
+supported.  It proves to be universally usable and its only disadvantage
+is that it favors Latin-based scripts by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than necessary compared with a character set specific to these scripts.
+But with methods like the Unicode compression scheme one can overcome
+these problems, and the ever growing memory and storage capacities do
+the rest.
+@end itemize
+
+The question remaining now is: how does one select the character set or
+encoding to use?  The answer is mostly: you cannot decide about it
+yourself, it is decided by the developers of the system or the majority
+of the users.  Since the goal is interoperability one has to use
+whatever the other people one works with use.  If there are no
+constraints, the selection is based on the requirements the expected
+circle of users will have.  I.e., if a project is expected to be used
+only in, say, Russia it is fine to use KOI8-R or a similar character
+set.  But if at the same time people from, say, Greece are
+participating one should use a character set which allows all people to
+collaborate.
+
+General advice here could be: go with the most general character set,
+namely @w{ISO 10646}.  Use UTF-8 as the external encoding, and problems
+with users not being able to use their own language adequately are a
+thing of the past.
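+
+To make the UTF-8 scheme mentioned above more concrete, here is a
+minimal sketch of how a 31-bit value can be encoded by hand.  The
+function @code{encode_utf8} is not a library interface but a
+hypothetical helper used only for illustration; real programs would
+normally rely on the conversion functions described in the rest of this
+chapter.
+
+@smallexample
+/* @r{Encode the 31-bit value} wc @r{as UTF-8 into} buf @r{(at least
+   six bytes) and return the number of bytes used.} */
+static int
+encode_utf8 (unsigned int wc, unsigned char *buf)
+@{
+  static const unsigned int limits[] =
+    @{ 0x80, 0x800, 0x10000, 0x200000, 0x4000000 @};
+  static const unsigned char first[] =
+    @{ 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc @};
+  int nbytes = 1;
+  int i;
+
+  while (nbytes < 6 && wc >= limits[nbytes - 1])
+    ++nbytes;
+
+  for (i = nbytes - 1; i > 0; --i)
+    @{
+      /* @r{Trailing bytes have the form 10xxxxxx.} */
+      buf[i] = 0x80 | (wc & 0x3f);
+      wc >>= 6;
+    @}
+  /* @r{The leading byte encodes the total length of the sequence.} */
+  buf[0] = first[nbytes - 1] | wc;
+  return nbytes;
+@}
+@end smallexample
+
+For example, U+00E4 (lower-case a with diaeresis) becomes the two-byte
+sequence @code{0xc3 0xa4}, while ASCII characters keep their one-byte
+representation.
+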
+One final comment about the choice of the wide character representation
+is necessary at this point.  We have said above that the natural choice
+is using Unicode or @w{ISO 10646}.  This is not specified in any
+standard, though.  The @w{ISO C} standard does not say anything specific
+about the @code{wchar_t} type.  There might be systems where the
+developers decided differently.  Therefore one should avoid making
+assumptions about the wide character representation as much as
+possible, although GNU systems will always work as described above.  If
+the programmer uses only the functions provided by the C library to
+handle wide character strings there should not be any compatibility
+problems with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families for handling character set conversion.  One function family is
+specified in the @w{ISO C} standard and therefore is portable even
+beyond the Unix world.
+
+The most commonly known set of functions, coming from the @w{ISO C89}
+standard, is unfortunately the least useful one.  In fact, these
+functions should be avoided whenever possible, especially when
+developing libraries (as opposed to applications).
+
+The second family of functions was introduced in the early Unix
+standards (XPG2) and is still part of the latest and greatest Unix
+standard: @w{Unix 98}.  It is also the most powerful and useful set of
+functions.  But we will start with the functions defined in
+@w{Amendment 1} to @w{ISO C89}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings.  There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions.  Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument.  I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard which allow
+specifying a size, but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can be used very well
+in many contexts.  In graphical user interfaces, for instance, it is not
+uncommon to have functions which require text to be passed as a wide
+character string if it is not simple ASCII.  The text itself might come
+from a file with translations, and of course the user should decide
+about the current locale, which determines the translation and
+therefore also the external encoding used.  In such a situation (and
+many others) the functions described here are perfect.  If more freedom
+while performing the conversion is necessary, take a look at the
+@code{iconv} functions (@pxref{Generic Charset Conversion}).
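+
+The following fragment illustrates the first point above: the multibyte
+character set is taken from the current locale rather than passed as an
+argument.  It is only a sketch; @code{mbsrtowcs} and the initialization
+of the @code{mbstate_t} object are described later in this chapter, and
+the locale name used here is merely an example which need not be
+available on every system.
+
+@smallexample
+#include <locale.h>
+#include <stdio.h>
+#include <string.h>
+#include <wchar.h>
+
+int
+main (void)
+@{
+  /* @r{In an ISO-8859-1 locale the bytes} \374 @r{and} \337 @r{stand
+     for u-umlaut and sharp s.} */
+  const char *mbs = "Gr\374\337e";
+  wchar_t buf[64];
+  mbstate_t state;
+  size_t n;
+
+  /* @r{Select the external encoding via the locale; the name is just
+     an example.} */
+  if (setlocale (LC_CTYPE, "de_DE.ISO-8859-1") == NULL)
+    @{
+      fputs ("locale not available\n", stderr);
+      return 1;
+    @}
+
+  memset (&state, '\0', sizeof (state));
+  n = mbsrtowcs (buf, &mbs, sizeof (buf) / sizeof (buf[0]), &state);
+  if (n == (size_t) -1)
+    perror ("mbsrtowcs");
+  else
+    printf ("converted %lu wide characters\n", (unsigned long int) n);
+  return 0;
+@}
+@end smallexample
+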
+@menu
+* Selecting the Conversion::     Selecting the conversion and its
+                                  properties.
+* Keeping the state::            Representing the state of the conversion.
+* Converting a Character::       Converting Single Characters.
+* Converting Strings::           Converting Multibyte and Wide Character
+                                  Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides which conversion is performed by the
+functions we are about to describe.  Each locale uses its own character
+set (given as an argument to @code{localedef}) and this is the one
+assumed as the external multibyte encoding.  The wide character set is
+always UCS4.  Here we can already see where the limitations of these
+conversion functions lie.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes which can be necessary to represent one character.  This
+information is quite important when writing code which uses the
+conversion functions, as the examples below show.  The @w{ISO C}
+standard defines two macros which provide this information.
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+This macro specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales.  It is
+a compile-time constant and it is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}.  Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant and,
+in fact, in the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C89} compilers
+do not allow variable length array definitions, but it is still
+desirable to avoid dynamic allocation.  This incomplete piece of code
+shows the problem:
+
+@smallexample
+@{
+  char buf[MB_LEN_MAX];
+  ssize_t len = 0;
+
+  while (! feof (fp))
+    @{
+      fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+      /* @r{... process} buf@r{, setting} used @r{to the number of
+         bytes consumed.} */
+      len -= used;
+    @}
+@}
+@end smallexample
+
+The code in the inner loop is expected to always have enough bytes in
+the array @var{buf} to convert one multibyte character.  The array
+@var{buf} has to be sized statically since many compilers do not allow
+a variable size.  The @code{fread} call makes sure that
+@code{MB_CUR_MAX} bytes are always available in @var{buf}.  Note that
+it is no problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding.  I.e., the encoded values depend in
+some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h}.  It was introduced in
+@w{Amendment 1} to @w{ISO C89}.
+@end deftp
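+
+Putting the @code{MB_CUR_MAX} handling shown above together with an
+@code{mbstate_t} object, a more complete version of the reading loop
+might look like the following sketch.  It is only an illustration and
+makes a few assumptions: the input comes from the stream @var{fp}, any
+conversion error simply terminates the loop, and the function
+@code{mbrtowc} used here is described later in this chapter.
+
+@smallexample
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <wchar.h>
+
+void
+process_file (FILE *fp)
+@{
+  char buf[MB_LEN_MAX];
+  size_t len = 0;
+  mbstate_t state;
+
+  memset (&state, '\0', sizeof (state));
+  while (1)
+    @{
+      wchar_t wc;
+      size_t used;
+
+      /* @r{Refill the buffer so that up to} MB_CUR_MAX @r{bytes are
+         available.} */
+      len += fread (buf + len, 1, MB_CUR_MAX - len, fp);
+      if (len == 0)
+        break;
+
+      used = mbrtowc (&wc, buf, len, &state);
+      if (used == (size_t) -1 || used == (size_t) -2)
+        break;          /* @r{Invalid or incomplete sequence.} */
+      if (used == 0)
+        used = 1;       /* @r{A NUL byte was converted.} */
+
+      /* @r{... do something with} wc @r{...} */
+
+      memmove (buf, buf + used, len - used);
+      len -= used;
+    @}
+@}
+@end smallexample
+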
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions.  This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state.  The rule is that the object should always
+represent the initial state before the first use, and this is achieved
+by clearing the whole variable with code such as follows:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.} */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state.  This is necessary, for example, to decide whether or not to
+emit escape sequences to set the state to the initial state at certain
+sequence points.  Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by
+@var{ps} is in the initial state.  If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero.  Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in @w{Amendment 1} to @w{ISO C89} and is
+declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.} */
+  ...
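+
+  /* @r{What follows is only a sketch of how such code typically
+     continues; it is an assumption for illustration, not prescribed
+     by the library.  Before finishing the output, test whether the
+     stream is back in the initial shift state and, if not, write the
+     closing shift sequence.  The function} wcrtomb @r{used here is
+     described later in this chapter; it stores the bytes needed to
+     return to the initial state, followed by a NUL byte.  The stream}
+     fp @r{is assumed to be the output stream being written.} */
+  if (! mbsinit (&state))
+    @{
+      char shift[MB_LEN_MAX];
+      size_t n = wcrtomb (shift, L'\0', &state);
+      if (n != (size_t) -1)
+        fwrite (shift, 1, n - 1, fp);   /* @r{omit the trailing NUL} */
+    @}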