| author | Ulrich Drepper <drepper@redhat.com> | 2001-02-11 01:50:24 +0000 |
|---|---|---|
| committer | Ulrich Drepper <drepper@redhat.com> | 2001-02-11 01:50:24 +0000 |
| commit | 8a2f1f5b5f7cdfcaf465415736a75a582bc5562a (patch) | |
| tree | a94b4ba60204958bfa7296bb9f587b1936c6e116 /manual/string.texi | |
| parent | 2cca386760af59e3040ca3d41cff6c2bf890e041 (diff) | |
Document wide character string functions.
Diffstat (limited to 'manual/string.texi')
| -rw-r--r-- | manual/string.texi | 905 |
1 files changed, 794 insertions, 111 deletions
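The wide character functions documented by this change take their size arguments in wide characters rather than bytes, and only `wmemmove` is specified for overlapping blocks. A minimal sketch of both points follows; the helper `prepend_wide` and its parameters are hypothetical names chosen for illustration and are not part of glibc or of this commit, and the sketch assumes the prefix does not itself point into the buffer.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of glibc: insert the wide string PREFIX
   in front of the wide string already stored in BUF, which has room for
   BUFLEN wide characters.  Returns BUF, or NULL (leaving BUF untouched)
   if the result would not fit.  Assumes PREFIX does not overlap BUF.  */
wchar_t *
prepend_wide (wchar_t *buf, size_t buflen, const wchar_t *prefix)
{
  size_t oldlen = wcslen (buf);     /* counted in wide characters */
  size_t prelen = wcslen (prefix);

  /* Need room for oldlen + prelen wide characters plus the L'\0'.  */
  if (prelen >= buflen || oldlen >= buflen - prelen)
    return NULL;

  /* Source and destination overlap, so wmemmove is required here;
     the + 1 carries the terminating null wide character along.  */
  wmemmove (buf + prelen, buf, oldlen + 1);

  /* PREFIX lies outside BUF, so the non-overlapping wmemcpy suffices.  */
  wmemcpy (buf, prefix, prelen);
  return buf;
}
```

With a buffer declared as `wchar_t buf[16] = L"wide";`, calling `prepend_wide (buf, 16, L"> ")` leaves `L"> wide"` in `buf`. Passing byte counts such as `sizeof buf` instead of the element count would overstate the available space on any platform where `wchar_t` is wider than one byte.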
diff --git a/manual/string.texi b/manual/string.texi index e2f51869ef..6953023d6a 100644 --- a/manual/string.texi +++ b/manual/string.texi @@ -49,7 +49,7 @@ and some common pitfalls. If you are already familiar with this material, you can skip this section. @cindex string -@cindex null character +@cindex multibyte character string A @dfn{string} is an array of @code{char} objects. But string-valued variables are usually declared to be pointers of type @code{char *}. Such variables do not include space for the text of a string; that has @@ -60,21 +60,34 @@ variable. Alternatively you can store a @dfn{null pointer} in the pointer variable. The null pointer does not point anywhere, so attempting to reference the string it points to gets an error. +@cindex wide character string +``string'' normally refers to multibyte character strings as opposed to +wide character strings. Wide character strings are arrays of type +@code{wchar_t} and as for multibyte character strings usually pointers +of type @code{wchar_t *} are used. + +@cindex null character +@cindex null wide character By convention, a @dfn{null character}, @code{'\0'}, marks the end of a -string. For example, in testing to see whether the @code{char *} -variable @var{p} points to a null character marking the end of a string, -you can write @code{!*@var{p}} or @code{*@var{p} == '\0'}. +multibyte character string and the @dfn{null wide character}, +@code{L'\0'}, marks the end of a wide character string. For example, in +testing to see whether the @code{char *} variable @var{p} points to a +null character marking the end of a string, you can write +@code{!*@var{p}} or @code{*@var{p} == '\0'}. A null character is quite different conceptually from a null pointer, although both are represented by the integer @code{0}. @cindex string literal @dfn{String literals} appear in C program source as strings of -characters between double-quote characters (@samp{"}). 
In @w{ISO C}, -string literals can also be formed by @dfn{string concatenation}: -@code{"a" "b"} is the same as @code{"ab"}. Modification of string -literals is not allowed by the GNU C compiler, because literals -are placed in read-only storage. +characters between double-quote characters (@samp{"}); in wide character +string literals the initial double-quote character is immediately +preceded by a capital @samp{L} (ell) character (as in @code{L"foo"}). In @w{ISO C}, string literals +can also be formed by @dfn{string concatenation}: @code{"a" "b"} is the +same as @code{"ab"}. For wide character strings one can either use +@code{L"a" L"b"} or @code{L"a" "b"}. Modification of string literals is +not allowed by the GNU C compiler, because literals are placed in +read-only storage. Character arrays that are declared @code{const} cannot be modified either. It's generally good style to declare non-modifiable string @@ -104,38 +117,97 @@ checks for overflowing the array. Many of the library functions an extra byte to hold the null character that marks the end of the string. +@cindex single-byte string +@cindex multibyte string +Originally strings were sequences of bytes where each byte represents a +single character. This is still true today if the strings are encoded +using a single-byte character encoding. Things are different if the +strings are encoded using a multibyte encoding (for more information on +encodings see @ref{Extended Char Intro}). There is no difference in +the programming interface for these two kinds of strings; the programmer +has to be aware of this and interpret the byte sequences accordingly. + +But since there is no separate interface taking care of these +differences the byte-based string functions are sometimes hard to use. +Since the count parameters of these functions specify bytes, a call to +@code{strncpy} could cut a multibyte character in the middle and put an +incomplete (and therefore unusable) byte sequence in the target buffer. 
+ +@cindex wide character string +To avoid these problems later versions of the @w{ISO C} standard +introduce a second set of functions which operate on @dfn{wide +characters} (@pxref{Extended Char Intro}). These functions don't have +the problems the single-byte versions have since every wide character is +a legal, interpretable value. This does not mean that cutting wide +character strings at arbitrary points is without problems. It normally +is for alphabet-based languages (except for non-normalized text) but +languages based on syllables still have the problem that more than one +wide character is necessary to complete a logical unit. This is a +higher level problem which the @w{C library} functions are not designed +to solve. But it is at least good that no invalid byte sequences can be +created. Also, higher level functions can operate much more easily +on wide characters than on multibyte characters, so the general advice +is to use wide characters internally whenever text is more than simply +copied. + +The remainder of this chapter discusses the functions for handling +wide character strings in parallel with the discussion of the multibyte +character strings since there is almost always an exact equivalent +available. + @node String/Array Conventions @section String and Array Conventions This chapter describes both functions that work on arbitrary arrays or blocks of memory, and functions that are specific to null-terminated -arrays of characters. +arrays of characters and wide characters. Functions that operate on arbitrary blocks of memory have names -beginning with @samp{mem} (such as @code{memcpy}) and invariably take an -argument which specifies the size (in bytes) of the block of memory to +beginning with @samp{mem} and @samp{wmem} (such as @code{memcpy} and +@code{wmemcpy}) and invariably take an argument which specifies the size +(in bytes and wide characters respectively) of the block of memory to operate on. 
The array arguments and return values for these functions -have type @code{void *}, and as a matter of style, the elements of these -arrays are referred to as ``bytes''. You can pass any kind of pointer -to these functions, and the @code{sizeof} operator is useful in -computing the value for the size argument. - -In contrast, functions that operate specifically on strings have names -beginning with @samp{str} (such as @code{strcpy}) and look for a null -character to terminate the string instead of requiring an explicit size -argument to be passed. (Some of these functions accept a specified +have type @code{void *} or @code{wchar_t *}. As a matter of style, the +elements of the arrays used with the @samp{mem} functions are referred +to as ``bytes''. You can pass any kind of pointer to these functions, +and the @code{sizeof} operator is useful in computing the value for the +size argument. Parameters to the @samp{wmem} functions must be of type +@code{wchar_t *}. These functions are not really usable with anything +but arrays of this type. + +In contrast, functions that operate specifically on strings and wide +character strings have names beginning with @samp{str} and @samp{wcs} +respectively (such as @code{strcpy} and @code{wcscpy}) and look for a +null character to terminate the string instead of requiring an explicit +size argument to be passed. (Some of these functions accept a specified maximum length, but they also check for premature termination with a null character.) The array arguments and return values for these -functions have type @code{char *}, and the array elements are referred -to as ``characters''. - -In many cases, there are both @samp{mem} and @samp{str} versions of a -function. The one that is more appropriate to use depends on the exact -situation. When your program is manipulating arbitrary arrays or blocks of -storage, then you should always use the @samp{mem} functions. 
On the -other hand, when you are manipulating null-terminated strings it is -usually more convenient to use the @samp{str} functions, unless you -already know the length of the string in advance. +functions have type @code{char *} and @code{wchar_t *} respectively, and +the array elements are referred to as ``characters'' and ``wide +characters''. + +In many cases, there are both @samp{mem} and @samp{str}/@samp{wcs} +versions of a function. The one that is more appropriate to use depends +on the exact situation. When your program is manipulating arbitrary +arrays or blocks of storage, then you should always use the @samp{mem} +functions. On the other hand, when you are manipulating null-terminated +strings it is usually more convenient to use the @samp{str}/@samp{wcs} +functions, unless you already know the length of the string in advance. +The @samp{wmem} functions should be used for wide character arrays with +known size. + +@cindex wint_t +@cindex parameter promotion +Some of the memory and string functions take single characters as +arguments. Since a value of type @code{char} is automatically promoted +to a value of type @code{int} when used as a parameter, the functions +are declared with @code{int} as the type of the parameter in question. +In the case of the wide character functions the situation is similar: the +parameter type for a single wide character is @code{wint_t} and not +@code{wchar_t}. For many implementations this would not be necessary +since @code{wchar_t} is large enough to not be automatically +promoted, but since the @w{ISO C} standard does not require such a +choice of types the @code{wint_t} type is used. @node String Length @section String Length @@ -148,8 +220,8 @@ This function is declared in the header file @file{string.h}. @comment ISO @deftypefun size_t strlen (const char *@var{s}) The @code{strlen} function returns the length of the null-terminated -string @var{s}. 
(In other words, it returns the offset of the terminating -null character within the array.) +string @var{s} in bytes. (In other words, it returns the offset of the +terminating null character within the array.) For example, @smallexample @@ -185,16 +257,55 @@ sizeof (ptr) This is an easy mistake to make when you are working with functions that take string arguments; those arguments are always pointers, not arrays. +It must also be noted that for multibyte encoded strings the return +value does not have to correspond to the number of characters in the +string. To get this value the string can be converted to wide +characters and @code{wcslen} can be used or something like the following +code can be used: + +@smallexample +/* @r{The input is in @code{string}.} + @r{The length is expected in @code{n}.} */ +@{ + mbstate_t t; + const char *scopy = string; + /* In initial state. */ + memset (&t, '\0', sizeof (t)); + /* Determine number of characters. */ + n = mbsrtowcs (NULL, &scopy, strlen (scopy), &t); +@} +@end smallexample + +This is cumbersome to do, so if the number of characters (as opposed to +bytes) is needed often, it is better to work with wide characters. +@end deftypefun + +The wide character equivalent is declared in @file{wchar.h}. + +@comment wchar.h +@comment ISO +@deftypefun size_t wcslen (const wchar_t *@var{ws}) +The @code{wcslen} function is the wide character equivalent to +@code{strlen}. The return value is the number of wide characters in the +wide character string pointed to by @var{ws} (this is also the offset of +the terminating null wide character of @var{ws}). + +Since there are no multi wide character sequences making up one +character, the return value is not only the offset in the array, it is +also the number of wide characters. + +This function was introduced in @w{Amendment 1} to @w{ISO C90}. 
@end deftypefun @comment string.h @comment GNU @deftypefun size_t strnlen (const char *@var{s}, size_t @var{maxlen}) -The @code{strnlen} function returns the length of the null-terminated -string @var{s} is this length is smaller than @var{maxlen}. Otherwise -it returns @var{maxlen}. Therefore this function is equivalent to +The @code{strnlen} function returns the length of the string @var{s} in +bytes if this length is smaller than @var{maxlen} bytes. Otherwise it +returns @var{maxlen}. Therefore this function is equivalent to @code{(strlen (@var{s}) < n ? strlen (@var{s}) : @var{maxlen})} but it -is more efficient. +is more efficient and works even if the string @var{s} is not +null-terminated. @smallexample char string[32] = "hello, world"; @@ -204,7 +315,16 @@ strnlen (string, 5) @result{} 5 @end smallexample -This function is a GNU extension. +This function is a GNU extension and is declared in @file{string.h}. +@end deftypefun + +@comment wchar.h +@comment GNU +@deftypefun size_t wcsnlen (const wchar_t *@var{ws}, size_t @var{maxlen}) +@code{wcsnlen} is the wide character equivalent to @code{strnlen}. The +@var{maxlen} parameter specifies the maximum number of wide characters. + +This function is a GNU extension and is declared in @file{wchar.h}. @end deftypefun @node Copying and Concatenation @@ -212,9 +332,11 @@ This function is a GNU extension. You can use the functions described in this section to copy the contents of strings and arrays, or to append the contents of one string to -another. These functions are declared in the header file -@file{string.h}. +another. The @samp{str} and @samp{mem} functions are declared in the +header file @file{string.h} while the @samp{wcs} and @samp{wmem} +functions are declared in the file @file{wchar.h}. @pindex string.h +@pindex wchar.h @cindex copying strings and arrays @cindex string copy functions @cindex array copy functions @@ -243,7 +365,7 @@ Functions}). 
@comment string.h @comment ISO -@deftypefun {void *} memcpy (void *@var{to}, const void *@var{from}, size_t @var{size}) +@deftypefun {void *} memcpy (void *restrict @var{to}, const void *restrict @var{from}, size_t @var{size}) The @code{memcpy} function copies @var{size} bytes from the object beginning at @var{from} into the object beginning at @var{to}. The behavior of this function is undefined if the two arrays @var{to} and @@ -262,9 +384,34 @@ memcpy (new, old, arraysize * sizeof (struct foo)); @end smallexample @end deftypefun +@comment wchar.h +@comment ISO +@deftypefun {wchar_t *} wmemcpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}, size_t @var{size}) +The @code{wmemcpy} function copies @var{size} wide characters from the object +beginning at @var{wfrom} into the object beginning at @var{wto}. The +behavior of this function is undefined if the two arrays @var{wto} and +@var{wfrom} overlap; use @code{wmemmove} instead if overlapping is possible. + +The following is a possible implementation of @code{wmemcpy} but +further optimizations are possible. + +@smallexample +wchar_t * +wmemcpy (wchar_t *restrict wto, const wchar_t *restrict wfrom, + size_t size) +@{ + return (wchar_t *) memcpy (wto, wfrom, size * sizeof (wchar_t)); +@} +@end smallexample + +The value returned by @code{wmemcpy} is the value of @var{wto}. + +This function was introduced in @w{Amendment 1} to @w{ISO C90}. +@end deftypefun + @comment string.h @comment GNU -@deftypefun {void *} mempcpy (void *@var{to}, const void *@var{from}, size_t @var{size}) +@deftypefun {void *} mempcpy (void *restrict @var{to}, const void *restrict @var{from}, size_t @var{size}) The @code{mempcpy} function is nearly identical to the @code{memcpy} function. It copies @var{size} bytes from the object beginning at @code{from} into the object pointed to by @var{to}. 
But instead of @@ -289,6 +436,34 @@ combine (void *o1, size_t s1, void *o2, size_t s2) This function is a GNU extension. 
@end deftypefun +@comment wchar.h +@comment GNU +@deftypefun {wchar_t *} wmempcpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}, size_t @var{size}) +The @code{wmempcpy} function is nearly identical to the @code{wmemcpy} +function. It copies @var{size} wide characters from the object +beginning at @var{wfrom} into the object pointed to by @var{wto}. But +instead of returning the value of @var{wto} it returns a pointer to the +wide character following the last written wide character in the object +beginning at @var{wto}. I.e., the value is @code{@var{wto} + @var{size}}. + +This function is useful in situations where a number of objects shall be +copied to consecutive memory positions. + +The following is a possible implementation of @code{wmempcpy} but +further optimizations are possible. + +@smallexample +wchar_t * +wmempcpy (wchar_t *restrict wto, const wchar_t *restrict wfrom, + size_t size) +@{ + return (wchar_t *) mempcpy (wto, wfrom, size * sizeof (wchar_t)); +@} +@end smallexample + +This function is a GNU extension. +@end deftypefun + @comment string.h @comment ISO @deftypefun {void *} memmove (void *@var{to}, const void *@var{from}, size_t @var{size}) @@ -297,11 +472,40 @@ This function is a GNU extension. overlap. In the case of overlap, @code{memmove} is careful to copy the original values of the bytes in the block at @var{from}, including those bytes which also belong to the block at @var{to}. + +The value returned by @code{memmove} is the value of @var{to}. +@end deftypefun + +@comment wchar.h +@comment ISO +@deftypefun {wchar_t *} wmemmove (wchar_t *@var{wto}, const wchar_t *@var{wfrom}, size_t @var{size}) +@code{wmemmove} copies the @var{size} wide characters at @var{wfrom} +into the @var{size} wide characters at @var{wto}, even if those two +blocks of space overlap. 
In the case of overlap, @code{wmemmove} is +careful to copy the original values of the wide characters in the block +at @var{wfrom}, including those wide characters which also belong to the +block at @var{wto}. + +The following is a possible implementation of @code{wmemmove} but +further optimizations are possible. + +@smallexample +wchar_t * +wmemmove (wchar_t *wto, const wchar_t *wfrom, + size_t size) +@{ + return (wchar_t *) memmove (wto, wfrom, size * sizeof (wchar_t)); +@} +@end smallexample + +The value returned by @code{wmemmove} is the value of @var{wto}. + +This function was introduced in @w{Amendment 1} to @w{ISO C90}. @end deftypefun @comment string.h @comment SVID -@deftypefun {void *} memccpy (void *@var{to}, const void *@var{from}, int @var{c}, size_t @var{size}) +@deftypefun {void *} memccpy (void *restrict @var{to}, const void *restrict @var{from}, int @var{c}, size_t @var{size}) This function copies no more than @var{size} bytes from @var{from} to @var{to}, stopping if a byte matching @var{c} is found. The return value is a pointer into @var{to} one byte past where @var{c} was copied, @@ -317,18 +521,35 @@ This function copies the value of @var{c} (converted to an object beginning at @var{block}. It returns the value of @var{block}. @end deftypefun +@comment wchar.h +@comment ISO +@deftypefun {wchar_t *} wmemset (wchar_t *@var{block}, wchar_t @var{wc}, size_t @var{size}) +This function copies the value of @var{wc} into each of the first +@var{size} wide characters of the object beginning at @var{block}. It +returns the value of @var{block}. +@end deftypefun + @comment string.h @comment ISO -@deftypefun {char *} strcpy (char *@var{to}, const char *@var{from}) +@deftypefun {char *} strcpy (char *restrict @var{to}, const char *restrict @var{from}) This copies characters from the string @var{from} (up to and including the terminating null character) into the string @var{to}. 
Like @code{memcpy}, this function has undefined results if the strings overlap. 
The return value is the value of @var{to}. @end deftypefun +@comment wchar.h +@comment ISO +@deftypefun {wchar_t *} wcscpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}) +This copies wide characters from the string @var{wfrom} (up to and +including the terminating null wide character) into the string +@var{wto}. Like @code{wmemcpy}, this function has undefined results if +the strings overlap. The return value is the value of @var{wto}. +@end deftypefun + @comment string.h @comment ISO -@deftypefun {char *} strncpy (char *@var{to}, const char *@var{from}, size_t @var{size}) +@deftypefun {char *} strncpy (char *restrict @var{to}, const char *restrict @var{from}, size_t @var{size}) This function is similar to @code{strcpy} but always copies exactly @var{size} characters into @var{to}. @@ -351,6 +572,32 @@ In this case, @var{size} may be large, and when it is, @code{strncpy} will waste a considerable amount of time copying null characters. @end deftypefun +@comment wchar.h +@comment ISO +@deftypefun {wchar_t *} wcsncpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}, size_t @var{size}) +This function is similar to @code{wcscpy} but always copies exactly +@var{size} wide characters into @var{wto}. + +If the length of @var{wfrom} is more than @var{size}, then +@code{wcsncpy} copies just the first @var{size} wide characters. Note +that in this case there is no null terminator written into @var{wto}. + +If the length of @var{wfrom} is less than @var{size}, then +@code{wcsncpy} copies all of @var{wfrom}, followed by enough null wide +characters to add up to @var{size} wide characters in all. This +behavior is rarely useful, but it is specified by the @w{ISO C} +standard. + +The behavior of @code{wcsncpy} is undefined if the strings overlap. + +Using @code{wcsncpy} as opposed to @code{wcscpy} is a way to avoid bugs +relating to writing past the end of the allocated space for @var{wto}. 
+However, it can also make your program much slower in one common case: +copying a string which is probably small into a potentially large buffer. +In this case, @var{size} may be large, and when it is, @code{wcsncpy} will +waste a considerable amount of time copying null wide characters. +@end deftypefun + @comment string.h @comment SVID @deftypefun {char *} strdup (const char *@var{s}) @@ -361,6 +608,19 @@ for the new string, @code{strdup} returns a null pointer. Otherwise it returns a pointer to the new string. @end deftypefun +@comment wchar.h +@comment GNU +@deftypefun {wchar_t *} wcsdup (const wchar_t *@var{ws}) +This function copies the null-terminated wide character string @var{ws} +into a newly allocated string. The string is allocated using +@code{malloc}; see @ref{Unconstrained Allocation}. If @code{malloc} +cannot allocate space for the new string, @code{wcsdup} returns a null +pointer. Otherwise it returns a pointer to the new wide character +string. + +This function is a GNU extension. +@end deftypefun + @comment string.h @comment GNU @deftypefun {char *} strndup (const char *@var{s}, size_t @var{size}) @@ -380,10 +640,10 @@ terminates the destination string. @comment string.h @comment Unknown origin -@deftypefun {char *} stpcpy (char *@var{to}, const char *@var{from}) +@deftypefun {char *} stpcpy (char *restrict @var{to}, const char *restrict @var{from}) This function is like @code{strcpy}, except that it returns a pointer to the end of the string @var{to} (that is, the address of the terminating -null character) rather than the beginning. +null character @code{to + strlen (from)}) rather than the beginning. For example, this program uses @code{stpcpy} to concatenate @samp{foo} and @samp{bar} to produce @samp{foobar}, which it then prints. @@ -396,12 +656,28 @@ This function is not part of the ISO or POSIX standards, and is not customary on Unix systems, but we did not invent it either. Perhaps it comes from MS-DOG. 
-Its behavior is undefined if the strings overlap. +Its behavior is undefined if the strings overlap. The function is +declared in @file{string.h}. +@end deftypefun + +@comment wchar.h +@comment GNU +@deftypefun {wchar_t *} wcpcpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}) +This function is like @code{wcscpy}, except that it returns a pointer to +the end of the string @var{wto} (that is, the address of the terminating +null wide character @code{wto + wcslen (wfrom)}) rather than the beginning. + +This function is not part of ISO or POSIX but was found useful while +developing the GNU C Library itself. + +The behavior of @code{wcpcpy} is undefined if the strings overlap. + +@code{wcpcpy} is a GNU extension and is declared in @file{wchar.h}. @end deftypefun @comment string.h @comment GNU -@deftypefun {char *} stpncpy (char *@var{to}, const char *@var{from}, size_t @var{size}) +@deftypefun {char *} stpncpy (char *restrict @var{to}, const char *restrict @var{from}, size_t @var{size}) This function is similar to @code{stpcpy} but copies always exactly @var{size} characters into @var{to}. @@ -420,7 +696,35 @@ is implemented to be useful in contexts where this behaviour of the This function is not part of ISO or POSIX but was found useful while developing the GNU C Library itself. +Its behaviour is undefined if the strings overlap. The function is +declared in @file{string.h}. +@end deftypefun + +@comment wchar.h +@comment GNU +@deftypefun {wchar_t *} wcpncpy (wchar_t *restrict @var{wto}, const wchar_t *restrict @var{wfrom}, size_t @var{size}) +This function is similar to @code{wcpcpy} but always copies exactly +@var{size} wide characters into @var{wto}. + +If the length of @var{wfrom} is more than @var{size}, then +@code{wcpncpy} copies just the first @var{size} wide characters and +returns a pointer to the wide character directly following the one which +was copied last. Note that in this case there is no null terminator +written into @var{wto}. 
+ +If the length of @var{wfrom} is less than @var{size}, then @code{wcpncpy} +copies all of @var{wfrom}, followed by enough null characters to add up +to @var{size} characters in all. This behaviour is rarely useful, but it +is implemented to be useful in contexts where this behaviour of the +@code{wcsncpy} is used. @code{wcpncpy} returns a pointer to the +@emph{first} written null character. + +This function is not part of ISO or POSIX b |
