| author | Roland McGrath <roland@gnu.org> | 1995-02-18 01:27:10 +0000 |
|---|---|---|
| committer | Roland McGrath <roland@gnu.org> | 1995-02-18 01:27:10 +0000 |
| commit | 28f540f45bbacd939bfd07f213bcad2bf730b1bf (patch) | |
| tree | 15f07c4c43d635959c6afee96bde71fb1b3614ee /manual/pattern.texi | |
initial import
Diffstat (limited to 'manual/pattern.texi')
| -rw-r--r-- | manual/pattern.texi | 1189 |
1 files changed, 1189 insertions, 0 deletions
diff --git a/manual/pattern.texi b/manual/pattern.texi new file mode 100644 index 0000000000..903aa48073 --- /dev/null +++ b/manual/pattern.texi @@ -0,0 +1,1189 @@ +@node Pattern Matching, I/O Overview, Searching and Sorting, Top +@chapter Pattern Matching + +The GNU C Library provides pattern matching facilities for two kinds of +patterns: regular expressions and file-name wildcards. The library also +provides a facility for expanding variable and command references and +parsing text into words in the way the shell does. + +@menu +* Wildcard Matching:: Matching a wildcard pattern against a single string. +* Globbing:: Finding the files that match a wildcard pattern. +* Regular Expressions:: Matching regular expressions against strings. +* Word Expansion:: Expanding shell variables, nested commands, + arithmetic, and wildcards. + This is what the shell does with shell commands. +@end menu + +@node Wildcard Matching +@section Wildcard Matching + +@pindex fnmatch.h +This section describes how to match a wildcard pattern against a +particular string. The result is a yes or no answer: does the +string fit the pattern or not. The symbols described here are all +declared in @file{fnmatch.h}. + +@comment fnmatch.h +@comment POSIX.2 +@deftypefun int fnmatch (const char *@var{pattern}, const char *@var{string}, int @var{flags}) +This function tests whether the string @var{string} matches the pattern +@var{pattern}. It returns @code{0} if they do match; otherwise, it +returns the nonzero value @code{FNM_NOMATCH}. The arguments +@var{pattern} and @var{string} are both strings. + +The argument @var{flags} is a combination of flag bits that alter the +details of matching. See below for a list of the defined flags. + +In the GNU C Library, @code{fnmatch} cannot experience an ``error''---it +always returns an answer for whether the match succeeds. However, other +implementations of @code{fnmatch} might sometimes report ``errors''. 
+They would do so by returning nonzero values that are not equal to +@code{FNM_NOMATCH}. +@end deftypefun + +These are the available flags for the @var{flags} argument: + +@table @code +@comment fnmatch.h +@comment GNU +@item FNM_FILE_NAME +Treat the @samp{/} character specially, for matching file names. If +this flag is set, wildcard constructs in @var{pattern} cannot match +@samp{/} in @var{string}. Thus, the only way to match @samp{/} is with +an explicit @samp{/} in @var{pattern}. + +@comment fnmatch.h +@comment POSIX.2 +@item FNM_PATHNAME +This is an alias for @code{FNM_FILE_NAME}; it comes from POSIX.2. We +don't recommend this name because we don't use the term ``pathname'' for +file names. + +@comment fnmatch.h +@comment POSIX.2 +@item FNM_PERIOD +Treat the @samp{.} character specially if it appears at the beginning of +@var{string}. If this flag is set, wildcard constructs in @var{pattern} +cannot match @samp{.} as the first character of @var{string}. + +If you set both @code{FNM_PERIOD} and @code{FNM_FILE_NAME}, then the +special treatment applies to @samp{.} following @samp{/} as well as to +@samp{.} at the beginning of @var{string}. (The shell uses the +@code{FNM_PERIOD} and @code{FNM_FILE_NAME} flags together for matching +file names.) + +@comment fnmatch.h +@comment POSIX.2 +@item FNM_NOESCAPE +Don't treat the @samp{\} character specially in patterns. Normally, +@samp{\} quotes the following character, turning off its special meaning +(if any) so that it matches only itself. When quoting is enabled, the +pattern @samp{\?} matches only the string @samp{?}, because the question +mark in the pattern acts like an ordinary character. + +If you use @code{FNM_NOESCAPE}, then @samp{\} is an ordinary character. + +@comment fnmatch.h +@comment GNU +@item FNM_LEADING_DIR +Ignore a trailing sequence of characters starting with a @samp{/} in +@var{string}; that is to say, test whether @var{string} starts with a +directory name that @var{pattern} matches. 
+ +If this flag is set, either @samp{foo*} or @samp{foobar} as a pattern +would match the string @samp{foobar/frobozz}. + +@comment fnmatch.h +@comment GNU +@item FNM_CASEFOLD +Ignore case in comparing @var{string} to @var{pattern}. +@end table + +@node Globbing +@section Globbing + +@cindex globbing +The archetypal use of wildcards is for matching against the files in a +directory, and making a list of all the matches. This is called +@dfn{globbing}. + +You could do this using @code{fnmatch}, by reading the directory entries +one by one and testing each one with @code{fnmatch}. But that would be +slow (and complex, since you would have to handle subdirectories by +hand). + +The library provides a function @code{glob} to make this particular use +of wildcards convenient. @code{glob} and the other symbols in this +section are declared in @file{glob.h}. + +@menu +* Calling Glob:: Basic use of @code{glob}. +* Flags for Globbing:: Flags that enable various options in @code{glob}. +@end menu + +@node Calling Glob +@subsection Calling @code{glob} + +The result of globbing is a vector of file names (strings). To return +this vector, @code{glob} uses a special data type, @code{glob_t}, which +is a structure. You pass @code{glob} the address of the structure, and +it fills in the structure's fields to tell you about the results. + +@comment glob.h +@comment POSIX.2 +@deftp {Data Type} glob_t +This data type holds a pointer to a word vector. More precisely, it +records both the address of the word vector and its size. + +@table @code +@item gl_pathc +The number of elements in the vector. + +@item gl_pathv +The address of the vector. This field has type @w{@code{char **}}. + +@item gl_offs +The offset of the first real element of the vector, from its nominal +address in the @code{gl_pathv} field. Unlike the other fields, this +is always an input to @code{glob}, rather than an output from it. 
+ +If you use a nonzero offset, then that many elements at the beginning of +the vector are left empty. (The @code{glob} function fills them with +null pointers.) + +The @code{gl_offs} field is meaningful only if you use the +@code{GLOB_DOOFFS} flag. Otherwise, the offset is always zero +regardless of what is in this field, and the first real element comes at +the beginning of the vector. +@end table +@end deftp + +@comment glob.h +@comment POSIX.2 +@deftypefun int glob (const char *@var{pattern}, int @var{flags}, int (*@var{errfunc}) (const char *@var{filename}, int @var{error-code}), glob_t *@var{vector-ptr}) +The function @code{glob} does globbing using the pattern @var{pattern} +in the current directory. It puts the result in a newly allocated +vector, and stores the size and address of this vector into +@code{*@var{vector-ptr}}. The argument @var{flags} is a combination of +bit flags; see @ref{Flags for Globbing}, for details of the flags. + +The result of globbing is a sequence of file names. The function +@code{glob} allocates a string for each resulting word, then +allocates a vector of type @code{char **} to store the addresses of +these strings. The last element of the vector is a null pointer. +This vector is called the @dfn{word vector}. + +To return this vector, @code{glob} stores both its address and its +length (number of elements, not counting the terminating null pointer) +into @code{*@var{vector-ptr}}. + +Normally, @code{glob} sorts the file names alphabetically before +returning them. You can turn this off with the flag @code{GLOB_NOSORT} +if you want to get the information as fast as possible. Usually it's +a good idea to let @code{glob} sort them---if you process the files in +alphabetical order, the users will have a feel for the rate of progress +that your application is making. + +If @code{glob} succeeds, it returns 0. 
Otherwise, it returns one +of these error codes: + +@table @code +@comment glob.h +@comment POSIX.2 +@item GLOB_ABORTED +There was an error opening a directory, and you used the flag +@code{GLOB_ERR} or your specified @var{errfunc} returned a nonzero +value. +@iftex +See below +@end iftex +@ifinfo +@xref{Flags for Globbing}, +@end ifinfo +for an explanation of the @code{GLOB_ERR} flag and @var{errfunc}. + +@comment glob.h +@comment POSIX.2 +@item GLOB_NOMATCH +The pattern didn't match any existing files. If you use the +@code{GLOB_NOCHECK} flag, then you never get this error code, because +that flag tells @code{glob} to @emph{pretend} that the pattern matched +at least one file. + +@comment glob.h +@comment POSIX.2 +@item GLOB_NOSPACE +It was impossible to allocate memory to hold the result. +@end table + +In the event of an error, @code{glob} stores information in +@code{*@var{vector-ptr}} about all the matches it has found so far. +@end deftypefun + +@node Flags for Globbing +@subsection Flags for Globbing + +This section describes the flags that you can specify in the +@var{flags} argument to @code{glob}. Choose the flags you want, +and combine them with the C bitwise OR operator @code{|}. + +@table @code +@comment glob.h +@comment POSIX.2 +@item GLOB_APPEND +Append the words from this expansion to the vector of words produced by +previous calls to @code{glob}. This way you can effectively expand +several words as if they were concatenated with spaces between them. + +In order for appending to work, you must not modify the contents of the +word vector structure between calls to @code{glob}. And, if you set +@code{GLOB_DOOFFS} in the first call to @code{glob}, you must also +set it when you append to the results. + +Note that the pointer stored in @code{gl_pathv} may no longer be valid +after you call @code{glob} the second time, because @code{glob} might +have relocated the vector. 
So always fetch @code{gl_pathv} from the +@code{glob_t} structure after each @code{glob} call; @strong{never} save +the pointer across calls. + +@comment glob.h +@comment POSIX.2 +@item GLOB_DOOFFS +Leave blank slots at the beginning of the vector of words. +The @code{gl_offs} field says how many slots to leave. +The blank slots contain null pointers. + +@comment glob.h +@comment POSIX.2 +@item GLOB_ERR +Give up right away and report an error if there is any difficulty +reading the directories that must be read in order to expand @var{pattern} +fully. Such difficulties might include a directory in which you don't +have the requisite access. Normally, @code{glob} tries its best to keep +on going despite any errors, reading whatever directories it can. + +You can exercise even more control than this by specifying an +error-handler function @var{errfunc} when you call @code{glob}. If +@var{errfunc} is not a null pointer, then @code{glob} doesn't give up +right away when it can't read a directory; instead, it calls +@var{errfunc} with two arguments, like this: + +@smallexample +(*@var{errfunc}) (@var{filename}, @var{error-code}) +@end smallexample + +@noindent +The argument @var{filename} is the name of the directory that +@code{glob} couldn't open or couldn't read, and @var{error-code} is the +@code{errno} value that was reported to @code{glob}. + +If the error handler function returns nonzero, then @code{glob} gives up +right away. Otherwise, it continues. + +@comment glob.h +@comment POSIX.2 +@item GLOB_MARK +If the pattern matches the name of a directory, append @samp{/} to the +directory's name when returning it. + +@comment glob.h +@comment POSIX.2 +@item GLOB_NOCHECK +If the pattern doesn't match any file names, return the pattern itself +as if it were a file name that had been matched. (Normally, when the +pattern doesn't match anything, @code{glob} returns that there were no +matches.) 
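The C fragment below is a minimal sketch of how the @var{flags} and @var{errfunc} arguments described so far fit together. It calls glob with GLOB_MARK and an error handler that reports the unreadable directory and returns 0, so globbing continues. The names list_matches and report_unreadable are hypothetical, chosen only for illustration. The sketch also assumes a POSIX system that provides globfree to release the word vector, a function this part of the manual does not describe.

```c
#include <glob.h>
#include <stdio.h>
#include <string.h>

/* Error handler passed as errfunc: report the directory that glob could
   not open or read, then return 0 so that glob keeps going. */
static int report_unreadable(const char *filename, int error_code)
{
    fprintf(stderr, "cannot read %s: %s\n", filename, strerror(error_code));
    return 0;
}

/* Hypothetical helper: print every file name matching PATTERN and return
   the number of matches.  Returns 0 when nothing matches (GLOB_NOMATCH)
   and -1 when glob fails for any other reason. */
int list_matches(const char *pattern)
{
    glob_t results;
    int status;
    int count;
    size_t i;

    /* Start from a zeroed structure so globfree is safe on every path. */
    memset(&results, 0, sizeof results);

    /* GLOB_MARK appends '/' to each directory name in the results. */
    status = glob(pattern, GLOB_MARK, report_unreadable, &results);
    if (status != 0 && status != GLOB_NOMATCH) {
        globfree(&results);
        return -1;
    }

    /* Fetch gl_pathv from the structure each time; do not cache it. */
    for (i = 0; i < results.gl_pathc; i++)
        puts(results.gl_pathv[i]);

    count = (int) results.gl_pathc;
    globfree(&results);   /* assumed available; not described above */
    return count;
}
```

Calling list_matches ("*.texi"), for instance, would print the matching names in sorted order, because the sketch does not pass GLOB_NOSORT.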
+ +@comment glob.h +@comment POSIX.2 +@item GLOB_NOSORT +Don't sort the file names; return them in no particular order. +(In practice, the order will depend on the order of the entries in +the directory.) The only reason @emph{not} to sort is to save time. + +@comment glob.h +@comment POSIX.2 +@item GLOB_NOESCAPE +Don't treat the @samp{\} character specially in patterns. Normally, +@samp{\} quotes the following character, turning off its special meaning +(if any) so that it matches only itself. When quoting is enabled, the +pattern @samp{\?} matches only the string @samp{?}, because the question +mark in the pattern acts like an ordinary character. + +If you use @code{GLOB_NOESCAPE}, then @samp{\} is an ordinary character. + +@code{glob} does its work by calling the function @code{fnmatch} +repeatedly. It handles the flag @code{GLOB_NOESCAPE} by turning on the +@code{FNM_NOESCAPE} flag in calls to @code{fnmatch}. +@end table + +@node Regular Expressions +@section Regular Expression Matching + +The GNU C library supports two interfaces for matching regular +expressions. One is the standard POSIX.2 interface, and the other is +what the GNU system has had for many years. + +Both interfaces are declared in the header file @file{regex.h}. +If you define @w{@code{_POSIX_C_SOURCE}}, then only the POSIX.2 +functions, structures, and constants are declared. +@c !!! we only document the POSIX.2 interface here!! + +@menu +* POSIX Regexp Compilation:: Using @code{regcomp} to prepare to match. +* Flags for POSIX Regexps:: Syntax variations for @code{regcomp}. +* Matching POSIX Regexps:: Using @code{regexec} to match the compiled + pattern that you get from @code{regcomp}. +* Regexp Subexpressions:: Finding which parts of the string were matched. +* Subexpression Complications:: Find points of which parts were matched. +* Regexp Cleanup:: Freeing storage; reporting errors. 
+@end menu + +@node POSIX Regexp Compilation +@subsection POSIX Regular Expression Compilation + +Before you can actually match a regular expression, you must +@dfn{compile} it. This is not true compilation---it produces a special +data structure, not machine instructions. But it is like ordinary +compilation in that its purpose is to enable you to ``execute'' the +pattern fast. (@xref{Matching POSIX Regexps}, for how to use the +compiled regular expression for matching.) + +There is a special data type for compiled regular expressions: + +@comment regex.h +@comment POSIX.2 +@deftp {Data Type} regex_t +This type of object holds a compiled regular expression. +It is actually a structure. It has just one field that your programs +should look at: + +@table @code +@item re_nsub +This field holds the number of parenthetical subexpressions in the +regular expression that was compiled. +@end table + +There are several other fields, but we don't describe them here, because +only the functions in the library should use them. +@end deftp + +After you create a @code{regex_t} object, you can compile a regular +expression into it by calling @code{regcomp}. + +@comment regex.h +@comment POSIX.2 +@deftypefun int regcomp (regex_t *@var{compiled}, const char *@var{pattern}, int @var{cflags}) +The function @code{regcomp} ``compiles'' a regular expression into a +data structure that you can use with @code{regexec} to match against a +string. The compiled regular expression format is designed for +efficient matching. @code{regcomp} stores it into @code{*@var{compiled}}. + +It's up to you to allocate an object of type @code{regex_t} and pass its +address to @code{regcomp}. + +The argument @var{cflags} lets you specify various options that control +the syntax and semantics of regular expressions. @xref{Flags for POSIX +Regexps}. 
+ +If you use the flag @code{REG_NOSUB}, then @code{regcomp} omits from +the compiled regular expression the information necessary to record +how subexpressions actually match. In this case, you might as well +pass @code{0} for the @var{matchptr} and @var{nmatch} arguments when +you call @code{regexec}. + +If you don't use @code{REG_NOSUB}, then the compiled regular expression +does have the capacity to record how subexpressions match. Also, +@code{regcomp} tells you how many subexpressions @var{pattern} has, by +storing the number in @code{@var{compiled}->re_nsub}. You can use that +value to decide how long an array to allocate to hold information about +subexpression matches. + +@code{regcomp} returns @code{0} if it succeeds in compiling the regular +expression; otherwise, it returns a nonzero error code (see the table +below). You can use @code{regerror} to produce an error message string +describing the reason for a nonzero value; see @ref{Regexp Cleanup}. + +@end deftypefun + +Here are the possible nonzero values that @code{regcomp} can return: + +@table @code +@comment regex.h +@comment POSIX.2 +@item REG_BADBR +There was an invalid @samp{\@{@dots{}\@}} construct in the regular +expression. A valid @samp{\@{@dots{}\@}} construct must contain either +a single number, or two numbers in increasing order separated by a +comma. + +@comment regex.h +@comment POSIX.2 +@item REG_BADPAT +There was a syntax error in the regular expression. + +@comment regex.h +@comment POSIX.2 +@item REG_BADRPT +A repetition operator such as @samp{?} or @samp{*} appeared in a bad +position (with no preceding subexpression to act on). + +@comment regex.h +@comment POSIX.2 +@item REG_ECOLLATE +The regular expression referred to an invalid collating element (one not +defined in the current locale for string collation). @xref{Locale +Categories}. + +@comment regex.h +@comment POSIX.2 +@item REG_ECTYPE +The regular expression referred to an invalid character class name. 
+ +@comment regex.h +@comment POSIX.2 +@item REG_EESCAPE +The regular expression ended with @samp{\}. + +@comment regex.h +@comment POSIX.2 +@item REG_ESUBREG +There was an invalid number in the @samp{\@var{digit}} construct. + +@comment regex.h +@comment POSIX.2 +@item REG_EBRACK +There were unbalanced square brackets in the regular expression. + +@comment regex.h +@comment POSIX.2 +@item REG_EPAREN +An extended regular expression had unbalanced parentheses, +or a basic regular expression had unbalanced @samp{\(} and @samp{\)}. + +@comment regex.h +@comment POSIX.2 +@item REG_EBRACE +The regular expression had unbalanced @samp{\@{} and @samp{\@}}. + +@comment regex.h +@comment POSIX.2 +@item REG_ERANGE +One of the endpoints in a range expression was invalid. + +@comment regex.h +@comment POSIX.2 +@item REG_ESPACE +@code{regcomp} ran out of memory. +@end table + +@node Flags for POSIX Regexps +@subsection Flags for POSIX Regular Expressions + +These are the bit flags that you can use in the @var{cflags} operand when +compiling a regular expression with @code{regcomp}. + +@table @code +@comment regex.h +@comment POSIX.2 +@item REG_EXTENDED +Treat the pattern as an extended regular expression, rather than as a +basic regular expression. + +@comment regex.h +@comment POSIX.2 +@item REG_ICASE +Ignore case when matching letters. + +@comment regex.h +@comment POSIX.2 +@item REG_NOSUB +Don't bother storing the contents of the @var{matchptr} array. + +@comment regex.h +@comment POSIX.2 +@item REG_NEWLINE +Treat a newline in @var{string} as dividing @var{string} into multiple +lines, so that @samp{$} can match before the newline and @samp{^} can +match after. Also, don't permit @samp{.} to match a newline, and don't +permit @samp{[^@dots{}]} to match a newline. + +Otherwise, newline acts like any other ordinary character. 
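As a rough illustration of combining these flags, the following C sketch compiles a pattern with REG_EXTENDED, REG_ICASE and REG_NOSUB, reports a compilation failure through regerror, and then asks regexec for a plain yes-or-no answer. The function name matches_ignoring_case is hypothetical. The sketch also assumes the POSIX function regfree for releasing the compiled pattern, which this part of the manual does not cover.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if TEXT contains a match for the extended
   regular expression PATTERN (ignoring case), 0 if it does not, and -1 if
   PATTERN does not compile or regexec reports an error. */
int matches_ignoring_case(const char *pattern, const char *text)
{
    regex_t compiled;
    char message[128];
    int status;

    /* REG_NOSUB: only a yes/no answer is wanted, so no offsets are kept. */
    status = regcomp(&compiled, pattern,
                     REG_EXTENDED | REG_ICASE | REG_NOSUB);
    if (status != 0) {
        /* Turn the nonzero error code into a readable message. */
        regerror(status, &compiled, message, sizeof message);
        fprintf(stderr, "regcomp: %s\n", message);
        return -1;
    }

    /* With REG_NOSUB, pass 0 for nmatch and NULL for matchptr. */
    status = regexec(&compiled, text, 0, NULL, 0);
    regfree(&compiled);   /* assumed available; not described above */

    if (status == 0)
        return 1;
    return status == REG_NOMATCH ? 0 : -1;
}
```

Because the pattern is compiled afresh on every call, this form suits one-off tests; a program that matches the same pattern many times would keep the regex_t object around instead.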
+@end table + +@node Matching POSIX Regexps +@subsection Matching a Compiled POSIX Regular Expression + +Once you have compiled a regular expression, as described in @ref{POSIX +Regexp Compilation}, you can match it against strings using +@code{regexec}. A match anywhere inside the string counts as success, +unless the regular expression contains anchor characters (@samp{^} or +@samp{$}). + +@comment regex.h +@comment POSIX.2 +@deftypefun int regexec (regex_t *@var{compiled}, char *@var{string}, size_t @var{nmatch}, regmatch_t @var{matchptr} @t{[]}, int @var{eflags}) +This function tries to match the compiled regular expression +@code{*@var{compiled}} against @var{string}. + +@code{regexec} returns @code{0} if the regular expression matches; +otherwise, it returns a nonzero value. See the table below for +what nonzero values mean. You can use @code{regerror} to produce an +error message string describing the reason for a nonzero value; +see @ref{Regexp Cleanup}. + +The argument @var{eflags} is a word of bit flags that enable various +options. + +If you want to get information about what part of @var{string} actually +matched the regular expression or its subexpressions, use the arguments +@var{matchptr} and @var{nmatch}. Otherwise, pass @code{0} for +@var{nmatch}, and @code{NULL} for @var{matchptr}. @xref{Regexp +Subexpressions}. +@end deftypefun + +You must match the regular expression with the same set of current +locales that were in effect when you compiled the regular expression. + +The function @code{regexec} accepts the following flags in the +@var{eflags} argument: + +@table @code +@comment regex.h +@comment POSIX.2 +@item REG_NOTBOL +Do not regard the beginning of the specified string as the beginning of +a line; more generally, don't make any assumptions about what text might +precede it. 
+ +@comment regex.h +@comment POSIX.2 +@item REG_NOTEOL +Do not regard the end of the specified string as the end of a line; more +generally, don't make any assumptions about what text might follow it. +@end table + +Here are the possible nonzero values that @code{regexec} can return: + +@table @code +@comment regex.h +@comment POSIX.2 +@item REG_NOMATCH +The pattern didn't match the string. This isn't really an error. + +@comment regex.h +@comment POSIX.2 +@item REG_ESPACE +@code{regexec} ran out of memory. +@end table + +@node Regexp Subexpressions +@subsection Match Results with Subexpressions + +When @code{regexec} matches parenthetical subexpressions of +@var{pattern}, it records which parts of @var{string} they match. It +returns that information by storing the offsets into an array whose +elements are structures of type @code{regmatch_t}. The first element of +the array (index @code{0}) records the part of the string that matched +the entire regular expression. Each other element of the array records +the beginning and end of the part that matched a single parenthetical +subexpression. + +@comment regex.h +@comment POSIX.2 +@deftp {Data Type} regmatch_t +This is the data type of the @var{matchptr} array that you pass to +@code{regexec}. It contains two structure fields, as follows: + +@table @code +@item rm_so +The offset in @var{string} of the beginning of a substring. Add this +value to @var{string} to get the address of that part. + +@item rm_eo +The offset in @var{string} of the end of the substring. +@end table +@end deftp + +@comment regex.h +@comment POSIX.2 +@deftp {Data Type} regoff_t +@code{regoff_t} is an alias for another signed integer type. +The fields of @code{regmatch_t} have type @code{regoff_t}. +@end deftp + +The @code{regmatch_t} elements correspond to subexpressions +positionally; the element at index @code{1} records where the first +subexpression matched, the element at index @code{2} records the second +subexpression, and so on. 
The order of the subexpressions is the order +in which they begin. + +When you call @code{regexec}, you specify how long the @var{matchptr} +array is, with the @var{nmatch} argument. This tells @code{regexec} how +many elements to store. If the actual regular expression has more than +@var{nmatch} subexpressions, then you won't get offset information about +the rest of them. But this doesn't alter whether the pattern matches a +particular string or not. + +If you don't want @code{regexec} to return any information about where +the subexpressions matched, you can either supply @code{0} for +@var{nmatch}, or use the flag @code{REG_NOSUB} when you compile the +pattern with @code{regcomp}. + +@node Subexpression Complications +@subsection Complications in Subexpression Matching + +Sometimes a subexpression matches a substring of no characters. This +happens when @samp{f\(o*\)} matches the string @samp{fum}. (It really +matches just the @samp{f}.) In this case, both of the offsets identify +the point in the string where the null substring was found. In this +example, the offsets are both @code{1}. + +Sometimes the entire regular expression can match without using some of +its subexpressions at all---for example, when @samp{ba\(na\)*} matches the +string @samp{ba}, the parenthetical subexpression is not used. When +this happens, @code{regexec} stores @code{-1} in both fields of the +element for that subexpression. + +Sometimes matching the entire regular expression can match a particular +subexpression more than once---for example, when @samp{ba\(na\)*} +matches the string @samp{bananana}, the parenthetical subexpression +matches three times. When this happens, @code{regexec} usually st |
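The first two complications described above can be checked with a short C sketch. It compiles a basic regular expression, asks regexec for two regmatch_t elements, and hands back the offsets recorded for the first parenthetical subexpression. The helper name first_group_span is hypothetical, and the sketch assumes the POSIX function regfree, which this part of the manual does not describe.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match the basic regular expression PATTERN against
   TEXT and store the offsets of the first parenthetical subexpression in
   *START and *END.  Returns 0 on a match; otherwise returns the nonzero
   value from regcomp or regexec and leaves *START and *END untouched. */
int first_group_span(const char *pattern, const char *text,
                     long *start, long *end)
{
    regex_t compiled;
    regmatch_t matches[2];   /* [0]: whole match, [1]: first subexpression */
    int status;

    status = regcomp(&compiled, pattern, 0);   /* 0: basic syntax, no flags */
    if (status != 0)
        return status;

    /* nmatch is 2, so regexec fills in matches[0] and matches[1]. */
    status = regexec(&compiled, text, 2, matches, 0);
    if (status == 0) {
        *start = (long) matches[1].rm_so;
        *end = (long) matches[1].rm_eo;
    }

    regfree(&compiled);   /* assumed available; not described above */
    return status;
}
```

With the examples from the text, first_group_span ("f\\(o*\\)", "fum", ...) yields 1 for both offsets (an empty substring just after the f), and first_group_span ("ba\\(na\\)*", "ba", ...) yields -1 for both, because the subexpression is not used at all.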
