'\"
|
|
'\" Copyright (c) 1998 Sun Microsystems, Inc.
|
|
'\" Copyright (c) 1999 Scriptics Corporation
|
|
'\"
|
|
'\" See the file "license.terms" for information on usage and redistribution
|
|
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
|
|
'\"
|
|
.so man.macros
|
|
.ie '\w'o''\w'\C'^o''' .ds qo \C'^o'
|
|
.el .ds qo u
|
|
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
|
|
.BS
|
|
.SH NAME
|
|
re_syntax \- Syntax of Tcl regular expressions
|
|
.BE
|
|
.SH DESCRIPTION
|
|
.PP
|
|
A \fIregular expression\fR describes strings of characters.
|
|
It's a pattern that matches certain strings and does not match others.
|
|
.SH "DIFFERENT FLAVORS OF REs"
|
|
Regular expressions
|
|
.PQ RE s ,
|
|
as defined by POSIX, come in two flavors: \fIextended\fR REs
|
|
.PQ ERE s
|
|
and \fIbasic\fR REs
|
|
.PQ BRE s .
|
|
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
|
|
roughly those of the traditional \fIed\fR. This implementation adds
|
|
a third flavor, \fIadvanced\fR REs
|
|
.PQ ARE s ,
|
|
basically EREs with some significant extensions.
|
|
.PP
|
|
This manual page primarily describes AREs. BREs mostly exist for
|
|
backward compatibility in some old programs; they will be discussed at
|
|
the end. POSIX EREs are almost an exact subset of AREs. Features of
|
|
AREs that are not present in EREs will be indicated.
|
|
.SH "REGULAR EXPRESSION SYNTAX"
|
|
.PP
|
|
Tcl regular expressions are implemented using the package written by
|
|
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
|
|
the Perl5 extensions (thanks, Henry!). Much of the description of
|
|
regular expressions below is copied verbatim from his manual entry.
|
|
.PP
|
|
An ARE is one or more \fIbranches\fR,
|
|
separated by
|
|
.QW \fB|\fR ,
|
|
matching anything that matches any of the branches.
|
|
.PP
|
|
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
|
|
concatenated.
|
|
It matches a match for the first, followed by a match for the second, and so on;
|
|
an empty branch matches the empty string.
|
|
.SS QUANTIFIERS
|
|
A quantified atom is an \fIatom\fR possibly followed
|
|
by a single \fIquantifier\fR.
|
|
Without a quantifier, it matches a single match for the atom.
|
|
The quantifiers,
|
|
and what a so-quantified atom matches, are:
|
|
.RS 2
|
|
.TP 6
|
|
\fB*\fR
|
|
.
|
|
a sequence of 0 or more matches of the atom
|
|
.TP
|
|
\fB+\fR
|
|
.
|
|
a sequence of 1 or more matches of the atom
|
|
.TP
|
|
\fB?\fR
|
|
.
|
|
a sequence of 0 or 1 matches of the atom
|
|
.TP
|
|
\fB{\fIm\fB}\fR
|
|
.
|
|
a sequence of exactly \fIm\fR matches of the atom
|
|
.TP
|
|
\fB{\fIm\fB,}\fR
|
|
.
|
|
a sequence of \fIm\fR or more matches of the atom
|
|
.TP
|
|
\fB{\fIm\fB,\fIn\fB}\fR
|
|
.
|
|
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
|
|
\fIm\fR may not exceed \fIn\fR
|
|
.TP
|
|
\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR
|
|
.
|
|
\fInon-greedy\fR quantifiers, which match the same possibilities,
|
|
but prefer the smallest number rather than the largest number
|
|
of matches (see \fBMATCHING\fR)
|
|
.RE
|
|
.PP
|
|
The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
|
|
numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
|
|
permissible values from 0 to 255 inclusive.
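.PP
As a rough sketch of the quantifiers above, the following calls use the
\fBregexp\fR command with its \fB\-inline\fR switch. The subject string
\fBbaaad\fR is an arbitrary sample chosen for illustration, not part of
the syntax:
.PP
.CS
regexp \-inline {a+} baaad       ;# returns aaa (greedy)
regexp \-inline {a+?} baaad      ;# returns a (non-greedy)
regexp \-inline {a{2}} baaad     ;# returns aa
regexp \-inline {a{2,}} baaad    ;# returns aaa
regexp \-inline {a{1,2}?} baaad  ;# returns a
.CE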
|
|
.SS ATOMS
|
|
An atom is one of:
|
|
.RS 2
|
|
.IP \fB(\fIre\fB)\fR 6
|
|
matches a match for \fIre\fR (\fIre\fR is any regular expression) with
|
|
the match noted for possible reporting
|
|
.IP \fB(?:\fIre\fB)\fR
|
|
as previous, but does no reporting (a
|
|
.QW non-capturing
|
|
set of parentheses)
|
|
.IP \fB()\fR
|
|
matches an empty string, noted for possible reporting
|
|
.IP \fB(?:)\fR
|
|
matches an empty string, without reporting
|
|
.IP \fB[\fIchars\fB]\fR
|
|
a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
|
|
\fBBRACKET EXPRESSIONS\fR for more detail)
|
|
.IP \fB.\fR
|
|
matches any single character
|
|
.IP \fB\e\fIk\fR
|
|
matches the non-alphanumeric character \fIk\fR
|
|
taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
|
|
character
|
|
.IP \fB\e\fIc\fR
|
|
where \fIc\fR is alphanumeric (possibly followed by other characters),
|
|
an \fIescape\fR (AREs only), see \fBESCAPES\fR below
|
|
.IP \fB{\fR
|
|
when followed by a character other than a digit, matches the
|
|
left-brace character
|
|
.QW \fB{\fR ;
|
|
when followed by a digit, it is the beginning of a \fIbound\fR (see above)
|
|
.IP \fIx\fR
|
|
where \fIx\fR is a single character with no other significance,
|
|
matches that character.
|
|
.RE
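.PP
A minimal sketch of how the capturing and non-capturing parenthesized atoms
differ when match variables are requested. The variable names
(\fBwhole\fR, \fBfirst\fR, \fBsecond\fR) and the input string are
hypothetical:
.PP
.CS
regexp {(a|b)(?:c|d)(e)} xbdey whole first second
# whole is "bde", first is "b", second is "e";
# the (?:c|d) group matched "d" but reported nothing
.CE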
|
|
.SS CONSTRAINTS
|
|
A \fIconstraint\fR matches an empty string when specific conditions
|
|
are met. A constraint may not be followed by a quantifier. The
|
|
simple constraints are as follows; some more constraints are described
|
|
later, under \fBESCAPES\fR.
|
|
.RS 2
|
|
.TP 8
|
|
\fB^\fR
|
|
.
|
|
matches at the beginning of the string or a line (according to whether
|
|
matching is newline-sensitive or not, as described in \fBMATCHING\fR,
|
|
below).
|
|
.TP
|
|
\fB$\fR
|
|
.
|
|
matches at the end of the string or a line (according to whether
|
|
matching is newline-sensitive or not, as described in \fBMATCHING\fR,
|
|
below).
|
|
.RS
|
|
.PP
|
|
The difference between string and line matching modes is immaterial
|
|
when the string does not contain a newline character. The \fB\eA\fR
|
|
and \fB\eZ\fR constraint escapes have a similar purpose but are
|
|
always constraints for the overall string.
|
|
.PP
|
|
The default newline-sensitivity depends on the command that uses the
|
|
regular expression, and can be overridden as described in
|
|
\fBMETASYNTAX\fR, below.
|
|
.RE
|
|
.TP
|
|
\fB(?=\fIre\fB)\fR
|
|
.
|
|
\fIpositive lookahead\fR (AREs only), matches at any point where a
|
|
substring matching \fIre\fR begins
|
|
.TP
|
|
\fB(?!\fIre\fB)\fR
|
|
.
|
|
\fInegative lookahead\fR (AREs only), matches at any point where no
|
|
substring matching \fIre\fR begins
|
|
.RE
|
|
.PP
|
|
The lookahead constraints may not contain back references (see later),
|
|
and all parentheses within them are considered non-capturing.
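.PP
The lookahead constraints can be sketched as follows, assuming the
\fBregexp\fR command and some made-up input strings. Note that the text
examined by the lookahead is not part of the reported match:
.PP
.CS
regexp \-inline {foo(?=bar)} foobar   ;# returns foo
regexp \-inline {foo(?!bar)} foobar   ;# returns an empty list (no match)
regexp \-inline {foo(?!bar)} foobaz   ;# returns foo
.CE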
|
|
.PP
|
|
An RE may not end with
|
|
.QW \fB\e\fR .
|
|
.SH "BRACKET EXPRESSIONS"
|
|
A \fIbracket expression\fR is a list of characters enclosed in
|
|
.QW \fB[\|]\fR .
|
|
It normally matches any single character from the list
|
|
(but see below). If the list begins with
|
|
.QW \fB^\fR ,
|
|
it matches any single character (but see below) \fInot\fR from the
|
|
rest of the list.
|
|
.PP
|
|
If two characters in the list are separated by
|
|
.QW \fB\-\fR ,
|
|
this is shorthand for the full \fIrange\fR of characters between those two
|
|
(inclusive) in the collating sequence, e.g.
|
|
.QW \fB[0\-9]\fR
|
|
in Unicode matches any conventional decimal digit. Two ranges may not share an
|
|
endpoint, so e.g.
|
|
.QW \fBa\-c\-e\fR
|
|
is illegal. Ranges in Tcl always use the
|
|
Unicode collating sequence, but other programs may use other collating
|
|
sequences and this can be a source of incompatibility between programs.
|
|
.PP
|
|
To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
|
|
method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
|
|
collating element (see below). Alternatively, make it the first
|
|
character (following a possible
|
|
.QW \fB^\fR ),
|
|
or (AREs only) precede it with
|
|
.QW \fB\e\fR .
|
|
Alternatively, for
|
|
.QW \fB\-\fR ,
|
|
make it the last character, or the second endpoint of a range. To use
|
|
a literal \fB\-\fR as the first endpoint of a range, make it a
|
|
collating element or (AREs only) precede it with
|
|
.QW \fB\e\fR .
|
|
With the exception of
|
|
these, some combinations using \fB[\fR (see next paragraphs), and
|
|
escapes, all other special characters lose their special significance
|
|
within a bracket expression.
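.PP
The placement rules for a literal \fB]\fR or \fB\-\fR can be sketched as
below. The match variable \fBm\fR and the input strings are illustrative
assumptions:
.PP
.CS
regexp {[]a]+} {x]a]y} m       ;# m is "]a]": a leading ] is literal
regexp {[a\-]+} x\-a\-y m        ;# m is "-a-": a trailing - is literal
regexp {[\e\-a]+} x\-a\-y m      ;# m is "-a-": backslash form, AREs only
.CE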
|
|
.SS "CHARACTER CLASSES"
|
|
Within a bracket expression, the name of a \fIcharacter class\fR
|
|
enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
|
|
characters (not all collating elements!) belonging to that class.
|
|
Standard character classes are:
|
|
.IP \fBalpha\fR 8
|
|
A letter.
|
|
.IP \fBupper\fR 8
|
|
An upper-case letter.
|
|
.IP \fBlower\fR 8
|
|
A lower-case letter.
|
|
.IP \fBdigit\fR 8
|
|
A decimal digit.
|
|
.IP \fBxdigit\fR 8
|
|
A hexadecimal digit.
|
|
.IP \fBalnum\fR 8
|
|
An alphanumeric (letter or digit).
|
|
.IP \fBprint\fR 8
|
|
A "printable" (same as graph, except also including space).
|
|
.IP \fBblank\fR 8
|
|
A space or tab character.
|
|
.IP \fBspace\fR 8
|
|
A character producing white space in displayed text.
|
|
.IP \fBpunct\fR 8
|
|
A punctuation character.
|
|
.IP \fBgraph\fR 8
|
|
A character with a visible representation (includes both \fBalnum\fR
|
|
and \fBpunct\fR).
|
|
.IP \fBcntrl\fR 8
|
|
A control character.
|
|
.PP
|
|
A locale may provide others. A character class may not be used as an endpoint
|
|
of a range.
|
|
.RS
|
|
.PP
|
|
(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
|
|
locale, which supports exactly the above classes.)
|
|
.RE
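.PP
A short sketch of character classes in use. The class name always sits
inside an enclosing bracket expression, so the brackets appear doubled;
the sample text is arbitrary:
.PP
.CS
regexp \-inline {[[:alpha:]]+} {42 apples}     ;# returns apples
regexp {[[:digit:][:space:]]+} {42 apples} m  ;# m is "42 "
regexp {[^[:alnum:]]} {42 apples} m           ;# m is " "
.CE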
|
|
.SS "BRACKETED CONSTRAINTS"
|
|
There are two special cases of bracket expressions: the bracket
|
|
expressions
|
|
.QW \fB[[:<:]]\fR
|
|
and
|
|
.QW \fB[[:>:]]\fR
|
|
are constraints, matching empty strings at the beginning and end of a word
|
|
respectively.
|
|
.\" note, discussion of escapes below references this definition of word
|
|
A word is defined as a sequence of word characters that is neither preceded
|
|
nor followed by word characters. A word character is an \fIalnum\fR character
|
|
or an underscore
|
|
.PQ \fB_\fR "" .
|
|
These special bracket expressions are deprecated; users of AREs should use
|
|
constraint escapes instead (see below).
|
|
.SS "COLLATING ELEMENTS"
|
|
Within a bracket expression, a collating element (a character, a
|
|
multi-character sequence that collates as if it were a single
|
|
character, or a collating-sequence name for either) enclosed in
|
|
\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
|
|
collating element. The sequence is a single element of the bracket
|
|
expression's list. A bracket expression in a locale that has
|
|
multi-character collating elements can thus match more than one
|
|
character. So (insidiously), a bracket expression that starts with
|
|
\fB^\fR can match multi-character collating elements even if none of
|
|
them appear in the bracket expression!
|
|
.RS
|
|
.PP
|
|
(\fINote:\fR Tcl has no multi-character collating elements. This information
|
|
is only for illustration.)
|
|
.RE
|
|
.PP
|
|
For example, assume the collating sequence includes a \fBch\fR multi-character
|
|
collating element. Then the RE
|
|
.QW \fB[[.ch.]]*c\fR
|
|
(zero or more
|
|
.QW \fBch\fRs
|
|
followed by
|
|
.QW \fBc\fR )
|
|
matches the first five characters of
|
|
.QW \fBchchcc\fR .
|
|
Also, the RE
|
|
.QW \fB[^c]b\fR
|
|
matches all of
|
|
.QW \fBchb\fR
|
|
(because
|
|
.QW \fB[^c]\fR
|
|
matches the multi-character
|
|
.QW \fBch\fR ).
|
|
.SS "EQUIVALENCE CLASSES"
|
|
Within a bracket expression, a collating element enclosed in \fB[=\fR
|
|
and \fB=]\fR is an equivalence class, standing for the sequences of
|
|
characters of all collating elements equivalent to that one, including
|
|
itself. (If there are no other equivalent collating elements, the
|
|
treatment is as if the enclosing delimiters were
|
|
.QW \fB[.\fR \&
|
|
and
|
|
.QW \fB.]\fR .)
|
|
For example, if \fBo\fR and \fB\(^o\fR are the members of an
|
|
equivalence class, then
|
|
.QW \fB[[=o=]]\fR ,
|
|
.QW \fB[[=\(^o=]]\fR ,
|
|
and
|
|
.QW \fB[o\(^o]\fR \&
|
|
are all synonymous. An equivalence class may not be an endpoint of a range.
|
|
.RS
|
|
.PP
|
|
(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
|
|
equivalence classes. The examples above are just illustrations.)
|
|
.RE
|
|
.SH ESCAPES
|
|
Escapes (AREs only), which begin with a \fB\e\fR followed by an
|
|
alphanumeric character, come in several varieties: character entry,
|
|
class shorthands, constraint escapes, and back references. A \fB\e\fR
|
|
followed by an alphanumeric character but not constituting a valid
|
|
escape is illegal in AREs. In EREs, there are no escapes: outside a
|
|
bracket expression, a \fB\e\fR followed by an alphanumeric character
|
|
merely stands for that character as an ordinary character, and inside
|
|
a bracket expression, \fB\e\fR is an ordinary character. (The latter
|
|
is the one actual incompatibility between EREs and AREs.)
|
|
.SS "CHARACTER-ENTRY ESCAPES"
|
|
Character-entry escapes (AREs only) exist to make it easier to specify
|
|
non-printing and otherwise inconvenient characters in REs:
|
|
.RS 2
|
|
.TP 5
|
|
\fB\ea\fR
|
|
.
|
|
alert (bell) character, as in C
|
|
.TP
|
|
\fB\eb\fR
|
|
.
|
|
backspace, as in C
|
|
.TP
|
|
\fB\eB\fR
|
|
.
|
|
synonym for \fB\e\fR to help reduce backslash doubling in some
|
|
applications where there are multiple levels of backslash processing
|
|
.TP
|
|
\fB\ec\fIX\fR
|
|
.
|
|
(where \fIX\fR is any character) the character whose low-order 5 bits
|
|
are the same as those of \fIX\fR, and whose other bits are all zero
|
|
.TP
|
|
\fB\ee\fR
|
|
.
|
|
the character whose collating-sequence name is
|
|
.QW \fBESC\fR ,
|
|
or failing that, the character with octal value 033
|
|
.TP
|
|
\fB\ef\fR
|
|
.
|
|
formfeed, as in C
|
|
.TP
|
|
\fB\en\fR
|
|
.
|
|
newline, as in C
|
|
.TP
|
|
\fB\er\fR
|
|
.
|
|
carriage return, as in C
|
|
.TP
|
|
\fB\et\fR
|
|
.
|
|
horizontal tab, as in C
|
|
.TP
|
|
\fB\eu\fIwxyz\fR
|
|
.
|
|
(where \fIwxyz\fR is one to four hexadecimal digits) the Unicode
character \fBU+\fIwxyz\fR in the local byte ordering
|
|
.TP
|
|
\fB\eU\fIstuvwxyz\fR
|
|
.
|
|
(where \fIstuvwxyz\fR is one to eight hexadecimal digits) reserved
for a Unicode extension up to 21 bits. The digits are parsed until the
first non-hexadecimal character is encountered, the maximum of eight
hexadecimal digits is reached, or an overflow would occur beyond the maximum
value of \fBU+\fI10ffff\fR.
|
|
.TP
|
|
\fB\ev\fR
|
|
.
|
|
vertical tab, as in C
|
|
.TP
|
|
\fB\ex\fIhh\fR
|
|
.
|
|
(where \fIhh\fR is one or two hexadecimal digits) the character
|
|
whose hexadecimal value is \fB0x\fIhh\fR.
|
|
.TP
|
|
\fB\e0\fR
|
|
.
|
|
the character whose value is \fB0\fR
|
|
.TP
|
|
\fB\e\fIxyz\fR
|
|
.
|
|
(where \fIxyz\fR is exactly three octal digits, and is not a \fIback
|
|
reference\fR (see below)) the character whose octal value is
|
|
\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
|
|
the two-digit form is assumed.
|
|
.TP
|
|
\fB\e\fIxy\fR
|
|
.
|
|
(where \fIxy\fR is exactly two octal digits, and is not a \fIback
|
|
reference\fR (see below)) the character whose octal value is
|
|
\fB0\fIxy\fR
|
|
.RE
|
|
.PP
|
|
Hexadecimal digits are
|
|
.QR \fB0\fR \fB9\fR ,
|
|
.QR \fBa\fR \fBf\fR ,
|
|
and
|
|
.QR \fBA\fR \fBF\fR .
|
|
Octal digits are
|
|
.QR \fB0\fR \fB7\fR .
|
|
.PP
|
|
The character-entry escapes are always taken as ordinary characters.
|
|
For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
|
|
not terminate a bracket expression. Beware, however, that some
|
|
applications (e.g., C compilers and the Tcl interpreter if the regular
|
|
expression is not quoted with braces) interpret such sequences
|
|
themselves before the regular-expression package gets to see them,
|
|
which may require doubling (quadrupling, etc.) the
|
|
.QW \fB\e\fR .
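.PP
The effect of Tcl's own backslash processing can be sketched as follows,
assuming the RE is passed to the \fBregexp\fR command. Brace quoting
hands the backslash to the regular-expression package untouched, while
double quotes let the Tcl interpreter process it first:
.PP
.CS
regexp {\ed+} 123      ;# returns 1: the RE package sees \ed
regexp "\ed+" 123      ;# returns 0: Tcl reduces \ed to d first
regexp "\e\ed+" 123    ;# returns 1: the doubled backslash survives
regexp {\ex41} A       ;# returns 1: hexadecimal entry for A
.CE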
|
|
.SS "CLASS-SHORTHAND ESCAPES"
|
|
Class-shorthand escapes (AREs only) provide shorthands for certain
|
|
commonly-used character classes:
|
|
.RS 2
|
|
.TP 10
|
|
\fB\ed\fR
|
|
.
|
|
\fB[[:digit:]]\fR
|
|
.TP
|
|
\fB\es\fR
|
|
.
|
|
\fB[[:space:]]\fR
|
|
.TP
|
|
\fB\ew\fR
|
|
.
|
|
\fB[[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (including punctuation connector characters)
|
|
.TP
|
|
\fB\eD\fR
|
|
.
|
|
\fB[^[:digit:]]\fR
|
|
.TP
|
|
\fB\eS\fR
|
|
.
|
|
\fB[^[:space:]]\fR
|
|
.TP
|
|
\fB\eW\fR
|
|
.
|
|
\fB[^[:alnum:]_\eu203F\eu2040\eu2054\euFE33\euFE34\euFE4D\euFE4E\euFE4F\euFF3F]\fR (that is, anything other than alphanumerics, underscore, and punctuation connector characters)
|
|
.RE
|
|
.PP
|
|
Within bracket expressions,
|
|
.QW \fB\ed\fR ,
|
|
.QW \fB\es\fR ,
|
|
and
|
|
.QW \fB\ew\fR \&
|
|
lose their outer brackets, and
|
|
.QW \fB\eD\fR ,
|
|
.QW \fB\eS\fR ,
|
|
and
|
|
.QW \fB\eW\fR \&
|
|
are illegal. (So, for example,
|
|
.QW \fB[a-c\ed]\fR
|
|
is equivalent to
|
|
.QW \fB[a-c[:digit:]]\fR .
|
|
Also,
|
|
.QW \fB[a-c\eD]\fR ,
|
|
which is equivalent to
|
|
.QW \fB[a-c^[:digit:]]\fR ,
|
|
is illegal.)
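.PP
A brief sketch of the class shorthands, both on their own and inside a
bracket expression; the input strings are made up for illustration:
.PP
.CS
regexp \-inline {\ed+} {order 66 shipped}      ;# returns 66
regexp \-inline {[a\-c\ed]+} {zz b2a9 zz}       ;# returns b2a9
regexp \-all \-inline {\ew+} {foo_bar baz}      ;# returns foo_bar baz
regexp \-all \-inline {\eS+} {foo_bar baz}      ;# returns foo_bar baz
.CE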
|
|
.SS "CONSTRAINT ESCAPES"
|
|
A constraint escape (AREs only) is a constraint, matching the empty
|
|
string if specific conditions are met, written as an escape:
|
|
.RS 2
|
|
.TP 6
|
|
\fB\eA\fR
|
|
.
|
|
matches only at the beginning of the string (see \fBMATCHING\fR,
|
|
below, for how this differs from
|
|
.QW \fB^\fR )
|
|
.TP
|
|
\fB\em\fR
|
|
.
|
|
matches only at the beginning of a word
|
|
.TP
|
|
\fB\eM\fR
|
|
.
|
|
matches only at the end of a word
|
|
.TP
|
|
\fB\ey\fR
|
|
.
|
|
matches only at the beginning or end of a word
|
|
.TP
|
|
\fB\eY\fR
|
|
.
|
|
matches only at a point that is not the beginning or end of a word
|
|
.TP
|
|
\fB\eZ\fR
|
|
.
|
|
matches only at the end of the string (see \fBMATCHING\fR, below, for
|
|
how this differs from
|
|
.QW \fB$\fR )
|
|
.TP
|
|
\fB\e\fIm\fR
|
|
.
|
|
(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
|
|
.TP
|
|
\fB\e\fImnn\fR
|
|
.
|
|
(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
|
|
and the decimal value \fImnn\fR is not greater than the number of
|
|
closing capturing parentheses seen so far) a \fIback reference\fR, see
|
|
below
|
|
.RE
|
|
.PP
|
|
A word is defined as in the specification of
|
|
.QW \fB[[:<:]]\fR
|
|
and
|
|
.QW \fB[[:>:]]\fR
|
|
above. Constraint escapes are illegal within bracket expressions.
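.PP
The word-boundary and string-anchor escapes can be sketched as follows.
The sample strings are arbitrary, and the \fB\-line\fR switch of
\fBregexp\fR is used here only to turn on newline-sensitive matching:
.PP
.CS
regexp \-all \-inline {\emcat\eM} {cat concat cats cat}  ;# returns cat cat
regexp \-all \-inline {\eycat} {cat concat cats}        ;# returns cat cat
regexp \-all \-inline {\eYcat} {cat concat cats}        ;# returns cat
regexp \-line {^cat} "dog\encat"                       ;# returns 1
regexp \-line {\eAcat} "dog\encat"                      ;# returns 0
.CE
.PP
In the third call the only match is the
.QW \fBcat\fR
inside
.QW \fBconcat\fR ,
the one place where the position before it is not a word boundary.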
|
|
.SS "BACK REFERENCES"
|
|
A back reference (AREs only) matches the same string matched by the
|
|
parenthesized subexpression specified by the number, so that (e.g.)
|
|
.QW \fB([bc])\e1\fR
|
|
matches
|
|
.QW \fBbb\fR
|
|
or
|
|
.QW \fBcc\fR
|
|
but not
|
|
.QW \fBbc\fR .
|
|
The subexpression must entirely precede the back reference in the RE.
|
|
Subexpressions are numbered in the order of their leading parentheses.
|
|
Non-capturing parentheses do not define subexpressions.
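.PP
A small sketch of back references, including the numbering rule for
non-capturing parentheses; the inputs are illustrative only:
.PP
.CS
regexp {([bc])\e1} bb                    ;# returns 1
regexp {([bc])\e1} bc                    ;# returns 0
regexp \-inline {(\ew)\e1} hello           ;# returns ll l
regexp \-inline {(?:a|b)(c)\e1} xbccx     ;# returns bcc c
.CE
.PP
In the last call \fB\e1\fR refers to \fB(c)\fR, because the
non-capturing group in front of it is not numbered.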
|
|
.PP
|
|
There is an inherent historical ambiguity between octal
|
|
character-entry escapes and back references, which is resolved by
|
|
heuristics, as hinted at above. A leading zero always indicates an
|
|
octal escape. A single non-zero digit, not followed by another digit,
|
|
is always taken as a back reference. A multi-digit sequence not
|
|
starting with a zero is taken as a back reference if it comes after a
|
|
suitable subexpression (i.e. the number is in the legal range for a
|
|
back reference), and otherwise is taken as octal.
|
|
.SH "METASYNTAX"
|
|
In addition to the main syntax described above, there are some special
|
|
forms and miscellaneous syntactic facilities available.
|
|
.PP
|
|
Normally the flavor of RE being used is specified by
|
|
application-dependent means. However, this can be overridden by a
|
|
\fIdirector\fR. If an RE of any flavor begins with
|
|
.QW \fB***:\fR ,
|
|
the rest of the RE is an ARE. If an RE of any flavor begins with
|
|
.QW \fB***=\fR ,
|
|
the rest of the RE is taken to be a literal string, with
|
|
all characters considered ordinary characters.
|
|
.PP
|
|
An ARE may begin with \fIembedded options\fR: a sequence
|
|
\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
|
|
characters) specifies options affecting the rest of the RE. These
|
|
supplement, and can override, any options specified by the
|
|
application. The available option letters are:
|
|
.RS 2
|
|
.TP 3
|
|
\fBb\fR
|
|
.
|
|
rest of RE is a BRE
|
|
.TP 3
|
|
\fBc\fR
|
|
.
|
|
case-sensitive matching (usual default)
|
|
.TP 3
|
|
\fBe\fR
|
|
.
|
|
rest of RE is an ERE
|
|
.TP 3
|
|
\fBi\fR
|
|
.
|
|
case-insensitive matching (see \fBMATCHING\fR, below)
|
|
.TP 3
|
|
\fBm\fR
|
|
.
|
|
historical synonym for \fBn\fR
|
|
.TP 3
|
|
\fBn\fR
|
|
.
|
|
newline-sensitive matching (see \fBMATCHING\fR, below)
|
|
.TP 3
|
|
\fBp\fR
|
|
.
|
|
partial newline-sensitive matching (see \fBMATCHING\fR, below)
|
|
.TP 3
|
|
\fBq\fR
|
|
.
|
|
rest of RE is a literal
|
|
.PQ quoted
|
|
string, all ordinary characters
|
|
.TP 3
|
|
\fBs\fR
|
|
.
|
|
non-newline-sensitive matching (usual default)
|
|
.TP 3
|
|
\fBt\fR
|
|
.
|
|
tight syntax (usual default; see below)
|
|
.TP 3
|
|
\fBw\fR
|
|
.
|
|
inverse partial newline-sensitive
|
|
.PQ weird
|
|
matching (see \fBMATCHING\fR, below)
|
|
.TP 3
|
|
\fBx\fR
|
|
.
|
|
expanded syntax (see below)
|
|
.RE
|
|
.PP
|
|
Embedded options take effect at the \fB)\fR terminating the sequence.
|
|
They are available only at the start of an ARE, and may not be used
|
|
later within it.
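.PP
As a hedged sketch, the
.QW \fB***=\fR
director and an embedded option might be used like this (the patterns
and inputs are invented for illustration):
.PP
.CS
regexp {***=a.c} a.c       ;# returns 1: the rest is a literal string
regexp {***=a.c} abc       ;# returns 0: the dot is not a wildcard here
regexp {hello} HELLO       ;# returns 0: case-sensitive by default
regexp {(?i)hello} HELLO   ;# returns 1: embedded i option
.CE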
|
|
.PP
|
|
In addition to the usual (\fItight\fR) RE syntax, in which all
|
|
characters are significant, there is an \fIexpanded\fR syntax,
|
|
available in all flavors of RE with the \fB\-expanded\fR switch, or in
|
|
AREs with the embedded x option. In the expanded syntax, white-space
|
|
characters are ignored and all characters between a \fB#\fR and the
|
|
following newline (or the end of the RE) are ignored, permitting
|
|
paragraphing and commenting a complex RE. There are three exceptions
|
|
to that basic rule:
|
|
.IP \(bu 3
|
|
a white-space character or
|
|
.QW \fB#\fR
|
|
preceded by
|
|
.QW \fB\e\fR
|
|
is retained
|
|
.IP \(bu 3
|
|
white space or
|
|
.QW \fB#\fR
|
|
within a bracket expression is retained
|
|
.IP \(bu 3
|
|
white space and comments are illegal within multi-character symbols
|
|
like the ARE
|
|
.QW \fB(?:\fR
|
|
or the BRE
|
|
.QW \fB\e(\fR
|
|
.PP
|
|
Expanded-syntax white-space characters are blank, tab, newline, and
|
|
any character that belongs to the \fIspace\fR character class.
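.PP
The expanded syntax can be sketched with the embedded \fBx\fR option.
The variable names (\fBre\fR, \fBwhole\fR, \fByear\fR, \fBmonth\fR) and
the date-like input are assumptions made for this example:
.PP
.CS
set re {(?x)
    (\ed{4})   # year
    \-         # separator
    (\ed{2})   # month
}
regexp $re 2024\-05 whole year month
# year is "2024", month is "05"
.CE
.PP
The white space and the \fB#\fR comments are dropped before matching, so
the RE behaves like \fB(\ed{4})\-(\ed{2})\fR.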
|
|
.PP
|
|
Finally, in an ARE, outside bracket expressions, the sequence
|
|
.QW \fB(?#\fIttt\fB)\fR
|
|
(where \fIttt\fR is any text not containing a
|
|
.QW \fB)\fR )
|
|
is a comment, completely ignored. Again, this is not
|
|
allowed between the characters of multi-character symbols like
|
|
.QW \fB(?:\fR .
|
|
Such comments are more a historical artifact than a useful facility,
|
|
and their use is deprecated; use the expanded syntax instead.
|
|
.PP
|
|
\fINone\fR of these metasyntax extensions is available if the
|
|
application (or an initial
|
|
.QW \fB***=\fR
|
|
director) has specified that the
|
|
user's input be treated as a literal string rather than as an RE.
|
|
.SH MATCHING
|
|
In the event that an RE could match more than one substring of a given
|
|
string, the RE matches the one starting earliest in the string. If
|
|
the RE could match more than one substring starting at that point, its
|
|
choice is determined by its \fIpreference\fR: either the longest
|
|
substring, or the shortest.
|
|
.PP
|
|
Most atoms, and all constraints, have no preference. A parenthesized
|
|
RE has the same preference (possibly none) as the RE. A quantified
|
|
atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
|
|
preference (possibly none) as the atom itself. A quantified atom with
|
|
other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
|
|
\fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
|
|
with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
|
|
with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
|
|
the same preference as the first quantified atom in it which has a
|
|
preference. An RE consisting of two or more branches connected by the
|
|
\fB|\fR operator prefers longest match.
|
|
.PP
|
|
Subject to the constraints imposed by the rules for matching the whole
|
|
RE, subexpressions also match the longest or shortest possible
|
|
substrings, based on their preferences, with subexpressions starting
|
|
earlier in the RE taking priority over ones starting later. Note that
|
|
outer subexpressions thus take priority over their component
|
|
subexpressions.
|
|
.PP
|
|
The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
|
|
force longest and shortest preference, respectively, on a
|
|
subexpression or a whole RE.
|
|
.RS
|
|
.PP
|
|
\fBNOTE:\fR This means that you can usually make an RE non-greedy overall by
putting \fB{1,1}?\fR after one of the first non-constraint atoms or
parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the
placement of this non-greediness override on a suitable range of input texts
when you are writing an RE at this level of complexity.
|
|
.PP
|
|
For example, this regular expression is non-greedy, and will match the
|
|
shortest substring possible given that
|
|
.QW \fBabc\fR
|
|
will be matched as early as possible (the quantifier does not change that):
|
|
.PP
|
|
.CS
|
|
ab{1,1}?c.*x.*cba
|
|
.CE
|
|
.PP
|
|
The atom
|
|
.QW \fBa\fR
|
|
has no greediness preference; we explicitly give one for
|
|
.QW \fBb\fR ,
|
|
and the remaining quantifiers are overridden to be non-greedy by the preceding
|
|
non-greedy quantifier.
|
|
.RE
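.PP
The preference rules can be sketched with a few \fBregexp\fR calls; the
inputs are arbitrary. The last call shows a branch taking its preference
from its first quantified atom that has one:
.PP
.CS
regexp \-inline {a.*c} abcabc       ;# returns abcabc
regexp \-inline {a.*?c} abcabc      ;# returns abc
regexp \-inline {a.*?b.*} a1b2b3    ;# returns a1b
.CE
.PP
In the third call the leading \fB.*?\fR makes the whole RE prefer the
shortest match, so the trailing \fB.*\fR matches nothing.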
|
|
.PP
|
|
Match lengths are measured in characters, not collating elements. An
|
|
empty string is considered longer than no match at all. For example,
.QW \fBbb*\fR
matches the three middle characters of
.QW \fBabbbc\fR ;
.QW \fB(week|wee)(night|knights)\fR
matches all ten characters of
.QW \fBweeknights\fR ;
when
.QW \fB(.*).*\fR
is matched against
.QW \fBabc\fR ,
the parenthesized subexpression matches all three characters; and when
.QW \fB(a*)*\fR
is matched against
.QW \fBbc\fR ,
both the whole RE and the parenthesized subexpression match an empty string.
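.PP
Three of these cases can be checked directly with \fBregexp \-inline\fR,
which returns the overall match followed by each reported subexpression
(a sketch using the same strings as the text):
.PP
.CS
regexp \-inline {bb*} abbbc
# returns bbb
regexp \-inline {(week|wee)(night|knights)} weeknights
# returns weeknights wee knights
regexp \-inline {(.*).*} abc
# returns abc abc
.CE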
|
|
.PP
|
|
If case-independent matching is specified, the effect is much as if
|
|
all case distinctions had vanished from the alphabet. When an
|
|
alphabetic that exists in multiple cases appears as an ordinary
|
|
character outside a bracket expression, it is effectively transformed
|
|
into a bracket expression containing both cases, so that \fBx\fR
|
|
becomes
|
|
.QW \fB[xX]\fR .
|
|
When it appears inside a bracket expression,
|
|
all case counterparts of it are added to the bracket expression, so
|
|
that
|
|
.QW \fB[x]\fR
|
|
becomes
|
|
.QW \fB[xX]\fR
|
|
and
|
|
.QW \fB[^x]\fR
|
|
becomes
|
|
.QW \fB[^xX]\fR .
|
|
.PP
|
|
If newline-sensitive matching is specified, \fB.\fR and bracket
|
|
expressions using \fB^\fR will never match the newline character (so
|
|
that matches will never cross newlines unless the RE explicitly
|
|
arranges it) and \fB^\fR and \fB$\fR will match the empty string after
|
|
and before a newline respectively, in addition to matching at
|
|
beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
|
|
continue to match beginning or end of string \fIonly\fR.
|
|
.PP
|
|
If partial newline-sensitive matching is specified, this affects
|
|
\fB.\fR and bracket expressions as with newline-sensitive matching,
|
|
but not \fB^\fR and \fB$\fR.
|
|
.PP
|
|
If inverse partial newline-sensitive matching is specified, this
|
|
affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
|
|
not \fB.\fR and bracket expressions. This is not very useful but is
|
|
provided for symmetry.
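.PP
The newline-sensitivity modes can be sketched as below. The variable
names \fBtext\fR and \fBm\fR are hypothetical, and the \fB\-line\fR
switch belongs to the \fBregexp\fR command rather than to the RE syntax:
.PP
.CS
set text "one\entwo"
regexp \-inline {^two$} $text          ;# returns an empty list (no match)
regexp \-inline \-line {^two$} $text    ;# returns two
regexp \-inline {(?n)^two$} $text      ;# returns two
regexp \-inline {(?w)^two$} $text      ;# returns two
regexp {(?p)e.*} $text m              ;# m is "e": . stops at the newline
.CE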
|
|
.SH "LIMITS AND COMPATIBILITY"
|
|
No particular limit is imposed on the length of REs. Programs
|
|
intended to be highly portable should not employ REs longer than 256
|
|
bytes, as a POSIX-compliant implementation can refuse to accept such
|
|
REs.
|
|
.PP
|
|
The only feature of AREs that is actually incompatible with POSIX EREs
|
|
is that \fB\e\fR does not lose its special significance inside bracket
|
|
expressions. All other ARE features use syntax which is illegal or
|
|
has undefined or unspecified effects in POSIX EREs; the \fB***\fR
|
|
syntax of directors likewise is outside the POSIX syntax for both BREs
|
|
and EREs.
|
|
.PP
|
|
Many of the ARE extensions are borrowed from Perl, but some have been
|
|
changed to clean them up, and a few Perl extensions are not present.
|
|
Incompatibilities of note include
|
|
.QW \fB\eb\fR ,
|
|
.QW \fB\eB\fR ,
|
|
the lack of special treatment for a trailing newline, the addition of
|
|
complemented bracket expressions to the things affected by
|
|
newline-sensitive matching, the restrictions on parentheses and back
|
|
references in lookahead constraints, and the longest/shortest-match
|
|
(rather than first-match) matching semantics.
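.PP
Two of these incompatibilities can be sketched as follows, for readers
coming from Perl-style REs; the inputs are illustrative:
.PP
.CS
regexp {\ebcat\eb} cat       ;# returns 0: \eb is a backspace here
regexp {\eycat\ey} cat       ;# returns 1: \ey is the word-boundary escape
regexp \-inline {a|ab} ab    ;# returns ab: longest match, not first branch
.CE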
|
|
.PP
|
|
The matching rules for REs containing both normal and non-greedy
|
|
quantifiers have changed since early beta-test versions of this
|
|
package. (The new rules are much simpler and cleaner, but do not work
|
|
as hard at guessing the user's real intentions.)
|
|
.PP
|
|
Henry Spencer's original 1986 \fIregexp\fR package, still in
|
|
widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
|
|
early version of today's EREs. There are four incompatibilities
|
|
between \fIregexp\fR's near-EREs
|
|
.PQ RREs " for short"
|
|
and AREs. In roughly increasing order of significance:
|
|
.IP \(bu 3
|
|
In AREs, \fB\e\fR followed by an alphanumeric character is either an
|
|
escape or an error, while in RREs, it was just another way of writing
|
|
the alphanumeric. This should not be a problem because there was no
|
|
reason to write such a sequence in RREs.
|
|
.IP \(bu 3
|
|
\fB{\fR followed by a digit in an ARE is the beginning of a bound,
|
|
while in RREs, \fB{\fR was always an ordinary character. Such
|
|
sequences should be rare, and will often result in an error because
|
|
following characters will not look like a valid bound.
|
|
.IP \(bu 3
|
|
In AREs, \fB\e\fR remains a special character within
|
|
.QW \fB[\|]\fR ,
|
|
so a literal \fB\e\fR within \fB[\|]\fR must be written
|
|
.QW \fB\e\e\fR .
|
|
\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
|
|
but only truly paranoid programmers routinely doubled the backslash.
|
|
.IP \(bu 3
|
|
AREs report the longest/shortest match for the RE, rather than the
|
|
first found in a specified search order. This may affect some RREs
|
|
which were written in the expectation that the first match would be
|
|
reported. (The careful crafting of RREs to optimize the search order
|
|
for fast matching is obsolete (AREs examine all possible matches in
|
|
parallel, and their performance is largely insensitive to their
|
|
complexity) but cases where the search order was exploited to
|
|
deliberately find a match which was \fInot\fR the longest/shortest
|
|
will need rewriting.)
|
|
.SH "BASIC REGULAR EXPRESSIONS"
|
|
BREs differ from EREs in several respects.
|
|
.QW \fB|\fR ,
|
|
.QW \fB+\fR ,
|
|
and \fB?\fR are ordinary characters and there is no equivalent for their
|
|
functionality. The delimiters for bounds are \fB\e{\fR and
|
|
.QW \fB\e}\fR ,
|
|
with \fB{\fR and \fB}\fR by themselves ordinary characters. The
|
|
parentheses for nested subexpressions are \fB\e(\fR and
|
|
.QW \fB\e)\fR ,
|
|
with \fB(\fR and \fB)\fR by themselves ordinary
|
|
characters. \fB^\fR is an ordinary character except at the beginning
|
|
of the RE or the beginning of a parenthesized subexpression, \fB$\fR
|
|
is an ordinary character except at the end of the RE or the end of a
|
|
parenthesized subexpression, and \fB*\fR is an ordinary character if
|
|
it appears at the beginning of the RE or the beginning of a
|
|
parenthesized subexpression (after a possible leading
|
|
.QW \fB^\fR ).
|
|
Finally, single-digit back references are available, and \fB\e<\fR and
|
|
\fB\e>\fR are synonyms for
|
|
.QW \fB[[:<:]]\fR
|
|
and
|
|
.QW \fB[[:>:]]\fR
|
|
respectively; no other escapes are available.
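.PP
A hedged sketch of BRE behavior, assuming the embedded \fBb\fR option
from \fBMETASYNTAX\fR is used to select the BRE flavor from within the
\fBregexp\fR command:
.PP
.CS
regexp \-inline {(?b)a\e(b\e)\e1} abb   ;# returns abb b
regexp {(?b)a+} aa                   ;# returns 0: + is an ordinary character
regexp {(?b)a+} a+                   ;# returns 1
.CE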
|
|
.SH "SEE ALSO"
|
|
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
|
|
.SH KEYWORDS
|
|
match, regular expression, string
|
|
.\" Local Variables:
|
|
.\" mode: nroff
|
|
.\" End:
|