regexp(5) regexp(5)
NAME
regexp - regular expression and pattern matching notation definitions
DESCRIPTION
A Regular Expression is a mechanism supported by many utilities for
locating and manipulating patterns in text. Pattern Matching Notation
is used by shells and other utilities for file name expansion. This
manual entry defines two forms of regular expressions, Basic Regular
Expressions and Extended Regular Expressions, and one form of
Pattern Matching Notation.
BASIC REGULAR EXPRESSIONS
Basic regular expression (RE) notation and construction rules apply to
utilities defined as using basic REs. Any exceptions to the following
rules are noted in the descriptions of the specific utilities that use
REs.
REs Matching a Single Character
The following REs match a single character or a single collating
element:
Ordinary Characters
An ordinary character is an RE that matches itself. An ordinary
character is any character in the supported character set except
newline and the regular expression special characters listed in
Special Characters below. An ordinary character preceded by a
backslash (\) is treated as the ordinary character itself, except when
the character is (, ), {, or }, or the digits 1 through 9 (see REs
Matching Multiple Characters). Matching is based on the bit pattern
used for encoding the character, not on the graphic representation of
the character.
Special Characters
A regular expression special character preceded by a backslash is a
regular expression that matches the special character itself. When
not preceded by a backslash, such characters have special meaning in
the specification of REs. Regular expression special characters and
the contexts in which they have special meaning are:
. [ \ The period, left square bracket, and backslash are
special except when used in a bracket expression
(see RE Bracket Expression).
* The asterisk is special except when used in a
bracket expression, as the first character of a
regular expression, or as the first character
following the character pair \( (see REs Matching
Multiple Characters).
^ The circumflex is special when used as the first
character of an entire RE (see Expression
Hewlett-Packard Company - 1 - HP-UX 11i Version 2: August 2003
Anchoring) or as the first character of a bracket
expression.
$ The dollar sign is special when used as the last
character of an entire RE (see Expression
Anchoring).
delimiter Any character used to bound (i.e., delimit) an
entire RE is special for that RE.
Period
A period (.), when used outside of a bracket expression, is an RE that
matches any printable or nonprintable character except newline.
RE Bracket Expression
A bracket expression enclosed in square brackets ([ ]) is an RE that
matches a single collating element contained in the nonempty set of
collating elements represented by the bracket expression.
The following rules apply to bracket expressions:
bracket expression
A bracket expression is either a matching list
expression or a non-matching list expression, and
consists of one or more expressions in any order.
Expressions can be: collating elements, collating
symbols, noncollating characters, equivalence
classes, range expressions, or character classes.
The right bracket (]) loses its special meaning
and represents itself in a bracket expression if
it occurs first in the list (after an initial ^,
if any). Otherwise, it terminates the bracket
expression (unless it is the ending right bracket
for a valid collating symbol, equivalence class,
or character class, or it is the collating element
within a collating symbol or equivalence class
expression). The special characters
. * [ \
(period, asterisk, left bracket, and backslash)
lose their special meaning within a bracket
expression.
The character sequences:
[. [= [:
(left-bracket followed by a period, equal-sign or
colon) are special inside a bracket expression and
are used to delimit collating symbols, equivalence
class expressions and character class expressions.
These symbols must be followed by a valid
expression and the matching terminating .], =], or
:].
matching list A matching list expression specifies a list that
matches any one of the characters represented in
the list. The first character in the list cannot
be the circumflex. For example, [abc] is an RE
that matches any of a, b, or c.
non-matching list
A non-matching list expression begins with a
circumflex (^), and specifies a list that matches
any character or collating element except newline
and the characters represented in the list. For
example, [^abc] is an RE that matches any
character except newline or a, b, or c. The
circumflex has this special meaning only when it
occurs first in the list, immediately following
the left square bracket.
collating element
A collating element is a sequence of one or more
characters that represents a single element in the
collating sequence as identified via the most
current setting of the locale variable LC_COLLATE
(see setlocale(3C)).
collating symbol
A collating symbol is a collating element enclosed
within bracket-period ([. .]) delimiters.
Multicharacter collating elements must be
represented as collating symbols to distinguish
them from single-character collating elements.
For example, if the string ch is a valid collating
element, then [[.ch.]] is treated as an element
matching the same string of characters, while [ch]
is treated as a simple list of the characters c
and h. If the string within the bracket-period
delimiters is not a valid collating element in the
current collating sequence definition, the symbol
is treated as an invalid expression.
noncollating character
A noncollating character is a character that is
ignored for collating purposes. By definition,
such characters cannot participate in equivalence
classes or range expressions.
equivalence class
An equivalence class expression represents the set
of collating elements belonging to an equivalence
class. It is expressed by enclosing any one of
the collating elements in the equivalence class
within bracket-equal ([= =]) delimiters. For
example, if a and A belong to the same equivalence
class, then [[=a=]b] and [[=A=]b] are each
equivalent to [aAb].
range expression
A range expression represents the set of collating
elements that fall between two elements in the
current collation sequence as defined via the most
current setting of the locale variable LC_COLLATE
(see setlocale(3C)). It is expressed as the
starting point and the ending point separated by a
hyphen (-).
The starting range point and the ending range
point must be a collating element, collating
symbol, or equivalence class expression. An
equivalence class expression used as an end point
of a range expression is interpreted such that all
collating elements within the equivalence class
are included in the range. For example, if the
collating order is A, a, B, b, C, c, ch, D, and d
and the characters A and a belong to the same
equivalence class, then the expression [[=a=]-D]
is treated as [AaBbCc[.ch.]D].
Both starting and ending range points must be
valid collating elements, collating symbols, or
equivalence class expressions, and the ending
range point must collate equal to or higher than
the starting range point; otherwise the expression
is invalid. For example, with the above collating
order and assuming that E is a noncollating
character, then both the expressions [[=A=]-E] and
[d-a] are invalid.
An ending range point can also be the starting
range point in a subsequent range expression.
Each such range expression is evaluated
separately. For example, the bracket expression
[a-m-o] is treated as [a-mm-o].
The hyphen character is treated as itself if it
occurs first (after an initial ^, if any) or last
in the list, or as the rightmost symbol in a range
expression. As examples, the expressions [-ac]
and [ac-] are equivalent and match any of the
characters a, c, or -; the expressions [^-ac] and
[^ac-] are equivalent and match any characters
except newline, a, c, or -; the expression [%--]
matches any of the characters in the defined
collating sequence between % and - inclusive; the
expression [--@] matches any of the characters in
the defined collating sequence between - and @
inclusive; and the expression [a--@] is invalid,
assuming - precedes a in the collating sequence.
If a bracket expression must specify both - and ],
the ] must be placed first (after the ^, if any)
and the - last within the bracket expression.
character class
A character class expression represents the set of
characters belonging to a character class, as
defined via the most current setting of the locale
variable LC_CTYPE. It is expressed as a character
class name enclosed within bracket-colon ([: :])
delimiters.
Standard character class expressions supported in
all locales are:
[:alpha:] letters
[:upper:] upper-case letters
[:lower:] lower-case letters
[:digit:] decimal digits
[:xdigit:] hexadecimal digits
[:alnum:] letters or decimal digits
[:space:] characters producing whitespace
in displayed text
[:print:] printing characters
[:punct:] punctuation characters
[:graph:] characters with a visible
representation
[:cntrl:] control characters
[:blank:] blank characters
For example, if the locale variable LC_CTYPE is
set to C, the expression [[:upper:]] is equivalent
to [A-Z]. Similarly, the expression [[:digit:]] is
the same as [0-9].
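The placement rules for the circumflex, right bracket, and hyphen can be
tried out with a short sketch. It uses Python's re module only as an
assumed stand-in: Python is not the RE engine described in this entry, it
does not accept collating symbols, equivalence classes, or [:class:]
names, and its non-matching lists also match newline. The helper name
matches_one is hypothetical.

```python
import re

def matches_one(bracket, ch):
    """Return True if the bracket expression matches the single character ch."""
    return re.fullmatch(bracket, ch) is not None

# Matching list: any one of a, b, or c.
assert matches_one("[abc]", "b")
assert not matches_one("[abc]", "d")

# Non-matching list: any character other than a, b, or c.
assert matches_one("[^abc]", "d")
assert not matches_one("[^abc]", "a")

# A right bracket placed first (after an initial ^, if any) stands for itself.
assert matches_one("[]a]", "]")
assert matches_one("[^]a]", "b")
assert not matches_one("[^]a]", "]")

# A hyphen placed first or last stands for itself.
assert matches_one("[-ac]", "-") and matches_one("[ac-]", "-")
assert not matches_one("[-ac]", "b")

# To list both ] and -, put ] first and - last.
assert matches_one("[]a-]", "]") and matches_one("[]a-]", "-")
```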
REs Matching Multiple Characters
The following rules may be used to construct REs matching multiple
characters from REs matching a single character:
RERE The concatenation of REs is an RE that matches the
first encountered concatenation of the strings
matched by each component of the RE. For example,
the RE bc matches the second and third characters
of the string abcdefabcdef.
RE* An RE matching a single character followed by an
asterisk (*) is an RE that matches zero or more
occurrences of the RE preceding the asterisk. The
first encountered string that permits a match is
chosen, and the matched string will encompass the
maximum number of characters permitted by the RE.
For example, in the string abbbcdeabbbbbbcde, both
the RE b*c and the RE bbb*c are matched by the
substring bbbc in the second through fifth
positions. An asterisk as the first character of
an RE loses this special meaning and is treated as
itself.
\(RE\) A subexpression can be defined within an RE by
enclosing it between the character pairs \( and
\). Such a subexpression matches whatever it
would have matched without the \( and \).
Subexpressions can be arbitrarily nested. An
asterisk immediately following the \( loses its
special meaning and is treated as itself. An
asterisk immediately following the \) is treated
as an invalid character.
\n The expression \n matches the same string of
characters as was matched by a subexpression
enclosed between \( and \) preceding the \n. The
character n must be a digit from 1 through 9,
specifying the n-th subexpression (the one that
begins with the n-th \( and ends with the
corresponding paired \)). For example, the
expression ^\(.*\)\1$ matches a line consisting of
two adjacent appearances of the same string.
If the \n is followed by an asterisk, it matches
zero or more occurrences of the subexpression
referred to. For example, the expression
\(ab\(cd\)ef\)Z\2*Z\1 matches the string
abcdefZcdcdZabcdef.
RE\{m,n\} An RE matching a single character followed by
\{m\}, \{m,\}, or \{m,n\} is an RE that matches
repeated occurrences of the RE. The values of m
and n must be decimal integers in the range 0
through 255, with m specifying the exact or
minimum number of occurrences and n specifying the
maximum number of occurrences. \{m\} matches
exactly m occurrences of the preceding RE, \{m,\}
matches at least m occurrences, and \{m,n\}
matches any number of occurrences between m and n,
inclusive.
The first encountered string that matches the
expression is chosen; it will contain as many
occurrences of the RE as possible. For example,
in the string abbbbbbbc the RE b\{3\} is matched
by characters two through four, the RE b\{3,\} is
matched by characters two through eight, and the
RE b\{3,5\}c is matched by characters four through
nine.
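The examples in this subsection can be checked with the following sketch.
It assumes Python's re module as a stand-in, which is not a basic RE
engine: grouping and intervals are written without backslashes, so
\(RE\) becomes (RE) and \{m,n\} becomes {m,n}. The helper
match_positions is hypothetical and reports 1-based, inclusive character
positions to agree with the wording used above.

```python
import re

def match_positions(pattern, text):
    """Return (first, last) 1-based positions of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start() + 1, m.end())

# RE*: the first string that permits a match, taken as long as possible.
assert match_positions(r"b*c", "abbbcdeabbbbbbcde") == (2, 5)
assert match_positions(r"bbb*c", "abbbcdeabbbbbbcde") == (2, 5)

# \n: the basic RE ^\(.*\)\1$ is written ^(.*)\1$ here.
assert re.search(r"^(.*)\1$", "abcabc").group(1) == "abc"
assert re.search(r"^(.*)\1$", "abcabd") is None

# \n followed by *: zero or more repeats of what the subexpression matched.
assert re.search(r"(ab(cd)ef)Z\2*Z\1", "abcdefZcdcdZabcdef") is not None

# Intervals: the basic RE b\{3\} is written b{3} here.
text = "abbbbbbbc"
assert match_positions(r"b{3}", text) == (2, 4)
assert match_positions(r"b{3,}", text) == (2, 8)
assert match_positions(r"b{3,5}c", text) == (4, 9)
```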
Expression Anchoring
An RE can be limited to matching strings that begin or end a line
(i.e., anchored) according to the following rules:
+ A circumflex (^) as the first character of an RE anchors the
expression to the beginning of a line; only strings starting
at the first character of a line are matched by the RE. For
example, the RE ^ab matches the string ab in the line abcdef,
but not the same string in the line cdefab.
+ A dollar sign ($) as the last character of an RE anchors the
expression to the end of a line; only strings ending at the
last character of a line are matched by the RE. For example,
the RE ab$ matches the string ab in the line cdefab, but not
the same string in the line abcdef.
+ An RE anchored by both ^ and $ matches only strings that are
lines. For example, the RE ^abcdef$ matches only lines
consisting of the string abcdef.
The use of duplication characters (+,*) following anchors is illegal.
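The anchoring rules above can be sketched as a line filter. This is an
illustration only, assuming Python's re module as a stand-in for a
utility that applies an RE to each line; the function name matching_lines
is hypothetical, and the lines are given without trailing newlines.

```python
import re

def matching_lines(pattern, lines):
    """Return the lines in which the pattern finds a match."""
    return [line for line in lines if re.search(pattern, line)]

lines = ["abcdef", "cdefab", "ab"]

# ^ anchors the expression to the beginning of the line.
assert matching_lines(r"^ab", lines) == ["abcdef", "ab"]

# $ anchors the expression to the end of the line.
assert matching_lines(r"ab$", lines) == ["cdefab", "ab"]

# Anchored at both ends, the RE matches only whole lines.
assert matching_lines(r"^abcdef$", lines) == ["abcdef"]
```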
EXTENDED REGULAR EXPRESSIONS
The extended regular expression (ERE) notation and construction rules
apply to utilities defined as using extended REs. Any exceptions to
the following rules are noted in the descriptions of the specific
utilities using EREs.
EREs Matching a Single Character
The following EREs match a single character or a single collating
element:
Ordinary Characters
An ordinary character is an ERE that matches itself. An ordinary
character is any character in the supported character set except
newline and the regular expression special characters listed in
Special Characters below. An ordinary character preceded by a
backslash (\) is treated as the ordinary character itself. Matching
is based on the bit pattern used for encoding the character, not on
the graphic representation of the character.
Special Characters
A regular expression special character preceded by a backslash is a
regular expression that matches the special character itself. When
not preceded by a backslash, such characters have special meaning in
the specification of EREs. The extended regular expression special
characters and the contexts in which they have their special meaning
are:
. [ \ ( ) * + ? $ |
The period, left square bracket, backslash, left
parenthesis, right parenthesis, asterisk, plus
sign, question mark, dollar sign, and vertical
bar are special except when used in a bracket
expression (see ERE Bracket Expression).
^ The circumflex is special except when used in a
bracket expression in a non-leading position.
delimiter Any character used to bound (i.e., delimit) an
entire ERE is special for that ERE.
Period
A period (.), when used outside of a bracket expression, is an ERE
that matches any printable or nonprintable character except newline.
ERE Bracket Expression
The syntax and rules for ERE bracket expressions are the same as for
RE bracket expressions found above.
EREs Matching Multiple Characters
The following rules may be used to construct EREs matching multiple
characters from EREs matching a single character:
EREERE A concatenation of EREs matches the first
encountered concatenation of the strings matched
by each component of the ERE. Such a
concatenation of EREs enclosed in parentheses
matches whatever the concatenation without the
parentheses matches. For example, both the ERE bc
and the ERE (bc) match the second and third
characters of the string abcdefabcdef. The
longest overall string is matched.
ERE+ The special character plus (+), when following an
ERE matching a single character, or a
concatenation of EREs enclosed in parentheses, is
an ERE that matches one or more occurrences of the
ERE preceding the plus sign. The string matched
will contain as many occurrences as possible. For
example, the ERE b+c matches the fourth through
seventh characters in the string acabbbcde.
ERE* The special character asterisk (*), when following
an ERE matching a single character, or a
concatenation of EREs enclosed in parentheses, is
an ERE that matches zero or more occurrences of
the ERE preceding the asterisk. For example, the
ERE b*c matches the first character in the string
cabbbcde. If there is any choice, the longest
left-most string that permits a match is chosen.
For example, the ERE b*cd matches the third
through seventh characters in the string
cabbbcdebbbbbbcdbc.
ERE? The special character question mark (?), when
following an ERE matching a single character, or a
concatenation of EREs enclosed in parentheses, is
an ERE that matches zero or one occurrences of the
ERE preceding the question mark. The string
matched will contain as many occurrences as
possible. For example, the ERE b?c matches the
second character in the string acabbbcde.
ERE{m,n} An interval expression that functions the same way as the
basic regular expression syntax RE\{m,n\}.
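The positions quoted in the examples above can be reproduced with a
short sketch. It assumes Python's re module as a stand-in; its syntax for
these operators resembles the ERE notation, but it is not the ERE
implementation this entry describes. The helper match_positions is
hypothetical and reports 1-based, inclusive character positions.

```python
import re

def match_positions(pattern, text):
    """Return (first, last) 1-based positions of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start() + 1, m.end())

# Concatenation, with and without enclosing parentheses.
assert match_positions(r"bc", "abcdefabcdef") == (2, 3)
assert match_positions(r"(bc)", "abcdefabcdef") == (2, 3)

# ERE+: one or more occurrences.
assert match_positions(r"b+c", "acabbbcde") == (4, 7)

# ERE*: zero or more occurrences; the left-most match is chosen.
assert match_positions(r"b*c", "cabbbcde") == (1, 1)
assert match_positions(r"b*cd", "cabbbcdebbbbbbcdbc") == (3, 7)

# ERE?: zero or one occurrence.
assert match_positions(r"b?c", "acabbbcde") == (2, 2)

# ERE{m,n}: an interval, written without backslashes.
assert match_positions(r"b{3,5}c", "abbbbbbbc") == (4, 9)
```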
Alternation
Two EREs separated by the special character vertical bar (|) match a
string that is matched by either ERE. For example, the ERE ((ab)|c)d
matches the string abd and the string cd. A vertical bar (|) may not
appear in any of the following positions:
+ First or last in an ERE.
+ Immediately following another vertical bar.
+ Immediately after a left parenthesis.
+ Immediately preceding a right parenthesis.
Precedence
The order of precedence is as follows, from high to low:
[ ] square brackets
* + ? asterisk, plus sign, question mark
^ $ anchoring
concatenation
| alternation
For example, the ERE abba|cde is interpreted as "match either abba or
cde". It does not mean "match abb followed by a or c followed in turn
by de" (because concatenation has a higher order of precedence than
alternation).
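The alternation and precedence rules can be illustrated with a sketch
that assumes Python's re module as a stand-in. Python selects the first
alternative that succeeds rather than the longest one, so the sketch
matches whole strings only, which avoids that difference. The helper
matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Alternation inside parentheses, followed by d.
assert matches_whole(r"((ab)|c)d", "abd")
assert matches_whole(r"((ab)|c)d", "cd")
assert not matches_whole(r"((ab)|c)d", "abcd")

# Concatenation binds tighter than alternation: abba OR cde.
assert matches_whole(r"abba|cde", "abba")
assert matches_whole(r"abba|cde", "cde")
assert not matches_whole(r"abba|cde", "abbade")
assert not matches_whole(r"abba|cde", "abbcde")

# The other reading has to be requested with parentheses.
assert matches_whole(r"abb(a|c)de", "abbade")
assert matches_whole(r"abb(a|c)de", "abbcde")

# Repetition binds tighter than concatenation: ab* is a followed by b*.
assert matches_whole(r"ab*", "abbb")
assert not matches_whole(r"ab*", "abab")
assert matches_whole(r"(ab)*", "abab")
```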
Expression Anchoring
An ERE can be limited to matching strings that begin or end a line
(i.e., anchored) according to the following rules:
+ A circumflex (^) matches the beginning of a line (anchors the
expression to the beginning of a line). For example, the ERE
^ab matches the string ab in the line abcdef, but not the same
string in the line cdefab.
+ A dollar sign ($) matches the end of a line (anchors the
expression to the end of a line). For example, the ERE ab$
matches the string ab in the line cdefab, but not the same
string in the line abcdef.
+ An ERE anchored by both ^ and $ matches only strings that are
lines. For example, the ERE ^abcdef$ matches only lines
consisting of the string abcdef. Only empty lines match the
ERE ^$.
The use of duplication characters (+,*) following anchors is illegal.
PATTERN MATCHING NOTATION
The following rules apply to pattern matching notation except as noted
in the descriptions of the specific utilities using pattern matching.
Patterns Matching a Single Character
The following patterns match a single character or a single collating
element:
Ordinary Characters
An ordinary character is a pattern that matches itself. An ordinary
character is any character in the supported character set except
newline and the pattern matching special characters listed in Special
Characters below. Matching is based on the bit pattern used for
encoding the character, not on the graphic representation of the
character.
Special Characters
A pattern matching special character preceded by a backslash (\) is a
pattern that matches the special character itself. When not preceded
by a backslash, such characters have special meaning in the
specification of patterns. The pattern matching special characters
and the contexts in which they have their special meaning are:
? * [ The question mark, asterisk, and left square
bracket are special except when used in a bracket
expression (see Pattern Bracket Expression).
Question Mark
A question mark (?), when used outside of a bracket expression, is a
pattern that matches any printable or nonprintable character except
newline.
Pattern Bracket Expression
The syntax and rules for pattern bracket expressions are the same as
for RE bracket expressions found above with the following exceptions:
The exclamation point character (!) replaces the circumflex
character (^) in its role in a non-matching list in the regular
expression notation.
The backslash is used as an escape character within bracket
expressions.
Patterns Matching Multiple Characters
The following rules may be used to construct patterns matching
multiple characters from patterns matching a single character:
* The asterisk (*) is a pattern that matches any
string, including the null string.
RERE The concatenation of patterns matching a single
character is a valid pattern that matches the
concatenation of the single characters or
collating elements matched by each of the
concatenated patterns. For example, the pattern
a[bc] matches the strings ab and ac.
The concatenation of one or more patterns matching
a single character with one or more asterisks is a
valid pattern. In such patterns, each asterisk
matches a string of zero or more characters, up to
the first character that matches the character
following the asterisk in the pattern.
For example, the pattern a*d matches the strings
ad, abd, and abcd; but not the string abc. When
an asterisk is the first or last character in a
pattern, it matches zero or more characters that
precede or follow the characters matched by the
remainder of the pattern. For example, the
pattern a*d* matches the strings ad, abcd, abcdef,
aaaad, and adddd; the pattern *a*d matches the
strings ad, abcd, efabcd, aaaad, and adddd.
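The pattern examples above can be checked with the following sketch. It
assumes Python's fnmatch module as a stand-in for the pattern matching
notation; that module follows the same ?, *, and [!...] conventions but
differs in details such as escaping. The helper name matching is
hypothetical.

```python
from fnmatch import fnmatchcase

def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if fnmatchcase(name, pattern)]

# Concatenation of single-character patterns.
assert matching("a[bc]", ["ab", "ac", "ad"]) == ["ab", "ac"]

# ! introduces a non-matching list in pattern bracket expressions.
assert matching("a[!bc]", ["ab", "ac", "ad"]) == ["ad"]

# ? matches exactly one character.
assert matching("a?", ["a", "ab", "abc"]) == ["ab"]

# * matches any string, including the null string.
assert matching("a*d", ["ad", "abd", "abcd", "abc"]) == ["ad", "abd", "abcd"]

names = ["ad", "abcd", "abcdef", "aaaad", "adddd"]
assert matching("a*d*", names) == names

names = ["ad", "abcd", "efabcd", "aaaad", "adddd"]
assert matching("*a*d", names + ["abc"]) == names
```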
Rule Qualification for Patterns Used for Filename Expansion
The rules described above for pattern matching are qualified by the
following rules when the pattern matching notation is used for
filename expansion by sh(1), csh(1), ksh(1), and make(1).
If a filename (including the component of a pathname that follows
the slash (/) character) begins with a period (.), the period
must be explicitly matched by using a period as the first
character of the pattern; it cannot be matched by either the
asterisk special character, the question mark special character,
or a bracket expression. This rule does not apply to make(1).
The slash character in a pathname must be explicitly matched by
using a slash in the pattern; it cannot be matched by either the
asterisk special character, the question mark special character,
or a bracket expression. For make(1) only the part of the
pathname following the last slash character can be matched by a
special character. That is, all special characters preceding the
last slash character lose their special meaning.
Specified patterns are matched against existing filenames and
pathnames, as appropriate. If the pattern matches any existing
filenames or pathnames, the pattern is replaced with those
filenames and pathnames, sorted according to the collating
sequence in effect. If the pattern does not match any existing
filenames or pathnames, the pattern string is left unchanged.
If the pattern begins with a tilde (~) character, all of the
ordinary characters preceding the first slash (or all characters
if there is no slash) are treated as a possible login name. If
the login name is null (i.e., the pattern contains only the tilde
or the tilde is immediately followed by a slash), the tilde is
replaced by a pathname of the process's home directory, followed
by a slash. Otherwise, the combination of tilde and login name
is replaced by a pathname of the home directory associated with
the login name, followed by a slash. If the system cannot
identify the login name, the result is implementation-defined.
This rule does not apply to sh(1) or make(1).
If the pattern contains a $ character, variable substitution can
take place. Environment variables can be embedded within
patterns as:
$name
or:
${name}
Braces are used to guarantee that characters following name are
not interpreted as belonging to name. Substitution occurs in the
order specified only once; that is, the resulting string is not
examined again for new names that occurred because of the
substitution.
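The leading-period, slash, and no-match rules above can be sketched with
Python's glob module, used here as an assumed stand-in for shell filename
expansion. The helper expand is hypothetical; it sorts by code point
rather than by the locale's collating sequence, and it performs no tilde
or variable substitution.

```python
import glob
import os
import tempfile

def expand(pattern, root):
    """Expand pattern relative to root; return the pattern itself if nothing matches."""
    found = glob.glob(os.path.join(glob.escape(root), pattern))
    names = sorted(os.path.relpath(path, root) for path in found)
    return names if names else [pattern]

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for name in ("a.txt", ".hidden", os.path.join("sub", "b.txt")):
        open(os.path.join(root, name), "w").close()

    # A leading period is never matched by the asterisk.
    assert expand("*", root) == ["a.txt", "sub"]
    assert expand(".*", root) == [".hidden"]

    # A slash must be matched explicitly.
    assert expand("*.txt", root) == ["a.txt"]
    nested = os.path.join("*", "*.txt")
    assert expand(nested, root) == [os.path.join("sub", "b.txt")]

    # A pattern that matches nothing is left unchanged.
    assert expand("*.none", root) == ["*.none"]
```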
Rule Qualification for Patterns Used in the case Command
The rules described above for pattern matching are qualified by the
following rule when the pattern matching notation is used in the case
command of s
.
Multiple alternative patterns in a single clause can be specified
by separating individual patterns with the vertical bar character
(|); strings matching any of the patterns separated this way will
cause the corresponding command list to be selected.
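The selection rule can be sketched as follows. This is an illustration
only, assuming Python's fnmatch module as a stand-in for the pattern
matcher; the function select_clause, the clause list, and the command
names are all hypothetical, and the naive split on | ignores quoting.

```python
from fnmatch import fnmatchcase

def select_clause(word, clauses):
    """Return the command list of the first clause with a pattern matching word."""
    for patterns, command_list in clauses:
        # Alternative patterns within one clause are separated by |.
        if any(fnmatchcase(word, p) for p in patterns.split("|")):
            return command_list
    return None

clauses = [
    ("*.c|*.h", "compile"),
    ("[Mm]akefile", "build"),
    ("*", "ignore"),
]

assert select_clause("main.c", clauses) == "compile"
assert select_clause("defs.h", clauses) == "compile"
assert select_clause("Makefile", clauses) == "build"
assert select_clause("notes.txt", clauses) == "ignore"
```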
SEE ALSO
ksh(1), sh(1), fnmatch(3C), glob(3C), regcomp(3C), setlocale(3C),
environ(5).
STANDARDS CONFORMANCE
<regexp.h>: AES, SVID2, SVID3, XPG2, XPG3, XPG4