PERLRE(1)							     PERLRE(1)

NAME

     perlre - Perl regular expressions

DESCRIPTION

     This page describes the syntax of regular expressions in Perl.  For a
     description of how	to use regular expressions in matching operations,
     plus various examples of the same, see m// and s/// in the perlop
     manpage.

     The matching operations can have various modifiers.  The modifiers	which
     relate to the interpretation of the regular expression inside are listed
     below.  For the modifiers that alter the behaviour of the operation, see
     the sections on m// and s/// in the perlop manpage.

     i	 Do case-insensitive pattern matching.

	 If use	locale is in effect, the case map is taken from	the current
	 locale.  See the perllocale manpage.

     m	 Treat string as multiple lines.  That is, change "^" and "$" from
	 matching at only the very start or end	of the string to the start or
         end of any line anywhere within the string.

     s	 Treat string as single	line.  That is,	change "." to match any
         character whatsoever, even a newline, which it normally would not
         match.

     x   Extend your pattern's legibility by permitting whitespace and
         comments.

     These are usually written as "the /x modifier", even though the delimiter
     in	question might not actually be a slash.	 In fact, any of these
     modifiers may also	be embedded within the regular expression itself using
     the new (?...) construct.	See below.

     The /x modifier itself needs a little more	explanation.  It tells the
     regular expression	parser to ignore whitespace that is neither
     backslashed nor within a character	class.	You can	use this to break up
     your regular expression into (slightly) more readable parts.  The #
     character is also treated as a metacharacter introducing a	comment, just
     as in ordinary Perl code.  This also means that if you want real
     whitespace or # characters in the pattern, you'll have to either
     escape them or encode them using octal or hex escapes.  Taken together,
     these features go a long way towards making Perl's	regular	expressions
     more readable.  See the C comment deletion	code in	the perlop manpage.
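
     As a minimal sketch of the /x modifier, the following spreads a
     pattern over several lines with comments.  The input strings and
     variable names are assumptions made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input; the "Time: HH:MM:SS" layout is assumed for illustration.
my $line = "Time: 12:34:56";

my ($h, $m, $s);
if ($line =~ /
        Time: \s+     # literal label, then whitespace written as an escape
        (\d\d) :      # hours
        (\d\d) :      # minutes
        (\d\d)        # seconds
    /x) {
    ($h, $m, $s) = ($1, $2, $3);
    print "h=$h m=$m s=$s\n";
}

# Under x a literal hash sign must be escaped, or it would start a comment.
my ($issue) = "issue #42" =~ / \# (\d+) /x;
print "issue number: $issue\n";
```

     The whitespace between the pieces of the pattern is ignored, so the
     space after "Time:" in the input has to be matched explicitly with \s.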

     Regular Expressions

     The patterns used in pattern matching are regular expressions such	as
     those supplied in the Version 8 regexp routines.  (In fact, the routines
     are derived (distantly) from Henry Spencer's freely redistributable
     reimplementation of the V8	routines.)  See	the section on Version 8
     Regular Expressions for details.

     In particular the following metacharacters have their standard egrep-ish
     meanings:

	 \   Quote the next metacharacter
	 ^   Match the beginning of the	line
	 .   Match any character (except newline)
	 $   Match the end of the line (or before newline at the end)
	 |   Alternation
	 ()  Grouping
	 []  Character class

     By	default, the "^" character is guaranteed to match at only the
     beginning of the string, the "$" character	at only	the end	(or before the
     newline at	the end) and Perl does certain optimizations with the
     assumption	that the string	contains only one line.	 Embedded newlines
     will not be matched by "^"	or "$".	 You may, however, wish	to treat a
     string as a multi-line buffer, such that the "^" will match after any
     newline within the	string,	and "$"	will match before any newline.	At the
     cost of a little more overhead, you can do	this by	using the /m modifier
     on	the pattern match operator.  (Older programs did this by setting $*,
     but this practice is now deprecated.)

     To	facilitate multi-line substitutions, the "." character never matches a
     newline unless you	use the	/s modifier, which in effect tells Perl	to
     pretend the string	is a single line--even if it isn't.  The /s modifier
     also overrides the	setting	of $*, in case you have	some (badly behaved)
     older code	that sets it in	another	module.
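
     The effect of /m and /s can be seen in a short sketch.  The two-line
     string below is an assumed example, not taken from any real program.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;     # ("first")
# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;    # ("first", "second")

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*second)/;    # no match, so undef
my ($with_s) = $text =~ /(first.*second)/s;   # "first line\nsecond"

print "plain: @plain\n";
print "multi: @multi\n";
print defined $no_s   ? "without /s: match\n" : "without /s: no match\n";
print defined $with_s ? "with /s: match\n"    : "with /s: no match\n";
```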

     The following standard quantifiers	are recognized:

	 *	Match 0	or more	times
	 +	Match 1	or more	times
	 ?	Match 1	or 0 times
	 {n}	Match exactly n	times
	 {n,}	Match at least n times
	 {n,m}	Match at least n but not more than m times

     (If a curly bracket occurs in any other context, it is treated as a
     regular character.)  The "*" quantifier is equivalent to {0,}, the "+"
     quantifier to {1,}, and the "?" quantifier to {0,1}.  n and m are limited to
     integral values less than 65536.
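
     A small table-driven sketch of the quantifiers above.  The patterns
     are anchored with ^ and $ so that the repeat counts are exact; the
     test strings are arbitrary.

```perl
use strict;
use warnings;

# Each entry is [pattern, input string].
my @cases = (
    [ '^a*$',     ''     ],   # * allows zero occurrences
    [ '^a+$',     ''     ],   # + needs at least one
    [ '^ab?c$',   'ac'   ],   # ? makes the "b" optional
    [ '^a{3}$',   'aa'   ],   # exactly three, but only two given
    [ '^a{2,}$',  'aaaa' ],   # two or more
    [ '^a{1,2}$', 'aaa'  ],   # at most two, but three given
);

my @results = map { my ($pat, $str) = @$_; $str =~ /$pat/ ? 1 : 0 } @cases;
print "@results\n";
```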

     By	default, a quantified subpattern is "greedy", that is, it will match
     as	many times as possible (given a	particular starting location) while
     still allowing the	rest of	the pattern to match.  If you want it to match
     the minimum number	of times possible, follow the quantifier with a	"?".
     Note that the meanings don't change, just the "greediness":

	 *?	Match 0	or more	times
	 +?	Match 1	or more	times
	 ??	Match 0	or 1 time
	 {n}?	Match exactly n	times
	 {n,}?	Match at least n times
	 {n,m}?	Match at least n but not more than m times
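
     To illustrate the difference, here is a sketch that runs greedy and
     minimal forms of the same quantifier against the same (made-up)
     strings.

```perl
use strict;
use warnings;

my $html = "<b>bold</b>";

# Greedy: .+ runs to the last ">" that still lets the pattern match.
my ($greedy)  = $html =~ /<(.+)>/;     # "b>bold</b"
# Minimal: .+? stops at the first ">" that lets the pattern match.
my ($minimal) = $html =~ /<(.+?)>/;    # "b"

# The allowed range is the same; only the preference changes.
my ($most)  = "aaaa" =~ /^(a{2,3})/;   # "aaa"
my ($least) = "aaaa" =~ /^(a{2,3}?)/;  # "aa"

print "greedy=<$greedy> minimal=<$minimal> most=<$most> least=<$least>\n";
```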

     Because patterns are processed as double quoted strings, the following
     also work:

	 \t	     tab		   (HT,	TAB)
	 \n	     newline		   (LF,	NL)
	 \r	     return		   (CR)
	 \f	     form feed		   (FF)
	 \a	     alarm (bell)	   (BEL)
	 \e	     escape (think troff)  (ESC)
	 \033	     octal char	(think of a PDP-11)
	 \x1B	     hex char
	 \c[	     control char
	 \l	     lowercase next char (think	vi)
	 \u	     uppercase next char (think	vi)
	 \L	     lowercase till \E (think vi)
	 \U	     uppercase till \E (think vi)
	 \E	     end case modification (think vi)
	 \Q	     quote (disable) regexp metacharacters till	\E

     If use locale is in effect, the case map used by \l, \L, \u and \U is
     taken from	the current locale.  See the perllocale	manpage.
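
     A brief sketch of these escapes, both in an ordinary double-quoted
     string and inside a pattern.  The variable $name and its value are
     assumptions for the example, and no locale is in effect.

```perl
use strict;
use warnings;

my $name = "perl";

# Case escapes are applied while the string or pattern is interpolated.
my $mixed = "\u$name and \U$name\E";    # "Perl and PERL"
my $lower = "\LABC\E-\lXYZ";            # "abc-xYZ"

# Four spellings of the same ESC character.
my $same_esc = ("\e" eq "\033" and "\033" eq "\x1B" and "\x1B" eq "\c[") ? 1 : 0;

# The same escapes work inside a pattern; here \U...\E turns "perl" into "PERL".
my $in_pattern = "PERL\t5" =~ /^\U$name\E\t\d+$/ ? 1 : 0;

print "$mixed | $lower | $same_esc | $in_pattern\n";
```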

     In	addition, Perl defines the following:

	 \w  Match a "word" character (alphanumeric plus "_")
	 \W  Match a non-word character
	 \s  Match a whitespace	character
	 \S  Match a non-whitespace character
	 \d  Match a digit character
	 \D  Match a non-digit character

     Note that \w matches a single alphanumeric	character, not a whole word.
     To	match a	word you'd need	to say \w+.  If	use locale is in effect, the
     list of alphabetic	characters generated by	\w is taken from the current
     locale.  See the perllocale manpage. You may use \w, \W, \s, \S, \d, and
     \D	within character classes (though not as	either end of a	range).
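
     The following sketch applies these classes to an invented log line.
     It assumes plain ASCII input with no locale in effect.

```perl
use strict;
use warnings;

my $line = "user_42  logged in at 09:15";

my ($word)  = $line =~ /(\w+)/;        # "user_42": \w includes "_" and digits
my @numbers = $line =~ /(\d+)/g;       # ("42", "09", "15")
my @fields  = split /\s+/, $line;      # five whitespace-separated fields
my ($clock) = $line =~ /([\w:]+)$/;    # \w used inside a character class

# \w matches one character, so a whole word needs \w+.
my $single = "word" =~ /^\w$/ ? 1 : 0; # 0

print "$word | @numbers | ", scalar(@fields), " | $clock | $single\n";
```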

     Perl defines the following	zero-width assertions:

	 \b  Match a word boundary
	 \B  Match a non-(word boundary)
	 \A  Match at only beginning of	string
	 \Z  Match at only end of string (or before newline at the end)
	 \G  Match only	where previous m//g left off (works only with /g)

     A word boundary (\b) is defined as	a spot between two characters that has
     a \w on one side of it and	a \W on	the other side of it (in either
     order), counting the imaginary characters off the beginning and end of
     the string	as matching a \W.  (Within character classes \b	represents
     backspace rather than a word boundary.)  The \A and \Z are	just like "^"
     and "$" except that they won't match multiple times when the /m modifier
     is	used, while "^"	and "$"	will match at every internal line boundary.
     To	match the actual end of	the string, not	ignoring newline, you can use
     \Z(?!\n).	The \G assertion can be	used to	chain global matches (using
     m//g), as described in the	section	on Regexp Quote-Like Operators in the
     perlop manpage.

     It is also useful when writing lex-like scanners, when you have several
     regexps which you want to match against consecutive substrings of your
     string; see the previous reference.  The actual location where \G will
     match can also be influenced by using pos() as an lvalue.  See the pos
     entry in the perlfunc manpage.
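
     The assertions above might be exercised as in the sketch below.  The
     scanner at the end is a toy: its token names and input are invented,
     and it relies on the /c modifier (keep pos() after a failed match),
     which this page does not describe; see the perlop manpage.

```perl
use strict;
use warnings;

# \b: match "cat" only as a whole word.
my @hits = grep { /\bcat\b/ } ("cat", "concatenate", "a cat sat");

# Under /m, ^ and $ match at every line boundary; \A and \Z do not.
my $buf = "one\ntwo\n";
my $line_count   = () = $buf =~ /^\w+$/mg;      # 2
my $anchor_count = () = $buf =~ /\A\w+\Z/mg;    # 0

# \G: each match must start exactly where the previous one ended.
my $src = "12+345*6";
my @tokens;
pos($src) = 0;                                  # pos() used as an lvalue
while (pos($src) < length $src) {
    if    ($src =~ /\G(\d+)/gc)     { push @tokens, "NUM($1)" }
    elsif ($src =~ /\G([-+*\/])/gc) { push @tokens, "OP($1)"  }
    else  { die "unexpected character at offset ", pos($src), "\n" }
}
print "@tokens\n";
```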

     When the bracketing construct ( ... ) is used, \<digit> matches the
     digit'th substring.  Outside of the pattern, always use "$" instead of
     "\" in front of the digit.	 (While	the \<digit> notation can on rare
     occasion work outside the current pattern,	this should not	be relied
     upon.  See	the WARNING below.) The	scope of $<digit> (and $`, $&, and $')
     extends to	the end	of the enclosing BLOCK or eval string, or to the next
     successful	pattern	match, whichever comes first.  If you want to use
     parentheses to delimit a subpattern (e.g.,	a set of alternatives) without
     saving it as a subpattern,	follow the ( with a ?:.

     You may have as many parentheses as you wish.  If you have	more than 9
     substrings, the variables $10, $11, ... refer to the corresponding
     substring.	 Within	the pattern, \10, \11, etc. refer back to substrings
     if	there have been	at least that many left	parentheses before the
     backreference.  Otherwise (for backward compatibility) \10	is the same as
     \010, a backspace,	and \11	the same as \011, a tab.  And so on.  (\1
     through \9	are always backreferences.)
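
     A short sketch of \<digit> inside a pattern versus $<digit> outside
     it, together with the non-capturing (?: form.  The sample strings
     are made up.

```perl
use strict;
use warnings;

# \1 inside the pattern: find a doubled word.
my $text = "this is is a test";
my ($doubled) = $text =~ /\b(\w+)\s+\1\b/;      # "is"

# $1, $2, $3 outside the pattern.  The separators are grouped with (?:
# so they do not use up a numbered variable.
my ($year, $month, $day);
if ("1997-05-15" =~ /^(\d+)(?:-|\/)(\d+)(?:-|\/)(\d+)$/) {
    ($year, $month, $day) = ($1, $2, $3);
}

print "doubled=$doubled date=$year/$month/$day\n";
```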

     $+	returns	whatever the last bracket match	matched.  $& returns the
     entire matched string.  ($0 used to return	the same thing,	but not	any
     more.)  $`	returns	everything before the matched string.  $' returns
     everything	after the matched string.  Examples:

	 s/^([^	]*) *([^ ]*)/$2	$1/;	 # swap	first two words

         if (/Time: (..):(..):(..)/) {
             $hours = $1;
             $minutes = $2;
             $seconds = $3;
         }

     Once perl sees that you need one of $&, $`	or $' anywhere in the program,
     it	has to provide them on each and	every pattern match.  This can slow
     your program down.  The same mechanism that handles these provides for
     the use of	$1, $2,	etc., so you pay the same price	for each regexp	that
     contains capturing	parentheses. But if you	never use $&, etc., in your
     script, then regexps without capturing parentheses	won't be penalized. So
     avoid $&, $', and $` if you can, but if you can't (and some algorithms
     really appreciate them), once you've used them once, use them at will,
     because you've already paid the price.
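
     For illustration, this sketch captures all four match variables after
     one successful match and also runs the word-swapping substitution on
     a sample string.  The strings are arbitrary.

```perl
use strict;
use warnings;

my $str = "The quick brown fox";
my ($before, $matched, $after, $last_group);
if ($str =~ /(qu\w+)\s+(\w+)/) {
    # Copy the variables at once; the next successful match overwrites them.
    ($before, $matched, $after, $last_group) = ($`, $&, $', $+);
}
print "before=<$before> matched=<$matched> after=<$after> last=<$last_group>\n";

# Swap the first two words of a copy of the string.
(my $swapped = "hello world") =~ s/^([^ ]*) *([^ ]*)/$2 $1/;
print "$swapped\n";
```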

     You will note that	all backslashed	metacharacters in Perl are
     alphanumeric, such	as \b, \w, \n.	Unlike some other regular expression
     languages,	there are no backslashed symbols that aren't alphanumeric.  So
     anything that looks like \\, \(, \), \<, \>, \{, or \} is always
     interpreted as a literal character, not a metacharacter.  This was	once
     used in a common idiom to disable or quote	the special meanings of
     regular expression	metacharacters in a string that	you want to use	for a
     pattern. Simply quote all the non-alphanumeric characters:

	 $pattern =~ s/(\W)/\\$1/g;

     Now it is much more common to see either the quotemeta() function or the
     \Q escape sequence used to disable the metacharacters' special meanings.
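
     One way to use quotemeta() and \Q is sketched here.  The variable
     $literal and the strings matched against it are assumptions for the
     example.

```perl
use strict;
use warnings;

my $literal = "a.b*c";                  # text to be matched literally

my $quoted = quotemeta($literal);       # a\.b\*c
print "$quoted\n";

# Unquoted, "." and "*" keep their special meanings.
my $loose  = "aXbbbc" =~ /$literal/      ? 1 : 0;   # 1
# Quoted with \Q...\E, only the literal text matches.
my $strict = "aXbbbc" =~ /\Q$literal\E/  ? 1 : 0;   # 0
my $exact  = "xa.b*cx" =~ /\Q$literal\E/ ? 1 : 0;   # 1

print "$loose $strict $exact\n";
```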


     Perl defines a consistent extension syntax	for regular expressions.  The
     syntax is a pair of parentheses with a question mark as the first thing
     within the	parentheses (this was a	syntax error in	older versions of
     Perl).  The character after the question mark gives the function of the
     extension.	 Several extensions are	already	supported:

     (?#text)  A comment.  The text is ignored.	 If the	/x switch is used to
	       enable whitespace formatting, a simple #	will suffice.

     (?:regexp)
               This groups things like "()" but doesn't make backreferences
               like "()" does.  So a pattern that uses it as a split()
               separator, for instance, groups its alternatives but doesn't
               spit out extra fields.

     (?=regexp)
               A zero-width positive lookahead assertion.  For example,
	       /\w+(?=\t)/ matches a word followed by a	tab, without including
	       the tab in $&.

     (?!regexp)
               A zero-width negative lookahead assertion.  For example
	       /foo(?!bar)/ matches any	occurrence of "foo" that isn't
	       followed	by "bar".  Note	however	that lookahead and lookbehind
	       are NOT the same	thing.	You cannot use this for	lookbehind:
	       /(?!foo)bar/ will not find an occurrence	of "bar" that is
	       preceded	by something which is not "foo".  That's because the
	       (?!foo) is just saying that the next thing cannot be "foo"--and
	       it's not, it's a	"bar", so "foobar" will	match.	You would have
	       to do something like /(?!foo)...bar/ for	that.	We say "like"
	       because there's the case	of your	"bar" not having three
	       characters before it.  You could	cover that this	way:
               /(?:(?!foo)...|^.{0,2})bar/.  Sometimes it's still easier just
               to say:

                   if (/bar/ && $` !~ /foo$/)

     (?imsx)   One or more embedded pattern-match modifiers.  This is
	       particularly useful for patterns	that are specified in a	table
	       somewhere, some of which	want to	be case	sensitive, and some of
	       which don't.  The case insensitive ones need to include merely
	       (?i) at the front of the	pattern.  For example:

		   $pattern = "foobar";
		   if (	/$pattern/i )

		   # more flexible:

		   $pattern = "(?i)foobar";
		   if (	/$pattern/ )
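
     The extension constructs listed above can be tried out with a sketch
     like the following.  All strings and variable names are invented for
     illustration.

```perl
use strict;
use warnings;

# (?#text): an embedded comment is ignored.
my $commented = "ab" =~ /a(?# the letter a)b/ ? 1 : 0;        # 1

# Capturing parentheses in a split pattern return the separators as
# extra fields; the non-capturing (?: form does not.
my @with_capture = split /\s*(,|;)\s*/,   "a, b; c";          # a , b ; c
my @no_capture   = split /\s*(?:,|;)\s*/, "a, b; c";          # a b c

# (?=...): the tab must follow, but is not part of what is matched.
my ($word) = "foo\tbar" =~ /(\w+)(?=\t)/;                     # "foo"

# (?!...): "foo" not followed by "bar".
my @not_foobar = grep { /foo(?!bar)/ } qw(foobar foobaz food);

# "bar" not preceded by "foo", using $` instead of a lookahead.
my @not_after_foo = grep { /bar/ && $` !~ /foo$/ } qw(foobar bazbar bar);

# (?i): the modifier travels with the pattern string.
my $pattern = "(?i)foobar";
my $embedded = "FooBar" =~ /$pattern/ ? 1 : 0;                # 1

print scalar(@with_capture), " ", scalar(@no_capture), " $word\n";
print "@not_foobar | @not_after_foo | $commented $embedded\n";
```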

     The specific choice of question mark for this and the new minimal
     matching construct	was because 1) question	mark is	pretty rare in older
     regular expressions, and 2) whenever you see one, you should stop and
     "question"	exactly	what is	going on.  That's psychology...

     Backtracking

     A fundamental feature of regular expression matching involves the notion
     called backtracking, which is used (when needed) by all regular
     expression	quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?.

     For a regular expression to match,	the entire regular expression must
     match, not	just part of it.  So if	the beginning of a pattern containing
     a quantifier succeeds in a	way that causes	later parts in the pattern to
     fail, the matching	engine backs up	and recalculates the beginning part--
     that's why	it's called backtracking.

     Here is an	example	of backtracking:  Let's	say you	want to	find the word
     following "foo" in	the string "Food is on the foo table.":

	 $_ = "Food is on the foo table.";
	 if ( /\b(foo)\s+(\w+)/i ) {
             print "$2 follows $1.\n";
         }

     When the match runs, the first part of the	regular	expression (\b(foo))
     finds a possible match right at the beginning of the string, and loads up
     $1	with "Foo".  However, as soon as the matching engine sees that there's
     no	whitespace following the "Foo" that it had saved in $1,	it realizes
     its mistake and starts over again one character after where it had	the
     tentative match.  This time it goes all the way until the next occurrence
     of	"foo". The complete regular expression matches this time, and you get
     the expected output of "table follows foo."

     Sometimes minimal matching	can help a lot.	 Imagine you'd like to match
     everything	between	"foo" and "bar".  Initially, you write something like

	 $_ =  "The food is under the bar in the barn.";
	 if ( /foo(.*)bar/ ) {
             print "got <$1>\n";
         }

     Which perhaps unexpectedly	yields:

       got <d is under the bar in the >

     That's because .* was greedy, so you get everything between the first
     "foo" and the last	"bar".	In this	case, it's more	effective to use
     minimal matching to make sure you get the text between a "foo" and	the
     first "bar" thereafter.

	 if ( /foo(.*?)bar/ ) {	print "got <$1>\n" }
       got <d is under the >

     Here's another example: let's say you'd like to match a number at the end
     of a string, and you also want to keep the preceding part of the match.  So
     you write this:

	 $_ = "I have 2	numbers: 53147";
	 if ( /(.*)(\d*)/ ) {				     # Wrong!
             print "Beginning is <$1>, number is <$2>.\n";
         }

     That won't work at all, because .* was greedy and gobbled up the whole
     string.  As \d* can match on an empty string, the complete regular
     expression matched successfully.

	 Beginning is <I have 2	numbers: 53147>, number	is <>.

     Here are some variants, most of which don't work:

         $_ = "I have 2 numbers: 53147";
         @pats = qw{
             (.*)(\d*)       (.*)(\d+)
             (.*?)(\d*)      (.*?)(\d+)
             (.*)(\d+)$      (.*?)(\d+)$
             (.*)\b(\d+)$    (.*\D)(\d+)$
         };

         for $pat (@pats) {
             printf "%-12s ", $pat;
             if ( /$pat/ ) {
                 print "<$1> <$2>\n";
             } else {
                 print "FAIL\n";
             }
         }

     That will print out:

	 (.*)(\d*)    <I have 2	numbers: 53147>	<>
	 (.*)(\d+)    <I have 2	numbers: 5314> <7>
	 (.*?)(\d*)   <> <>
	 (.*?)(\d+)   <I have >	<2>
	 (.*)(\d+)$   <I have 2	numbers: 5314> <7>
	 (.*?)(\d+)$  <I have 2	numbers: > <53147>
	 (.*)\b(\d+)$ <I have 2	numbers: > <53147>
	 (.*\D)(\d+)$ <I have 2	numbers: > <53147>

     As	you see, this can be a bit tricky.  It's important to realize that a
     regular expression	is merely a set	of assertions that gives a definition
     of	success.  There	may be 0, 1, or	several	different ways that the
     definition	might succeed against a	particular string.  And	if there are
     multiple ways it might succeed, you need to understand backtracking to
     know which	variety	of success you will achieve.

     When using	lookahead assertions and negations, this can all get even
     trickier.  Imagine you'd like to find a sequence of non-digits not
     followed by "123".	 You might try to write	that as

	     $_	= "ABC123";
	     if	( /^\D*(?!123)/	) {			     # Wrong!
                 print "Yup, no 123 in $_\n";
             }

     But that isn't going to match; at least, not the way you're hoping.  It
     claims that there is no 123 in the string.  Here's a clearer picture of
     why that pattern matches, contrary to popular expectations:

	 $x = 'ABC123' ;
	 $y = 'ABC445' ;

	 print "1: got $1\n" if	$x =~ /^(ABC)(?!123)/ ;
	 print "2: got $1\n" if	$y =~ /^(ABC)(?!123)/ ;

	 print "3: got $1\n" if	$x =~ /^(\D*)(?!123)/ ;
	 print "4: got $1\n" if	$y =~ /^(\D*)(?!123)/ ;

     This prints

	 2: got	ABC
	 3: got	AB
	 4: got	ABC

     You might have expected test 3 to fail because it seems to be a more general
     purpose version of	test 1.	 The important difference between them is that
     test 3 contains a quantifier (\D*)	and so can use backtracking, whereas
     test 1 will not.  What's happening	is that	you've asked "Is it true that
     at	the start of $x, following 0 or	more non-digits, you have something
     that's not	123?"  If the pattern matcher had let \D* expand to "ABC",
     this would	have caused the	whole pattern to fail.	The search engine will
     initially match \D* with "ABC".  Then it will try to match (?!123) with
     "123" which, of course, fails.  But because a quantifier (\D*) has been
     used in the regular expression, the search engine can backtrack and retry
     the match differently in the hope of matching the complete regular
     expression.

     Well now, the pattern really, really wants	to succeed, so it uses the
     standard regexp back-off-and-retry	and lets \D* expand to just "AB" this
     time.  Now	there's	indeed something following "AB"	that is	not "123".
     It's in fact "C123", which	suffices.

     We	can deal with this by using both an assertion and a negation.  We'll
     say that the first	part in	$1 must	be followed by a digit,	and in fact,
     it	must also be followed by something that's not "123".  Remember that
     the lookaheads are	zero-width expressions--they only look,	but don't
     consume any of the	string in their	match.	So rewriting this way produces
     what you'd	expect;	that is, case 5	will fail, but case 6 succeeds:

	 print "5: got $1\n" if	$x =~ /^(\D*)(?=\d)(?!123)/ ;
	 print "6: got $1\n" if	$y =~ /^(\D*)(?=\d)(?!123)/ ;

	 6: got	ABC

     In	other words, the two zero-width	assertions next	to each	other work
     like they're ANDed	together, just as you'd	use any	builtin	assertions:
     /^$/ matches only if you're at the beginning of the line AND the end of
     the line simultaneously.  The deeper underlying truth is that
     juxtaposition in regular expressions always means AND, except when	you
     write an explicit OR using	the vertical bar.  /ab/	means match "a"	AND
     (then) match "b", although	the attempted matches are made at different
     positions because "a" is not a zero-width assertion, but a one-width
     assertion.

     One warning: particularly complicated regular expressions can take
     exponential time to solve due to the immense number of possible ways they
     can use backtracking to try to match.  For example, a pattern with nested
     bounded quantifiers such as /((a{0,5}){0,5}){0,5}/ can take a very long
     time to run.

     And if you	used *'s instead of limiting it	to 0 through 5 matches,	then
     it	would take literally forever--or until you ran out of stack space.

     Version 8 Regular Expressions

     In	case you're not	familiar with the "regular" Version 8 regexp routines,
     here are the pattern-matching rules not described above.

     Any single	character matches itself, unless it is a metacharacter with a
     special meaning described here or above.  You can cause characters	which
     normally function as metacharacters to be interpreted literally by
     prefixing them with a "\" (e.g., "\." matches a ".", not any character;
     "\\" matches a "\").  A series of characters matches that series of
     characters	in the target string, so the pattern blurfl would match
     "blurfl" in the target string.

     You can specify a character class,	by enclosing a list of characters in
     [], which will match any one of the characters in the list.  If the first
     character after the "[" is	"^", the class matches any character not in
     the list.	Within a list, the "-" character is used to specify a range,
     so	that a-z represents all	the characters between "a" and "z", inclusive.
     If	you want "-" itself to be a member of a	class, put it at the start or
     end of the	list, or escape	it with	a backslash.  (The following all
     specify the same class of three characters: [-az],	[az-], and [a\-z].
     All are different from [a-z], which specifies a class containing twenty-six
     characters.)
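
     The class rules above can be checked with a sketch such as this one,
     which tests a few single characters against each class.  The choice
     of test characters is arbitrary.

```perl
use strict;
use warnings;

my @chars = ('-', 'a', 'm', 'z');
my %members;

# For each class, record which of the test characters it matches.
for my $class ('[-az]', '[az-]', '[a\-z]', '[a-z]', '[^a-z]') {
    $members{$class} = join '', grep { /^$class$/ } @chars;
}

printf "%-8s %s\n", $_, $members{$_}
    for ('[-az]', '[az-]', '[a\-z]', '[a-z]', '[^a-z]');
```

     The first three classes accept "-", "a" and "z" but not "m"; the
     range [a-z] accepts "m" but not "-"; the negated class accepts only
     "-".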

     Characters	may be specified using a metacharacter syntax much like	that
     used in C:	"\n" matches a newline,	"\t" a tab, "\r" a carriage return,
     "\f" a form feed, etc.  More generally, \nnn, where nnn is	a string of
     octal digits, matches the character whose ASCII value is nnn.  Similarly,
     \xnn, where nn are	hexadecimal digits, matches the	character whose	ASCII
     value is nn. The expression \cx matches the ASCII character control-x.
     Finally, the "." metacharacter matches any	character except "\n" (unless
     you use /s).

     You can specify a series of alternatives for a pattern using "|" to
     separate them, so that fee|fie|foe	will match any of "fee", "fie",	or
     "foe" in the target string	(as would f(e|i|o)e).  Note that the first
     alternative includes everything from the last pattern delimiter ("(",
     "[", or the beginning of the pattern) up to the first "|",	and the	last
     alternative contains everything from the last "|" to the next pattern
     delimiter.	 For this reason, it's common practice to include alternatives
     in	parentheses, to	minimize confusion about where they start and end.
     Note however that "|" is interpreted as a literal within square brackets,
     so	if you write [fee|fie|foe] you're really only matching [feio|].
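
     A sketch of where alternatives start and end, using (?: to group
     them.  The word list is made up for the example.

```perl
use strict;
use warnings;

my @words = qw(fee fie foe fum);

# Both spellings accept the same three words.
my @alt      = grep { /^(?:fee|fie|foe)$/ } @words;
my @factored = grep { /^f(?:e|i|o)e$/ }     @words;

# Without grouping, ^ belongs to the first alternative only and $ to
# the last, so "fee$" alone is enough to match "coffee".
my $loose = "coffee" =~ /^fie|fee$/     ? 1 : 0;   # 1
my $tight = "coffee" =~ /^(?:fie|fee)$/ ? 1 : 0;   # 0

# Inside square brackets "|" is just another member of the class.
my $bar = "|" =~ /^[fee|fie|foe]$/ ? 1 : 0;        # 1

print "@alt | @factored | $loose $tight $bar\n";
```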

     Within a pattern, you may designate subpatterns for later reference by
     enclosing them in parentheses, and	you may	refer back to the nth
     subpattern	later in the pattern using the metacharacter \n.  Subpatterns
     are numbered based	on the left to right order of their opening
     parenthesis.  Note	that a backreference matches whatever actually matched
     the subpattern in the string being	examined, not the rules	for that
     subpattern.  Therefore, (0|0x)\d*\s\1\d* will match "0x1234 0x4321", but
     not "0x1234 01234", because subpattern 1 actually matched "0x", even
     though the rule 0|0x could potentially match the leading 0 in the second
     number.
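
     The backreference example can be run directly, as in this sketch;
     the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $re = '(0|0x)\d*\s\1\d*';

# \1 must repeat the text that group 1 matched ("0x"), not merely
# something the rule 0|0x could match.
my $same_prefix  = "0x1234 0x4321" =~ /$re/ ? 1 : 0;   # 1
my $mixed_prefix = "0x1234 01234"  =~ /$re/ ? 1 : 0;   # 0

print "$same_prefix $mixed_prefix\n";
```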

     WARNING on	\1 vs $1

     Some people get too used to writing things	like

	 $pattern =~ s/(\W)/\\\1/g;

     This is grandfathered for the RHS of a substitute to avoid	shocking the
     sed addicts, but it's a dirty habit to get	into.  That's because in
     PerlThink,	the righthand side of a	s/// is	a double-quoted	string.	 \1 in
     the usual double-quoted string means a control-A.	The customary Unix
     meaning of	\1 is kludged in for s///.  However, if	you get	into the habit
     of doing that, you get yourself into trouble if you then add an /e
     modifier:

	 s/(\d+)/ \1 + 1 /eg;

     Or if you try to do

         s/(\d+)/\1000/;

     You can't disambiguate that by saying \{1}000, whereas you	can fix	it
     with ${1}000.  Basically, the operation of	interpolation should not be
     confused with the operation of matching a backreference.  Certainly they
     mean two different	things on the left side	of the s///.
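
     A sketch of the recommended $<digit> forms on the right-hand side of
     s///.  The input strings are invented.

```perl
use strict;
use warnings;

# With /e the replacement is Perl code, so $1 is an ordinary variable.
(my $incremented = "a1 b22") =~ s/(\d+)/$1 + 1/eg;    # "a2 b23"

# ${1}000 means "group 1 followed by three zeros".
(my $padded = "id 7") =~ s/(\d+)/${1}000/;            # "id 7000"

print "$incremented | $padded\n";
```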

     SEE ALSO

     "Mastering Regular Expressions" (see the perlbook manpage) by Jeffrey
     Friedl.
