Regular Expressions -- String Matching
One of the most powerful features of Perl
is its regular expression (RE)
engine.
A regular expression is contained in slashes, and matching occurs with
the =~ operator. The following expression
is true if the string the appears in variable $sentence.
$sentence =~ /the/
The RE is case sensitive, so if
$sentence = "The quick brown fox";
then the above match will be false.
The operator !~ is
used for spotting a non-match. In the above example
$sentence !~ /the/
is true because the string the does not appear in $sentence.
It is often useful to assign the string to be matched to the special
variable $_, which the match operator uses when no =~ is given:
$_ = "The quick brown fox";
if (/The/)
{
print "Found match\n";
}
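The operators above can be tried together in a short script. This is only a sketch; the variable names and the sample sentence are illustrative.

```perl
use strict;
use warnings;

my $sentence = "The quick brown fox";

# =~ is true when the pattern occurs somewhere in the string.
my $has_quick = ($sentence =~ /quick/) ? 1 : 0;

# The match is case sensitive: lowercase "the" does not occur.
my $has_the = ($sentence =~ /the/) ? 1 : 0;

# !~ is the negation of =~.
my $lacks_the = ($sentence !~ /the/) ? 1 : 0;

# With no =~, the pattern is matched against $_.
my $default_match = 0;
for ($sentence) {                  # aliases $_ to $sentence inside the loop
    $default_match = 1 if /The/;
}

print "quick: $has_quick, the: $has_the, no 'the': $lacks_the, \$_: $default_match\n";
```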
Special Characters in Perl REs
. # Any single character except a newline
^ # The beginning of the line or string
$ # The end of the line or string
* # Zero or more of the preceding character or group
+ # One or more of the preceding character or group
? # Zero or one of the preceding character or group
Square brackets are used to match any one of the characters inside
them. Inside square brackets a - indicates "between" and
a ^ at the beginning means "not":
[qjk] # Either q or j or k
[^qjk] # Neither q nor j nor k
[a-z] # Anything from a to z inclusive
[^a-z] # Any character that is not a lower case letter
[a-zA-Z] # Any letter
[a-z]+ # Any non-zero sequence of lower case letters
A vertical bar | represents an "or" and parentheses
(...)
can be used to group things together:
jelly|cream # Either jelly or cream
(eg|le)gs # Either eggs or legs
(da)+ # Either da or dada or dadada or...
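A small sketch of alternation and grouping. The word lists are made up, and grep (a Perl built-in not covered in this section) is used only to filter them; the patterns are anchored with ^ and $ so that the whole word has to match.

```perl
use strict;
use warnings;

# Alternation: keep the strings that contain "jelly" or "cream".
my @dessert = grep { /jelly|cream/ } ("ice cream", "jelly roll", "fruit salad");

# Grouping: (eg|le) followed by "gs".
my @gs = grep { /^(eg|le)gs$/ } qw(eggs legs pegs begs);

# A quantifier applies to the whole group: one or more repetitions of "da".
my @da = grep { /^(da)+$/ } qw(da dada dad daada dadada);

print join(", ", @dessert), "\n";
print "@gs\n";
print "@da\n";
```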
More special characters:
\n # A newline
\t # A tab
\w # Any alphanumeric (word) character.
# The same as [a-zA-Z0-9_]
\W # Any non-word character.
# The same as [^a-zA-Z0-9_]
\d # Any digit. The same as [0-9]
\D # Any non-digit. The same as [^0-9]
\s # Any whitespace character: space,
# tab, newline, etc
\S # Any non-whitespace character
\b # A word boundary, outside [] only
\B # No word boundary
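The escapes above can be combined as in the following sketch. The sample line and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = "Order 66 shipped\tto Bob_42 on 2024-01-15";

# \d+ : the first run of digits in the line.
my ($first_number) = $line =~ /(\d+)/;

# \s matches the tab as well as the space; \w+ covers letters, digits and _.
my ($recipient) = $line =~ /shipped\sto\s(\w+)/;

# \b only matches at a word boundary, so "hip" inside "shipped" is rejected.
my $has_word_on  = ($line =~ /\bon\b/)  ? 1 : 0;
my $has_word_hip = ($line =~ /\bhip\b/) ? 1 : 0;

# \D : any non-digit separator between the date fields.
my ($year, $month, $day) = $line =~ /(\d\d\d\d)\D(\d\d)\D(\d\d)/;

print "$first_number $recipient $has_word_on $has_word_hip $year/$month/$day\n";
```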
Characters such as $, |, [, ), \, / and so on have special meanings in
regular expressions. To match one of them literally,
precede it with a backslash. So:
\| # Vertical bar
\[ # An open square bracket
\) # A closing parenthesis
\* # An asterisk
\^ # A caret symbol
\/ # A slash
\\ # A backslash
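A brief sketch of escaped metacharacters in use. The sample string is hypothetical and is written in single quotes so that Perl does not interpolate anything in it.

```perl
use strict;
use warnings;

my $text = 'total[0] = (2+2)*3 | path /tmp/x\y';

my $has_sum       = ($text =~ /\(2\+2\)/)   ? 1 : 0;  # literal "(2+2)"
my $has_star      = ($text =~ /\*3/)        ? 1 : 0;  # literal "*3"
my $has_index     = ($text =~ /total\[0\]/) ? 1 : 0;  # literal "total[0]"
my $has_bar       = ($text =~ / \| /)       ? 1 : 0;  # literal " | "
my $has_slash     = ($text =~ /\/tmp\//)    ? 1 : 0;  # literal "/tmp/"
my $has_backslash = ($text =~ /x\\y/)       ? 1 : 0;  # literal "x\y"

# Without the backslashes, (2+2) means "one or more 2s, then a 2", as a group.
my $unescaped = ($text =~ /(2+2)/) ? 1 : 0;

print "$has_sum $has_star $has_index $has_bar $has_slash $has_backslash $unescaped\n";
```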
Examples for REs
"Hello World" =~ /World/; # matches
"Hello World" =~ m!World!; # matches, delimited by '!'
"Hello World" =~ m{World}; # matches, note the matching '{}'
"/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
# '/' is an ordinary char here
"Hello World" =~ /world/; # doesn't match, case sensitive
"Hello World" =~ /o W/; # matches, ' ' is an ordinary char
"Hello World" =~ /World /; # doesn't match, no ' ' at end
"Hello World" =~ /o/; # matches 'o' in 'Hello'
"That hat is red" =~ /hat/; # matches 'hat' in 'That'
"2+2=4" =~ /2+2/; # doesn't match,
# + is a metacharacter
"2+2=4" =~ /2\+2/; # matches,
# \+ is treated like an ordinary +
"/usr/bin/perl" =~
/\/usr\/bin\/perl/; # matches
"1000\t2000" =~ m(0\t2); # matches
"cat" =~ /\143\x61\x74/; # matches,
# but a weird way to spell cat
"housekeeper" =~ /keeper/; # matches
"housekeeper" =~ /^keeper/; # doesn't match
"housekeeper" =~ /keeper$/; # matches
"housekeeper\n" =~ /keeper$/; # matches
"housekeeper" =~ /^housekeeper$/; # matches
"abc" =~ /[cab]/; # matches
"cats and dogs" =~ /cat|dog|bird/; # matches "cat"
"cats and dogs" =~ /dog|cat|bird/; # matches "cat"
"cats" =~ /c|ca|cat|cats/; # matches "c"
"cats" =~ /cats|cat|ca|c/; # matches "cats"
/(a|b)b/; # matches 'ab' or 'bb'
/(^a|b)c/; # matches 'ac' at start of string
# or 'bc' anywhere
/house(cat|)/; # matches either 'housecat' or 'house'
/house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
# 'house'. Note groups can be nested.
"20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d',
# because '20\d\d' can't match
/[a-z]+\s+\d*/; # match a lowercase word, at least some space,
# and any number of digits
/(\w+)\s+\1/; # match doubled words of arbitrary length
$year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more
# than 4 digits
$year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates
Extracting matches
The grouping metacharacters "()" allow the extraction of the
parts of a string that matched.
For each grouping with "()", the part that
matched inside goes into the special variables $1, $2, etc. Example:
# extract hours, minutes, seconds
$time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
$hours = $1;
$minutes = $2;
$seconds = $3;
In list context, a match "/regex/" with groupings will return the list
of matched values "($1,$2,...)".
So we could rewrite the above example as follows:
($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
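Because $1, $2, ... keep their old values when a match fails, it is safer to test the result of the match before using the captures. A minimal sketch, assuming a hypothetical helper named parse_time that is not part of Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (hours, minutes, seconds) on success
# and an empty list when the string is not in hh:mm:ss format.
sub parse_time {
    my ($time) = @_;
    # In list context a failed match returns an empty list.
    return $time =~ /^(\d\d):(\d\d):(\d\d)$/;
}

my @good = parse_time("12:34:56");
my @bad  = parse_time("12:34");

print "h=$good[0] m=$good[1] s=$good[2]\n" if @good;
print "no match\n" unless @bad;
```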
Example: a simple "grep" that reads from standard input (or from files named on the command line):
while( <> )
{
if( /(fred|Fred|john|John)/ )
{
print "Found match: $1\n in line: $_\n";
}
}
The global modifier "//g" allows the matching operator to match within
a string as many times as possible.
Perl keeps track of position in the string as it goes along.
You can get or set the position with the "pos()" function:
$text = "cat dog house"; # 3 words
while ($text =~ /(\w+)/g)
{
print "Word is $1, ends at position ", pos $text, "\n";
}
Output:
Word is cat, ends at position 3
Word is dog, ends at position 7
Word is house, ends at position 13
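The loop above uses //g in scalar context. In list context the same match returns all matches at once, and pos() can be assigned to in order to choose where the next scalar match starts. A short sketch with illustrative variable names:

```perl
use strict;
use warnings;

my $text = "cat dog house";

# List context: //g returns every match in one go.
my @words = $text =~ /(\w+)/g;

# Scalar context: each iteration resumes where the previous match ended.
my @ends;
while ($text =~ /\w+/g) {
    push @ends, pos($text);
}

# Assigning to pos() moves the starting point; offset 4 skips "cat ".
pos($text) = 4;
my $next = '';
$next = $1 if $text =~ /(\w+)/g;

print "@words | @ends | $next\n";
```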
Backreferences
Associated with the matching variables $1, $2, ... are the
backreferences "\1", "\2", ...
Backreferences are matching variables that can be used inside a regex:
/(\w\w\w)\s\1/; # find sequences like 'the the' in string
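A sketch that generalizes the pattern above to words of any length and collects every doubled word with //g. The sample text is invented; \b keeps the pattern from matching inside longer words.

```perl
use strict;
use warnings;

my $text = "Paris in the the spring, and it is is warm";

# A whole word, whitespace, then the same word again (\1).
my @doubled;
while ($text =~ /\b(\w+)\s+\1\b/g) {
    push @doubled, $1;
}

print "Doubled: @doubled\n";
```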
Matching Principles for Multiple Matching Alternatives
When a regexp can match a string in several different ways, we can
use the following principles to predict which way the regexp will match:
- Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.
- Principle 1: In an alternation "a|b|c...", the leftmost alternative
that allows a match for the whole regexp will be the one
used.
- Principle 2: The maximal matching quantifiers "?", "*", "+" and
"{n,m}" will in general match as much of the string as possible
while still allowing the whole regexp to match.
- Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the
string as possible while still allowing the whole regexp to
match. The next leftmost greedy quantifier, if any, will try to
match as much of the string remaining available to it as possible,
while still allowing the whole regexp to match. And so on,
until all the regexp elements are satisfied.
As we have seen above, Principle 0 overrides the others - the regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.
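Each principle can be observed by capturing what was matched. The strings below are small illustrations chosen for this sketch, with one match per principle in the order listed.

```perl
use strict;
use warnings;

# Earliest position wins: "cat" starts at offset 0, before any "dog".
my ($earliest) = "cats and dogs" =~ /(dog|cat)/;

# Leftmost alternative that lets the whole regexp match wins.
my ($leftmost) = "cats" =~ /(c|ca|cat|cats)/;

# A greedy quantifier takes as much as it can.
my ($greedy) = "aaaa" =~ /(a+)/;

# The leftmost greedy quantifier is served first: .+ grabs as much as
# possible while (e|r) and (.*) can still match what is left.
my @parts = "The programming republic of Perl" =~ /^(.+)(e|r)(.*)$/;

print "$earliest, $leftmost, $greedy\n";
print join(" | ", @parts), "\n";
```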
Greedy or Non-Greedy
By default a match is greedy (Principles 2 and 3).
Sometimes greed is not good, because we would like quantifiers to
match a minimal piece of string, rather than a maximal piece.
For this purpose we can use the minimal match or non-greedy
quantifiers "??", "*?", "+?", and "{n,m}?". These are the usual
quantifiers with a "?" appended to them. Examples:
"a??" = match 'a' 0 or 1 times.
Try 0 first, then 1.
"a*?" = match 'a' 0 or more times,
i.e., any number of times,
but as few times as possible
"a+?" = match 'a' 1 or more times,
i.e., at least once,
but as few times as possible
"a{n,m}?" = match at least "n" times,
not more than "m" times,
as few times as possible
"a{n,}?" = match at least "n" times,
but as few times as possible
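The difference is easiest to see by capturing the same text with both kinds of quantifier. A small sketch; the sample strings are made up.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the last '>' in the string.
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
my ($lazy) = $html =~ /<(.+?)>/;

# The same idea with a counted quantifier.
my ($max) = "aaaaa" =~ /(a{2,4})/;
my ($min) = "aaaaa" =~ /(a{2,4}?)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "max: $max, min: $min\n";
```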
Search and Replace Functions
- [ EXPR =~ ] [ m ] /PATTERN/
[ g ] [ i ] [ m ] [ o ] [ s ] [ x ]
- Searches EXPR (default: $_) for a pattern. If you
prepend an m you can use almost any pair of delimiters
instead of the slashes. If used in array context,
an array is returned consisting of the subexpressions
matched by the parentheses in the pattern, i.e., ($1,$2,$3,...).
Optional modifiers: g matches as many times as
possible; i searches in a case-insensitive manner; o
interpolates variables only once. m treats the string
as multiple lines; s treats the string as a single line; x
allows for regular expression extensions.
If PATTERN is empty, the most recent pattern from a
previous match or replacement is used.
With g the match can be used as an iterator in scalar
context.
- ?PATTERN?
- This is just like the /PATTERN/ search, except that it
matches only once between calls to the reset operator.
In Perl 5.22 and later it must be written m?PATTERN?.
- [ $VAR =~ ] s/PATTERN/REPLACEMENT/
[ e ] [ g ] [ i ] [ m ] [ o ] [ s ] [ x ]
- Searches a string for a pattern, and if found, replaces
that pattern with the replacement text. It returns the
number of substitutions made, if any; if no substitutions are made, it
returns false.
Optional modifiers: g replaces all occurrences of the
pattern; e evaluates the replacement string as a Perl
expression; for any other modifiers, see /PATTERN/
matching. Almost any delimiter may replace the
slashes; if single quotes are used, no interpretation is
done on the strings between the delimiters, otherwise
the strings are interpolated as if inside double quotes.
If bracketing delimiters are used, PATTERN and
REPLACEMENT may have their own delimiters, e.g.,
s(foo)[bar]. If PATTERN is empty, the most recent pattern
from a previous match or replacement is used.
- [ $VAR =~ ] tr/SEARCHLIST/REPLACEMENTLIST/
[ c ] [ d ] [ s ]
- Translates all occurrences of the characters found in
the search list into the corresponding characters in
the replacement list. It returns the number of characters
replaced. y may be used instead of tr.
Optional modifiers: c complements the SEARCHLIST;
d deletes all characters found in SEARCHLIST that do
not have a corresponding character in
REPLACEMENTLIST; s squeezes all
sequences of characters that are translated into the same target character
into one occurrence of this character.
- pos SCALAR
- Returns the position where the last m//g search left off
for SCALAR. May be assigned to.
- study [ $VAR ]
- Study the scalar variable $VAR in anticipation of
performing many pattern matches on its contents
before the variable is next modified.
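The i, m, s and x modifiers listed above can be sketched as follows. The two-line sample string is illustrative.

```perl
use strict;
use warnings;

my $text = "First line\nsecond LINE\n";

# i: ignore case.
my $ci = ($text =~ /first/i) ? 1 : 0;

# m: ^ and $ match at the start and end of every line in the string.
my @starts = $text =~ /^(\w+)/mg;

# s: . also matches a newline; by default it does not.
my $dot_default = ($text =~ /line.second/)  ? 1 : 0;
my $dot_single  = ($text =~ /line.second/s) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored.
my ($word) = $text =~ / ^ (\w+) \s+ line   # first word, then "line"
                      /xi;

print "$ci | @starts | $dot_default $dot_single | $word\n";
```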
Examples for Search and Replace Operations Using RE
$x = "Time to feed the cat!";
$x =~ s/cat/hacker/; # $x contains
# "Time to feed the hacker!"
$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/; # strip single quotes,
# $y contains "quoted words"
$x = "I batted 4 for 4";
$x =~ s/4/four/; # $x contains "I batted four for 4"
$x = "I batted 4 for 4";
$x =~ s/4/four/g; # $x contains "I batted four for four"
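The return values and the e, c and d modifiers described in the reference list can be tried out in the same way. A sketch with made-up data:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# s///g returns the number of substitutions made.
my $count = ($x =~ s/4/four/g);

# e: the replacement is evaluated as a Perl expression (double each number).
my $prices = "apples 3, pears 12";
$prices =~ s/(\d+)/$1 * 2/ge;

# tr returns the number of characters translated; mapping a character
# to itself counts it without changing the string.
my $dna = "GATTACA";
my $a_count = ($dna =~ tr/A/A/);

# Copy, then translate the copy.
(my $rna = $dna) =~ tr/T/U/;

# c complements the search list and d deletes: remove every non-digit.
(my $digits = "555-0100 ext 7") =~ tr/0-9//cd;

print "$count: $x\n$prices\n$a_count $rna $digits\n";
```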
Examples for the split Operator Using RE
"split /regex/, string" splits "string" into a list of substrings and
returns that list. The regex determines the character sequence that
"string" is split with respect to.
# Split a string into words.
$x = "Calvin and Hobbes";
@word = split /\s+/, $x;
# $word[0] = 'Calvin'
# $word[1] = 'and'
# $word[2] = 'Hobbes'
# To extract a comma-delimited list of numbers.
$x = "1.618,2.718, 3.142";
@const = split /,\s*/, $x;
# $const[0] = '1.618'
# $const[1] = '2.718'
# $const[2] = '3.142'
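split also accepts an optional third argument, LIMIT, that caps the number of fields returned, and join puts a list back together. A brief sketch; the input strings are invented for illustration.

```perl
use strict;
use warnings;

# LIMIT of 2: split at the first '=' only, so the value may contain '='.
my ($key, $value) = split /\s*=\s*/, "opts = a=1,b=2", 2;

# Fields separated by commas or semicolons, with optional whitespace.
my @fields = split /\s*[,;]\s*/, "red, green;blue ;  cyan";

# join is the counterpart of split.
my $joined = join "|", @fields;

print "$key -> $value\n";
print "$joined\n";
```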
See also:
man perlrequick - Perl regular expressions quick start
man perlretut - Perl regular expressions tutorial