

Regular Expressions -- String Matching

One of the most powerful features of Perl is its regular expression (RE) engine.

A regular expression is enclosed in slashes, and matching is done with the =~ operator. The following expression is true if the string "the" appears in the variable $sentence.

$sentence =~ /the/
The RE is case sensitive, so if
$sentence = "The quick brown fox";
then the above match will be false.

The operator !~ is used for spotting a non-match. In the above example

$sentence !~ /the/
is true because the string the does not appear in $sentence.
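
The two operators can be tried side by side. The following is a minimal sketch rather than part of the original text; the variable names are chosen for illustration, and the i modifier (described further down under the search and replace functions) is included to show how the case sensitivity can be switched off.

```perl
use strict;
use warnings;

my $sentence = "The quick brown fox";

# /the/ is case sensitive, so it does not match the capitalised "The".
my $has_the = ($sentence =~ /the/) ? 1 : 0;

# !~ is the negation of =~, so it is true here.
my $lacks_the = ($sentence !~ /the/) ? 1 : 0;

# With the i modifier the match ignores case.
my $has_the_i = ($sentence =~ /the/i) ? 1 : 0;

print "match: $has_the, non-match: $lacks_the, ignoring case: $has_the_i\n";
```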

It is often useful to assign the string to be matched to the special variable $_, because a match written without =~ is applied to $_ by default:

$_ = "The quick brown fox";
if (/The/)
{
	print "Found match\n";
}


Special Characters in Perl REs

.	# Any single character except a newline
^	# The beginning of the line or string
$	# The end of the line or string
*	# Zero or more of the last character
+	# One or more of the last character
?	# Zero or one of the last character

Square brackets are used to match any one of the characters inside them. Inside square brackets a - indicates "between" and a ^ at the beginning means "not":

[qjk] 	  # Either q or j or k
[^qjk] 	  # Neither q nor j nor k
[a-z] 	  # Anything from a to z inclusive
[^a-z]	  # Any character except a lower case letter
[a-zA-Z]  # Any letter
[a-z]+ 	  # Any non-zero sequence of lower case letters

A vertical bar | represents an "or" and parentheses (...) can be used to group things together:

jelly|cream	# Either jelly or cream
(eg|le)gs	# Either eggs or legs
(da)+    	# Either da or dada or dadada or...
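
To see character classes, alternation and grouping at work, the patterns above can be run against a small word list. This is only a sketch: the word list is made up for illustration, and the ^ and $ anchors are added so that each pattern has to cover the whole word.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for illustration.
my @words = ("eggs", "legs", "pegs", "dada", "d", "Jelly");

# (eg|le)gs accepts "eggs" and "legs" but not "pegs".
my @eggs_or_legs = grep { /^(eg|le)gs$/ } @words;

# (da)+ needs at least one complete "da", so "d" alone fails.
my @da_repeated = grep { /^(da)+$/ } @words;

# [a-z]+ rejects "Jelly" because of the upper case J.
my @lower_only = grep { /^[a-z]+$/ } @words;

print "eggs or legs: @eggs_or_legs\n";
print "repeated da:  @da_repeated\n";
print "lower case:   @lower_only\n";
```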

More special characters:

\n 	# A newline
\t 	# A tab
\w 	# Any alphanumeric (word) character.
 	# The same as [a-zA-Z0-9_]
\W 	# Any non-word character.
 	# The same as [^a-zA-Z0-9_]
\d 	# Any digit. The same as [0-9]
\D 	# Any non-digit. The same as [^0-9]
\s 	# Any whitespace character: space,
 	# tab, newline, etc
\S 	# Any non-whitespace character
\b 	# A word boundary, outside [] only
\B 	# No word boundary
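
The shorthand classes are easiest to understand by trying them. The sketch below uses a made-up input line; the counting idiom "= () =" and the use of the g modifier in list context are standard Perl but are not introduced in the text above.

```perl
use strict;
use warnings;

# Hypothetical input line, chosen only for illustration.
my $line = "song 66 went on\tfor 7 bars";

# \d+ matches each run of digits; with the g modifier in list context
# the match returns all of them.
my @numbers = $line =~ /\d+/g;

# Without \b, "on" is also found inside "song".
my $loose = () = $line =~ /on/g;

# With \b on both sides only the standalone word "on" counts.
my $strict = () = $line =~ /\bon\b/g;

# \s+ treats the tab like any other whitespace.
my @fields = split /\s+/, $line;

print "numbers: @numbers\n";
print "'on' anywhere: $loose, as a word: $strict\n";
print "number of fields: ", scalar(@fields), "\n";
```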

Clearly characters like $, |, [, ), \, / and so on have a special meaning in regular expressions. If you want to match one of them literally, you have to precede it with a backslash. So:

\| 	# Vertical bar
\[ 	# An open square bracket
\) 	# A closing parenthesis
\* 	# An asterisk
\^ 	# A caret symbol
\/ 	# A slash
\\ 	# A backslash
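
The effect of a missing backslash can be checked directly. The sketch below is an illustration only; it also uses the built-in quotemeta function, which is not covered in the text above, to escape a whole string at once.

```perl
use strict;
use warnings;

# An unescaped "." matches any character, so this pattern is too permissive.
my $loose_hit = ("notes_txt" =~ /^notes.txt$/) ? 1 : 0;

# Escaped, "\." matches only a literal dot.
my $strict_hit = ("notes_txt" =~ /^notes\.txt$/) ? 1 : 0;

# quotemeta puts a backslash in front of every metacharacter in a string.
my $literal = "(a|b)*";
my $escaped = quotemeta $literal;
my $found   = ("x(a|b)*y" =~ /$escaped/) ? 1 : 0;

print "unescaped dot: $loose_hit, escaped dot: $strict_hit\n";
print "escaped pattern: $escaped, found: $found\n";
```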


Examples for REs

"Hello World" =~ /World/;       # matches
"Hello World" =~ m!World!;      # matches, delimited by '!'
"Hello World" =~ m{World};      # matches, note the matching '{}'
"/usr/bin/perl" =~ m"/perl";    # matches after '/usr/bin'

"Hello World" =~ /world/;       # doesn't match, case sensitive
"Hello World" =~ /o W/;         # matches, ' ' is an ordinary char
"Hello World" =~ /World /;      # doesn't match, no ' ' at end
"Hello World" =~ /o/;           # matches 'o' in 'Hello'
"That hat is red" =~ /hat/;     # matches 'hat' in 'That'

"2+2=4" =~ /2+2/;               # doesn't match,
                                # + is a metacharacter
"2+2=4" =~ /2\+2/;              # matches,
                                # \+ is treated like an ordinary +
"/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
"1000\t2000" =~ m(0\t2);        # matches
"cat" =~ /\143\x61\x74/;        # matches,
                                # but a weird way to spell cat

"housekeeper" =~ /keeper/;          # matches
"housekeeper" =~ /^keeper/;         # doesn't match
"housekeeper" =~ /keeper$/;         # matches
"housekeeper\n" =~ /keeper$/;       # matches
"housekeeper" =~ /^housekeeper$/;   # matches

"abc" =~ /[cab]/;                   # matches
"cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
"cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
"cats" =~ /c|ca|cat|cats/;          # matches "c"
"cats" =~ /cats|cat|ca|c/;          # matches "cats"

/(a|b)b/;            # matches 'ab' or 'bb'
/(^a|b)c/;           # matches 'ac' at start of string
                     # or 'bc' anywhere
/house(cat|)/;       # matches either 'housecat' or 'house'
/house(cat(s|)|)/;   # matches either 'housecats' or 'housecat' or
                     # 'house'. Note groups can be nested.
"20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                         # because '20\d\d' can't match

/[a-z]+\s+\d*/;          # match a lowercase word, at least some space,
                         # and any number of digits
/(\w+)\s+\1/;            # match doubled words of arbitrary length
$year =~ /\d{2,4}/;      # make sure year is at least 2 but not more
                         # than 4 digits
$year =~ /\d{4}|\d{2}/;  # better match; throw out 3 digit dates


Extracting matches

The grouping metacharacters "()" also allow the extraction of the parts of a string that matched. For each grouping with "()", the part that matched inside goes into the special variables $1, $2, etc. Example:

# extract hours, minutes, seconds
$time =~ /(\d\d):(\d\d):(\d\d)/;   # match hh:mm:ss format
$hours = $1;
$minutes = $2;
$seconds = $3;

In list context, a match "/regex/" with groupings returns the list of matched values "($1,$2,...)". So we could rewrite the above example as follows:

($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
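
In practice it is worth checking that the match succeeded before using the extracted values, since a failed match in list context returns an empty list. The sketch below wraps the list-context form in a helper; the name parse_time and the added ^ and $ anchors are assumptions made for this example, not part of the original text.

```perl
use strict;
use warnings;

# parse_time is a hypothetical helper: it returns (hh, mm, ss) when the
# whole string has the form hh:mm:ss, and an empty list otherwise.
sub parse_time {
    my ($time) = @_;
    return $time =~ /^(\d\d):(\d\d):(\d\d)$/;
}

my @good = parse_time("09:05:30");
my @bad  = parse_time("9:5");

if (@good) {
    my ($hours, $minutes, $seconds) = @good;
    print "hours=$hours minutes=$minutes seconds=$seconds\n";
}
print "second string did not match\n" unless @bad;
```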

Example: a simple grep that reads from standard input, or from the files named on the command line:

while( <> )
{
	if( /(fred|Fred|john|John)/ )
	{
		print "Found match: $1\n in line: $_\n";
	}
}

The global modifier "//g" allows the matching operator to match within a string as many times as possible. Perl keeps track of position in the string as it goes along. You can get or set the position with the "pos()" function:

$text = "cat dog house";   # 3 words
while ($text =~ /(\w+)/g)
{
	print "Word is $1, ends at position ", pos $text, "\n";
}

Output:

Word is cat, ends at position 3
Word is dog, ends at position 7
Word is house, ends at position 13


Backreferences

Associated with the matching variables $1, $2, ... are the backreferences "\1", "\2", ...
Backreferences are matching variables that can be used inside a regex:

/(\w\w\w)\s\1/;   # find sequences like 'the the' in string
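
A backreference combines naturally with the g modifier to collect every doubled word in a text. The following is a sketch under the assumption that a "word" is a run of \w characters; the sample sentence is made up, and the \b anchors are added so that only whole words are compared.

```perl
use strict;
use warnings;

# Hypothetical sample text containing two doubled words.
my $text = "Paris in the the spring, and and so on";

my @doubled;
# \1 must repeat exactly what the first group (\w+) captured.
while ($text =~ /\b(\w+)\s+\1\b/g) {
    push @doubled, $1;
}

print "doubled words: @doubled\n";
```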


Matching Principles for Multiple Matching Alternatives

When a regexp can match a string in several different ways, we can use the following principles to predict which way the regexp will match:
  1. Taken as a whole, any regexp will be matched at the earliest possible position in the string.
  2. In an alternation "a|b|c...", the leftmost alternative that allows a match for the whole regexp will be the one used.
  3. The maximal matching quantifiers "?", "*", "+" and "{n,m}" will in general match as much of the string as possible while still allowing the whole regexp to match.
  4. If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.
As we have seen above, Principle 1 overrides the others: the regexp will be matched as early as possible, with the other principles determining how the regexp matches at that earliest character position.
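
The principles can be observed by capturing what was actually matched. The sketch below reuses two strings from the examples section and adds one made-up string, "abcabc", to show how two greedy quantifiers share a string; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Principle 1: "cat" occurs earlier in the string than "dog",
# so it wins even though "dog" is listed first.
my ($earliest) = "cats and dogs" =~ /(dog|cat|bird)/;

# Principle 2: at the same position the leftmost alternative that works is used.
my ($short) = "cats" =~ /(c|ca|cat|cats)/;
my ($long)  = "cats" =~ /(cats|cat|ca|c)/;

# Principles 3 and 4: the first .* takes as much as it can while still
# leaving a "b" for the second group.
my ($first, $second) = "abcabc" =~ /^(.*)(b.*)$/;

print "earliest: $earliest\n";
print "leftmost alternative: $short and $long\n";
print "greedy split: $first / $second\n";
```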


Greedy or Non-Greedy

By default a match is greedy (Principle 3).

Sometimes greed is not good, because we would like quantifiers to match a minimal piece of string, rather than a maximal piece.

For this purpose we can use the minimal match or non-greedy quantifiers "??", "*?", "+?", and "{}?". These are the usual quantifiers with a "?" appended to them. Examples:

"a??" = match 'a' 0 or 1 times.
         Try 0 first, then 1.

"a*?" = match 'a' 0 or more times, 
        i.e., any number of times, 
        but as few times as possible

"a+?" = match 'a' 1 or more times, 
        i.e., at least once, 
        but as few times as possible

"a{n,m}?" = match at least "n" times, 
            not more than "m" times,
            as few times as possible

"a{n,}?" = match at least "n" times, 
           but as few times as possible
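
The difference shows up clearly when a pattern is delimited by characters that occur more than once. The sketch below uses a made-up string with two pairs of angle brackets; it is an illustration of greedy versus non-greedy matching, not a recommended way to parse markup.

```perl
use strict;
use warnings;

# Hypothetical input with several '<' and '>' characters.
my $markup = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ runs to the last '>' in the string.
my ($greedy) = $markup =~ /<(.+)>/;

# Non-greedy: .+? stops at the first '>' that allows a match.
my ($minimal) = $markup =~ /<(.+?)>/;

print "greedy:     $greedy\n";
print "non-greedy: $minimal\n";
```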


Search and Replace Functions

[ EXPR =~ ] [ m ] /PATTERN/ [ g ] [ i ] [ m ] [ o ] [ s ] [ x ]
Searches EXPR (default: $_) for a pattern. If you prepend an m you can use almost any pair of delimiters instead of the slashes. If used in list context, a list is returned consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1,$2,$3,...).
Optional modifiers: g matches as many times as possible; i searches in a case-insensitive manner; o interpolates variables only once; m treats the string as multiple lines; s treats the string as a single line; x allows for regular expression extensions.
If PATTERN is empty, the most recent pattern from a previous match or replacement is used.
With g the match can be used as an iterator in scalar context.
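m?PATTERN?
This is just like the /PATTERN/ search, except that it matches only once between calls to the reset operator. Older versions of Perl also accepted the form ?PATTERN? without the leading m; since Perl 5.22 the m is required.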
[ $VAR =~ ] s/PATTERN/REPLACEMENT/ [ e ] [ g ] [ i ] [ m ] [ o ] [ s ] [ x ]
Searches a string for a pattern, and if found, replaces that pattern with the replacement text. It returns the number of substitutions made, if any; if no substitutions are made, it returns false.
Optional modifiers: g replaces all occurrences of the pattern; e evaluates the replacement string as a Perl expression; for any other modifiers, see /PATTERN/ matching. Almost any delimiter may replace the slashes; if single quotes are used, no interpretation is done on the strings between the delimiters, otherwise the strings are interpolated as if inside double quotes.
If bracketing delimiters are used, PATTERN and REPLACEMENT may have their own delimiters, e.g., s(foo)[bar]. If PATTERN is empty, the most recent pattern from a previous match or replacement is used.
[ $VAR =~ ] tr/SEARCHLIST/REPLACEMENTLIST/ [ c ] [ d ] [ s ]
Translates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced. y may be used instead of tr.
Optional modifiers: c complements the SEARCHLIST; d deletes all characters found in SEARCHLIST that do not have a corresponding character in REPLACEMENTLIST; s squeezes all sequences of characters that are translated into the same target character into one occurrence of this character.
pos SCALAR
Returns the position where the last m//g search left off for SCALAR. May be assigned to.
study [ $VAR ]
Studies the scalar variable $VAR (default: $_) in anticipation of performing many pattern matches on its contents before the variable is next modified.
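
The return values and modifiers described above can be tried out as follows. This is a minimal sketch with made-up strings; the last step relies on the fact that tr with an empty replacement list leaves the string unchanged, which makes it usable as a character counter.

```perl
use strict;
use warnings;

my $text = "Hello World";

# s///g returns the number of substitutions made.
my $zeros = $text;
my $count = ($zeros =~ s/o/0/g);

# The e modifier evaluates the replacement as a Perl expression.
my $order = "3 apples, 5 pears";
$order =~ s/(\d+)/$1 * 2/ge;

# tr translates character by character; here lower case to upper case.
my $upper = $text;
$upper =~ tr/a-z/A-Z/;

# With an empty replacement list tr only counts the listed characters.
my $vowels = ($text =~ tr/aeiou//);

print "$zeros ($count substitutions)\n";
print "$order\n";
print "$upper\n";
print "lower case vowels in '$text': $vowels\n";
```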


Examples for Search and Replace Operations Using RE

$x = "Time to feed the cat!";
$x =~ s/cat/hacker/;     # $x contains
                         # "Time to feed the hacker!"

$y = "'quoted words'";
$y =~ s/^'(.*)'$/$1/;    # strip single quotes,
                         # $y contains "quoted words"

$x = "I batted 4 for 4";
$x =~ s/4/four/;         # $x contains "I batted four for 4"

$x = "I batted 4 for 4";
$x =~ s/4/four/g;        # $x contains "I batted four for four"

Examples for the split Operator Using RE

"split /regex/, string" splits "string" into a list of substrings and returns that list. The regex determines the character sequence that "string" is split with respect to.

# Split a string into words.
$x = "Calvin and Hobbes";
@word = split /\s+/, $x;     # $word[0] = 'Calvin'
                             # $word[1] = 'and'
                             # $word[2] = 'Hobbes'

# To extract a comma-delimited list of numbers.
$x = "1.618,2.718,   3.142";
@const = split /,\s*/, $x;   # $const[0] = '1.618'
                             # $const[1] = '2.718'
                             # $const[2] = '3.142'


See also:
man perlrequick - Perl regular expressions quick start
man perlretut - Perl regular expressions tutorial