August 27, 2014
Hot Topics:
RSS RSS feed Download our iPhone app

Regular Expressions Primer

  • March 24, 2004
  • By Brad Lhotsky
  • Send Email »
  • More Articles »

Greedy Matching

Greedy matching seems to be the de-facto standard for most regex tasks. These quantifiers are dubbed "greedy" because they are very ambitious and attempt to match as many times as they can while still allowing the rest of the regular expression to match. All of the quantifiers presented thus far are greedy.

my $string='1 2 3 4 5 6 7 8 9 10 12 12 13 14 15';
$string =~ /^.*(\d+).*$/;
my $number = $1;

What will $number contain? Without understanding greedy and non-greedy, and how the regex engine goes about satisfying the pattern, it's very difficult for a beginner to answer correctly. The answer is the "5" highlighted in red below:

'1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'

This example is confusing to beginners. Inspecting how the match is made should clear things up and hopefully take a giant leap towards understanding regex in general. Greedy quantifiers match the maximum number of instances they can on their first pass. If the rest of the regex fails as a result of the greedy quantifier, it will give up its bounty, one character at a time, until the entire regex can match.

  1. '^'—Start at the "beginning of string" anchor.
  2. '.*'—Match any character zero or more times. The regex engine matches this greedily, until it fails at the end of string.
  3. '(\d+)'—Fails. There is no string left to match, so the .* match gives up the character '5' to the regex.
  4. '(\d+)'—Succeeds matching '5' and storing it in $1.
  5. '.*'—Succeeds because it matches any character zero or more times.
  6. '$'—Succeeds; anchor position is currently end of string

Introducing Non-Greedy Quantifiers

Non-Greedy quantifiers are the lazy quantifiers. Where their greedy counterparts match the maximum number of instances before allowing the regex engine to continue, non-greedy quantifiers surrender control as soon as the minimum number of instances is satisfied. Non-greedy quantifiers will match more than the minimum only when it's necessary to have the entire regex succeed. The non-greedy quantifiers are the same as the greedy quantifiers immediately followed by a '?':

*? Matches 0 or more consecutive instances of the previous group or character, non-greedily
+? Matches 1 or more consecutive instances of the previous group or character, non-greedily
{a,z}?    Range quantifier, specify a minimum (a) and a maximum (z) number of consecutive instances to match, non-greedily

The previous example can also demonstrate the laziness of the non-greedy match.

my $string='1 2 3 4 5 6 7 8 9 10 12 12 13 14 15';
$string =~ /^.*(\d+).*$/;
my $number = $1;

What will $number contain this time? This time the answer is the "1" highlighted in red below:

'1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'
  1. '^'—Start at the "beginning of string" anchor.
  2. '.*?'—Match any character zero or more times. The regex engine lazily accepts no characters for this non-greedy match. It will allow characters to match this pattern only if those characters must be matched to satisfy the entire regex.
  3. '(\d+)'—Succeeds matching '1' and storing it in $1.
  4. '.*?'—Succeeds because it matches any character zero or more times.
  5. '$'—Fails, position is not currently end of string
  6. '(.*?)'—In an attempt to match the entire regex, .*? receives the rest of the string one character at a time until the position is the end of the string.
  7. '$'—Position is now the end of string, the entire regex matches!

Notes on Greed

There are two dangers when dealing with quantifiers. Using the wrong quantifier could match the wrong instance of text by being over or under zealous in its attempt to position itself to satisfy the entire regex. It could also lead to huge performance hits because the regex engine back tracks across itself and the target text several times to find the first arrangement that satisfies the regex entirely.

The use of ".*" is common with beginners, and is misused or entirely unnecessary in most cases. The use of anchors (^,$) should allow the programmer enough freedom to maneuver to the regex engine to the data they are seeking.

Headaches caused by matching the wrong data need to be addressed by breaking down the regular expression as done in this article. Remember, the regex engine wants to match the regular expression at any cost, so long as it's the cheapest route. The greedy matches will always get the maximum instances, still allowing the regex to match. The non-greedy matches will always get the minimum number of instances, still allowing the entire regex to match.

Revisiting the IP Address Match

Armed with knowledge of the regex engine's inner workings, dissection of the earlier IP Address match can reveal its shortcomings:

/(\d{1,3}\.?){4}/

This regex will match an IP Address such as "192.168.0.1". It will, however, also match strings such as "1234567.234". Anxiously deciding to optimize the regex to use quantifiers, a hapless programmer noted that the pattern "digit digit digit period" repeated three times in an IP Address and then was followed by a "digit digit digit." The "digit digit digit" was repeated four times in the string! The "period" happens "0 or 1" times, depending on which octet the cursor is at the end of. So, the attempt to shorten the regex inadvertently led to it not being as specific as intended. Had the programmer stopped at "digit digit digit period," they would've had a workable solution. Again, this example doesn't account for the fact that IP addresses max value per octet is 255, but demonstrates a powerful regex.

/(\d{1,3}\.){3}\d{1,3}/

Literally, "one to three digits followed by a period, three times, followed by one to three digits." This regex would be good enough to pick out IP Address-like strings from text; then validation could be done one octet at a time.

Closing Notes on Regular Expressions

Writing the "right" Regular Expression is often very difficult to do because machines and humans see text patterns completely differently. Humans are keen to pick up on spatial patterns, while machines are left to process the text one character at a time. Learning to read regular expressions exactly as the engine does can help write more efficient, more effective regular expressions.

In most circumstances, it is possible for the programmer to have access to the data they are attempting to match or extract from using regular expressions. A programmer should build their regular expressions by utilizing a relatively complete data set as the template. Do not attempt to write a regular expression to solve every problem. Specialize regular expressions as much as needed to get them to work right.

The regex engine in Perl is surrounded by millions of useful tools: Perl. Do not forget that. Most regex beginners are content to solve everything in a regex. Questions such as "How do I loop in a regex in perl?" are not as uncommon as one might hope. Regular Expressions match text, if looping is necessary, use foreach, for, while, or until. Remember, Perl is a huge tool chest with a million tools inside. There's no need to solve everything with a big hammer (or regex), even if it might be more fun initially.

About the Author

Brad Lhotsky is a Software Developer whose focus is primarily web based application in Perl and PHP. He has over 5 years experience developing systems for end users and system and network administrators. Brad has been active on Perl beginner's mailing lists and forums for years, attempting to give something back to the community.

Brad currently has one module released on the CPAN.



Page 3 of 3



Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 


Sitemap | Contact Us

Rocket Fuel