http://www.developer.com/

Regular Expressions Primer


March 24, 2004

"Regular Expression" is a fancy way to say "pattern matcher." Humans can match patterns with relative ease. A machine has a bit more difficulty deciphering patterns, especially in text. As computing became more powerful, the methods for matching text grew into more flexible dialects.

Regular expressions can be one of the toughest concepts to grasp and use effectively in any programming language. Perl is no exception: its regular expression engine is perhaps the most advanced in existence, and that power and flexibility also serve to confuse and intimidate many newcomers. It is important to understand the regex engine because it is often the cause of serious bottlenecks in programs of all shapes and sizes.

This introduction aims to cover the basics of regular expressions as they pertain specifically to host and network administration. There are a large number of resources available that present regular expressions in a much broader context. Here are a few key things to note before proceeding:

  • "regex" is short for "Regular Expression"
  • "regex engine" is the component that translates regex into patterns
  • /abcd/ isn't just 'abcd'; it's 'a' followed by 'b' followed by 'c' followed by 'd'

Meta-Characters

Meta-characters are characters that have a special meaning to the regex engine. To match one of these characters literally, prefix it with a backslash "\". The following meta-characters will be covered in depth:

Meta-Character   Type                       Description
\                                           Escapes the character that follows
.                                           Matches any single character (except a newline, by default)
a|z                                         Matches a or z
^                Anchor                     Matches the beginning of a string
$                Anchor                     Matches the end of a string
*                Quantifier                 0 or more of the previous group or character
+                Quantifier                 1 or more of the previous group or character
?                Quantifier                 0 or 1 of the previous group or character
{a,z}            Quantifier                 Matches if the previous group was found between a and z times
{a,}             Quantifier                 Matches if the previous group was found at least a times
{,z}             Quantifier                 Matches if the previous group was found no more than z times (Perl 5.34 and later; write {0,z} in older versions)
{n}              Quantifier                 Matches if the previous group was found exactly n times
[abcd]           Character Class            Matches if the character in this position is an a, b, c, or d
[^abcd]          Inverted Character Class   Matches if the character in this position is not an a, b, c, or d
(abcd)           Grouping                   Groups the matches in the parentheses into a reference

Grouping

Grouping affords three main benefits: the ability to capture the data that regex matches, the ability to link several patterns together for quantifying, and the ability to reference the data matched by that group later in the regex. Capturing and linking are the two most common uses for grouping. Back references are rarely needed to accomplish most tasks and are usually presented in a manner beyond the scope of this article.

Matching text is useful, but usually a programmer is searching for a word, IP address, or URL buried inside of some text relatively positioned near a distinguishable mark of some sort. Usually, that mark is not important to the rest of the program, but the text recovered from knowing its position is invaluable. In this case, grouping is used to capture the important data, while leaving the rest of the regex to be forgotten as soon as it's finished evaluating. Perl "remembers" the results of the groups in special variables $1, $2, $3, and so forth, based on the position of the opening parenthesis.

my $line = 'First Name:     Bob';
# The pattern must mirror the text exactly: no space before the colon.
$line =~ /^First Name:\s+(\S+)/;
my $first_name = $1;

$first_name will now contain "Bob".

By using a group to link together pieces of a pattern, it's possible to quantify that group as a whole. This is the simplest use of grouping and is incredibly powerful at the same time.

/^(ab)+$/

This regex will match 'ab', as well as 'abab', and 'abababababababababababab'.
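Both uses of grouping described above can be tried in a short script. This is only a sketch: the candidate strings are made-up sample data, and the match is tested before $1 is read, since $1 is only meaningful after a successful match.

```perl
use strict;
use warnings;

# Capturing: only trust $1 after the match has succeeded.
my $line = 'First Name:     Bob';
my $first_name;
if ($line =~ /^First Name:\s+(\S+)/) {
    $first_name = $1;
}
print "first name: $first_name\n";   # first name: Bob

# Linking: the group (ab) is quantified as a single unit.
my @candidates = ('ab', 'abab', 'aba', 'ba');
my @matched    = grep { /^(ab)+$/ } @candidates;
print "matched: @matched\n";         # matched: ab abab
```

The strings 'aba' and 'ba' are rejected because the anchors force the whole string to consist of complete 'ab' pairs.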

Character Classes

A character class is a set of characters, any one of which may appear at a given position. Assuming a line begins with a number, combining the "beginning of string" meta-character '^' with a character class that represents any numeric character makes it easy to match:

/^[0-9]/

Matches the line. It may be more desirable to match lines with one or more numeric characters at the beginning:

/^[0-9]+/

If more precision is required, it's possible to specify the number of digits that will satisfy our match by using a more specific quantifier:

/^[0-9]{1,6}/

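This will match a line that begins with one to six digits. Surprisingly, this regex will still match lines that begin with seven, eight, nine, or more digits, because nothing in the pattern says what must come after the sixth digit. To match lines that start with exactly one to six digits, we need to tell the regex engine that the next character can't be a digit.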

/^[0-9]{1,6}[^0-9]/

By using the Inverse Character Class, we can be explicit and avoid any confusion between our interpretation and the regex's matching.
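The difference between the two patterns is easy to see by running both against a few lines. The sample lines below are invented for illustration, and the variable names are arbitrary.

```perl
use strict;
use warnings;

my @lines = ('123 apples', '1234567 apples', '123456');
my (%loose, %strict);

for my $line (@lines) {
    # Loose: one to six digits at the start, anything may follow.
    $loose{$line}  = ($line =~ /^[0-9]{1,6}/)       ? 'match' : 'no match';
    # Strict: the digits must be followed by a non-digit character.
    $strict{$line} = ($line =~ /^[0-9]{1,6}[^0-9]/) ? 'match' : 'no match';
    print "'$line': loose=$loose{$line}, strict=$strict{$line}\n";
}
```

Note the last sample: a line containing nothing but '123456' fails the strict pattern, because the inverted class must still consume one character and there is none left. Whether that matters depends on the data being matched.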

The caret in the Character Class served to invert the class. Inside the Character Class Meta-characters ( [ ] ), there are three meta-characters:

^ (as the first character only) Invert the character class
\ Escape the next character
- Range modifier, translates a-d to abcd

A simple example:

/[ab]/

Matches a or b

/[^ab]/

Matches any character that's NOT a or b

Breaking down the more complex regex, the engine reads it as:

^

Anchor, tells the engine to start the match at the beginning of the string

followed by ..

[0-9]{1,6}

Match 1 to 6 characters that are any of the following: 0,1,2,3,4,5,6,7,8,9

followed by ..

[^0-9]

any character that is not one of the following: 0,1,2,3,4,5,6,7,8,9

Perl provides aliases to commonly used character classes to save typing and reduce some of the complexity of regular expression authoring.

Alias   Meaning                                               Equivalent Character Class
\d      Matches a digit                                       [0-9]
\D      Matches a non-digit                                   [^0-9]
\w      Matches a word character (alphanumeric or underscore) [a-zA-Z0-9_]
\W      Matches a non-word character                          [^a-zA-Z0-9_]
\s      Matches a whitespace character                        [ \t\r\n\f]
\S      Matches a non-whitespace character                    [^ \t\r\n\f]

By using these aliases, it's possible to rewrite the previous example as:

/^\d{1,6}\D/
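As a quick sanity check, the alias form and the bracketed form can be compared on a handful of strings. The samples are made up and are plain ASCII; on Unicode text, \d may also match digits outside 0-9, so the two forms are only assumed to be interchangeable for ASCII input.

```perl
use strict;
use warnings;

my @samples = ('42 widgets', '1234567', 'abc', '7-Eleven');
my @disagreements;

for my $sample (@samples) {
    my $with_class = ($sample =~ /^[0-9]{1,6}[^0-9]/) ? 1 : 0;
    my $with_alias = ($sample =~ /^\d{1,6}\D/)        ? 1 : 0;
    push @disagreements, $sample if $with_class != $with_alias;
    print "'$sample': class=$with_class alias=$with_alias\n";
}

print "disagreements: ", scalar(@disagreements), "\n";   # disagreements: 0
```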

Common Character Class Gotcha

In an attempt to match an IP address, which consists of four numbers ranging from 0 to 255 separated by periods, programmers often try something along the lines of the following:

/([0-255]\.){3}[0-255]/

Both at first glance and on close examination, it's difficult to understand why this regex does not match what the programmer intended. When using a character class, the key word is character. In this context, the regex engine is not concerned with numbers, but with characters. What this regex reduces to is:

/([0125]\.){3}[0125]/

This is most assuredly not what was intended. The range modifier inside a character class evaluates the expression "0-255" as "0-2" plus "5" and "5", or "0125", because duplicate entries in a character class are optimized out. The regex to properly match an IP address is very complicated and beyond the scope of this article. Assuming no one is attempting to enter IPs such as those in 888.888.888.0/24, a programmer might construct this regex:

/(\d{1,3}\.?){4}/

Stay tuned for an in-depth discussion on this regex.
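The character class gotcha described above can be confirmed by asking the engine which single digits the class [0-255] accepts, and then trying the naive pattern on two addresses. The addresses are arbitrary examples.

```perl
use strict;
use warnings;

# Which single digits does the class [0-255] actually accept?
my @accepted = grep { /^[0-255]$/ } (0 .. 9);
print "accepted: @accepted\n";   # accepted: 0 1 2 5

# The naive pattern therefore only matches one-character "octets".
my $naive = qr/([0-255]\.){3}[0-255]/;
my %outcome;
for my $candidate ('192.168.0.1', '1.2.5.0') {
    $outcome{$candidate} = ($candidate =~ $naive) ? 'match' : 'no match';
    print "$candidate: $outcome{$candidate}\n";
}
```

A perfectly ordinary address such as 192.168.0.1 is rejected, while 1.2.5.0 happens to pass only because each of its octets is a single character from the set 0, 1, 2, 5.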

Quantifiers

Quantifiers allow a programmer to specify how many times a piece of a pattern may repeat, either as a fixed count or as an open-ended range. There are four quantifiers:

? Matches 0 or 1 consecutive instances of the previous group or character
* Matches 0 or more consecutive instances of the previous group or character
+ Matches 1 or more consecutive instances of the previous group or character
{a,z}    Range quantifier, specify a minimum (a) and a maximum (z) number of consecutive instances to match

Zero or More (*)

The '*' quantifier is almost always misused. Luckily, in most cases the damage is negligible, but there can still be some unexpected results if a programmer slips. Given the lines:

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a*/

Surprisingly, all three lines will match. The regex engine will always be able to find "zero or more a's" in any line of text you send it.

While this may not seem to be incredibly useful, it actually is. There are times when a programmer needs to match some text if it's there, or just have an empty string or null if that text isn't found. This is where the "zero or more" quantifier earns its keep.
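To see what "zero or more" actually matched on each of the three lines, the pattern can be wrapped in a capturing group and the captured text printed. The extra parentheses are added here only for inspection; they are not part of the original example.

```perl
use strict;
use warnings;

my @lines = ('a dog runs', 'the dog jumps', 'aaa is a car club');
my @matched_text;

for my $line (@lines) {
    # /(a*)/ always succeeds; the capture may simply be empty.
    if ($line =~ /(a*)/) {
        push @matched_text, $1;
        printf "%-18s matched '%s'\n", $line, $1;
    }
}
```

For the second line the match succeeds at the very start of the string with an empty capture, which is exactly the "empty string if the text isn't found" behavior described above.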

One or More (+)

The '+' quantifier almost always is what a programmer means when they use the '*' quantifier. In the previous example:

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a+/

In this case, only lines 1 and 3 will successfully match, because line 2 contains no 'a' at all. Often, this is what was intended when a '*' was used.

Range Quantifier ({a,z})

The range quantifier gives the programmer finer-grained control over the number of consecutive instances to match.

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a{2,5}/

In this case, only line 3 will successfully match. The regex engine is looking for 2–5 consecutive a's.
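The answers for '+' and '{2,5}' can be checked side by side with a small loop over the same three lines. The array names are arbitrary.

```perl
use strict;
use warnings;

my @lines = ('a dog runs', 'the dog jumps', 'aaa is a car club');
my (@plus_hits, @range_hits);

for my $i (0 .. $#lines) {
    # Record 1-based line numbers to mirror the numbered list in the text.
    push @plus_hits,  $i + 1 if $lines[$i] =~ /a+/;
    push @range_hits, $i + 1 if $lines[$i] =~ /a{2,5}/;
}

print "a+ matches lines: @plus_hits\n";       # a+ matches lines: 1 3
print "a{2,5} matches lines: @range_hits\n";  # a{2,5} matches lines: 3
```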

Greedy versus Non-Greedy

Quantifiers come in two flavors, "Greedy" and "Non-Greedy." The only difference between the two is their relative ambition to match. Most regex bottlenecks are a direct result of poorly written Greedy or Non-Greedy matches.

The regex engine wants to match every pattern it's passed and it will do everything in its power to match that regex. This is why regex can slow down a program so easily. Misunderstanding the intention of the regex engine could result in large regex being evaluated millions of times over a formidable text sample.

Greedy Matching

Greedy matching seems to be the de-facto standard for most regex tasks. These quantifiers are dubbed "greedy" because they are very ambitious and attempt to match as many times as they can while still allowing the rest of the regular expression to match. All of the quantifiers presented thus far are greedy.

my $string='1 2 3 4 5 6 7 8 9 10 11 12 13 14 15';
$string =~ /^.*(\d+).*$/;
my $number = $1;

What will $number contain? Without understanding greedy and non-greedy, and how the regex engine goes about satisfying the pattern, it's very difficult for a beginner to answer correctly. The answer is "5", the very last character of the string (the second digit of "15"):

'1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'

This example is confusing to beginners. Inspecting how the match is made should clear things up and hopefully take a giant leap towards understanding regex in general. Greedy quantifiers match the maximum number of instances they can on their first pass. If the rest of the regex fails as a result of the greedy quantifier, it will give up its bounty, one character at a time, until the entire regex can match.

  1. '^'—Start at the "beginning of string" anchor.
  2. '.*'—Match any character zero or more times. The regex engine matches this greedily, until it fails at the end of string.
  3. '(\d+)'—Fails. There is no string left to match, so the .* match gives up the character '5' to the regex.
  4. '(\d+)'—Succeeds matching '5' and storing it in $1.
  5. '.*'—Succeeds because it matches any character zero or more times.
  6. '$'—Succeeds; anchor position is currently end of string
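The walk-through above can be verified by capturing all three parts of the pattern. This sketch adds parentheses around both '.*' pieces purely so that their share of the string can be printed, which shifts the digit group from $1 to the second capture.

```perl
use strict;
use warnings;

my $string = '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15';

# Greedy: the first .* grabs everything, then gives back one character.
my ($before, $number, $after) = $string =~ /^(.*)(\d+)(.*)$/;

print "before: '$before'\n";   # before: '1 2 3 4 5 6 7 8 9 10 11 12 13 14 1'
print "number: '$number'\n";   # number: '5'
print "after:  '$after'\n";    # after:  ''
```

The first '.*' keeps the '1' of '15' for itself: once a single digit has been surrendered, the rest of the regex can succeed, so the engine has no reason to give back more.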

Introducing Non-Greedy Quantifiers

Non-Greedy quantifiers are the lazy quantifiers. Where their greedy counterparts match the maximum number of instances before allowing the regex engine to continue, non-greedy quantifiers surrender control as soon as the minimum number of instances is satisfied. Non-greedy quantifiers will match more than the minimum only when it's necessary to have the entire regex succeed. The non-greedy quantifiers are the same as the greedy quantifiers immediately followed by a '?':

*? Matches 0 or more consecutive instances of the previous group or character, non-greedily
+? Matches 1 or more consecutive instances of the previous group or character, non-greedily
{a,z}?    Range quantifier, specify a minimum (a) and a maximum (z) number of consecutive instances to match, non-greedily

The previous example can also demonstrate the laziness of the non-greedy match.

my $string='1 2 3 4 5 6 7 8 9 10 11 12 13 14 15';
$string =~ /^.*?(\d+).*?$/;
my $number = $1;

What will $number contain this time? This time the answer is "1", the very first character of the string:

'1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'

The engine proceeds as follows:
  1. '^'—Start at the "beginning of string" anchor.
  2. '.*?'—Match any character zero or more times. The regex engine lazily accepts no characters for this non-greedy match. It will allow characters to match this pattern only if those characters must be matched to satisfy the entire regex.
  3. '(\d+)'—Succeeds matching '1' and storing it in $1.
  4. '.*?'—Succeeds because it matches any character zero or more times.
  5. '$'—Fails, position is not currently end of string
  6. '.*?'—In an attempt to match the entire regex, the second .*? receives the rest of the string one character at a time until the position is the end of the string.
  7. '$'—Position is now the end of string, the entire regex matches!
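The lazy version can be inspected the same way. As before, the extra parentheses around each '.*?' are an addition for illustration only, so the digit group becomes the second capture.

```perl
use strict;
use warnings;

my $string = '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15';

# Non-greedy: the first .*? takes nothing, the second is forced to the end.
my ($before, $number, $after) = $string =~ /^(.*?)(\d+)(.*?)$/;

print "before: '$before'\n";   # before: ''
print "number: '$number'\n";   # number: '1'
print "after:  '$after'\n";    # after:  ' 2 3 4 5 6 7 8 9 10 11 12 13 14 15'
```

The second '.*?' would prefer to match nothing, but the '$' anchor forces it to consume the remainder of the string.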

Notes on Greed

There are two dangers when dealing with quantifiers. Using the wrong quantifier could match the wrong instance of text by being overzealous or underzealous in its attempt to position itself to satisfy the entire regex. It could also lead to huge performance hits because the regex engine backtracks across itself and the target text several times to find the first arrangement that satisfies the regex entirely.

The use of ".*" is common with beginners, and is misused or entirely unnecessary in most cases. The use of anchors (^,$) should give the programmer enough freedom to maneuver the regex engine to the data they are seeking.

Headaches caused by matching the wrong data need to be addressed by breaking down the regular expression as done in this article. Remember, the regex engine wants to match the regular expression at any cost, so long as it's the cheapest route. The greedy matches will always get the maximum instances, still allowing the regex to match. The non-greedy matches will always get the minimum number of instances, still allowing the entire regex to match.

Revisiting the IP Address Match

Armed with knowledge of the regex engine's inner workings, dissection of the earlier IP Address match can reveal its shortcomings:

/(\d{1,3}\.?){4}/

This regex will match an IP Address such as "192.168.0.1". It will, however, also match strings such as "1234567.234". Eager to shorten the regex with quantifiers, a hapless programmer noted that the pattern "digit digit digit period" repeats three times in an IP Address and is then followed by a final "digit digit digit." Seen another way, "digit digit digit" repeats four times, and the "period" occurs "0 or 1" times depending on which octet has just ended. Making the period optional inside a single repeated group inadvertently made the regex less specific than intended. Had the programmer stopped at "digit digit digit period," they would have had a workable solution. Again, the version below does not account for the fact that the maximum value of each octet is 255, but it demonstrates a powerful regex.

/(\d{1,3}\.){3}\d{1,3}/

Literally, "one to three digits followed by a period, three times, followed by one to three digits." This regex would be good enough to pick out IP Address-like strings from text; then validation could be done one octet at a time.
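The two-stage approach described here, picking out address-like strings first and validating each octet afterwards, might look like the following sketch. The helper is_valid_ipv4 and the sample text are hypothetical, written for this illustration; the regex is the one above wrapped in one extra group so the whole candidate lands in $1.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the candidate has four octets, each 0..255.
sub is_valid_ipv4 {
    my ($candidate) = @_;
    my @octets = split /\./, $candidate;
    return 0 unless @octets == 4;
    for my $octet (@octets) {
        return 0 unless $octet =~ /^\d{1,3}$/ && $octet <= 255;
    }
    return 1;
}

my $text = 'gateway 192.168.0.1, bogus 888.888.888.0, done';

# Stage 1: collect everything that looks like an IP address.
my @candidates;
while ($text =~ /((\d{1,3}\.){3}\d{1,3})/g) {
    push @candidates, $1;
}

# Stage 2: validate one octet at a time in plain Perl.
my @valid = grep { is_valid_ipv4($_) } @candidates;

print "candidates: @candidates\n";   # candidates: 192.168.0.1 888.888.888.0
print "valid:      @valid\n";        # valid:      192.168.0.1
```

Keeping the numeric range check out of the regex keeps the pattern short and readable, at the cost of a few extra lines of ordinary code.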

Closing Notes on Regular Expressions

Writing the "right" Regular Expression is often very difficult to do because machines and humans see text patterns completely differently. Humans are keen to pick up on spatial patterns, while machines are left to process the text one character at a time. Learning to read regular expressions exactly as the engine does can help write more efficient, more effective regular expressions.

In most circumstances, it is possible for the programmer to have access to the data they are attempting to match or extract from using regular expressions. A programmer should build their regular expressions by utilizing a relatively complete data set as the template. Do not attempt to write a regular expression to solve every problem. Specialize regular expressions as much as needed to get them to work right.

The regex engine in Perl is surrounded by millions of useful tools: the rest of Perl. Do not forget that. Most regex beginners are content to solve everything in a regex. Questions such as "How do I loop in a regex in Perl?" are not as uncommon as one might hope. Regular expressions match text; if looping is necessary, use foreach, for, while, or until. Remember, Perl is a huge tool chest with a million tools inside. There's no need to solve everything with a big hammer (or regex), even if it might be more fun initially.
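As a small illustration of letting Perl do the looping, the sketch below applies one simple regex per line inside a foreach loop rather than trying to describe the whole input in a single pattern. The input lines are invented sample data in the style of the earlier "First Name" example.

```perl
use strict;
use warnings;

my @lines = (
    'First Name:     Bob',
    'Last Name:      Smith',
    'First Name:     Alice',
);

my @first_names;
foreach my $line (@lines) {
    # Skip lines the simple pattern does not describe.
    next unless $line =~ /^First Name:\s+(\S+)/;
    push @first_names, $1;
}

print "first names: @first_names\n";   # first names: Bob Alice
```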

About the Author

Brad Lhotsky is a Software Developer whose focus is primarily web-based applications in Perl and PHP. He has over 5 years of experience developing systems for end users and system and network administrators. Brad has been active on Perl beginners' mailing lists and forums for years, attempting to give something back to the community.

Brad currently has one module released on the CPAN.
