
Regular Expressions Primer

  • March 24, 2004
  • By Brad Lhotsky

"Regular Expression" is a fancy way to say "pattern matcher." Humans can match patterns with relative ease. A machine has a bit more difficulty deciphering patterns, especially in text. As computing became more powerful, the methods for matching text grew into more flexible dialects.

Regular expressions can be one of the toughest concepts to grasp and use effectively in any programming language. Perl is no exception: its regular expression engine is perhaps the most advanced regex engine in existence, and that same power and flexibility confuses and intimidates many newcomers. It is important to understand how the engine works because a poorly written regex is often the cause of serious bottlenecks in programs of all shapes and sizes.

This introduction aims to cover the basics of regular expressions as they pertain specifically to host and network administration. There are a large number of resources available that present regular expressions in a much broader context. Here are a few key things to note before proceeding:

  • "regex" is short for "Regular Expression"
  • "regex engine" is the component that interprets a regex and applies it to the text being searched
  • /abcd/ isn't just 'abcd'; it's 'a' followed by 'b' followed by 'c' followed by 'd'

Meta-Characters

Meta-characters are characters that have a special meaning to the regex engine. To match one of them literally, prefix it with a backslash "\". The following meta-characters will be covered in depth:

Meta-Character  Type                      Description
\                                         Escapes the character that follows it
.                                         Matches any single character (except a newline, by default)
a|z                                       Matches a or z
^               Anchor                    Matches the beginning of a string
$               Anchor                    Matches the end of a string
*               Quantifier                0 or more of the previous group or character
+               Quantifier                1 or more of the previous group or character
?               Quantifier                0 or 1 of the previous group or character
{a,z}           Quantifier                Matches if the previous group was found between a and z times
{a,}            Quantifier                Matches if the previous group was found at least a times
{,z}            Quantifier                Matches if the previous group was found no more than z times (Perl 5.34 and later; write {0,z} on older versions)
{n}             Quantifier                Matches if the previous group was found exactly n times
[abcd]          Character Class           Matches if the character in this position is an a, b, c, or d
[^abcd]         Inverted Character Class  Matches if the character in this position is not an a, b, c, or d
(abcd)          Grouping                  Groups the pattern in the parentheses and captures its match for later reference
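
Since this primer is aimed at host and network administration, a small sketch of why escaping matters may help. The host strings below are made up for illustration; the example only shows the difference between an unescaped '.', an escaped '\.', and alternation with '|'.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my @hosts = ('10.0.0.1', '10a0b0c1', 'www.example.com');

# Unescaped '.' matches any single character, so '10a0b0c1' slips through.
my @loose  = grep { /^10.0.0.1$/ } @hosts;

# Escaped '\.' matches only a literal dot.
my @strict = grep { /^10\.0\.0\.1$/ } @hosts;

# Alternation inside a group: the name must end in .com or .org.
my @named  = grep { /\.(com|org)$/ } @hosts;

print "loose:  @loose\n";
print "strict: @strict\n";
print "named:  @named\n";
```

Only the escaped version rejects the string that has letters where the dots should be.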

Grouping

Grouping affords three main benefits: the ability to capture the data that a regex matches, the ability to link several patterns together so they can be quantified as a unit, and the ability to reference the data matched by a group later in the same regex. Capturing and linking are the two most common uses for grouping. Back references are rarely needed for everyday tasks and are beyond the scope of this article.

Matching text is useful, but usually a programmer is searching for a word, IP address, or URL buried inside of some text relatively positioned near a distinguishable mark of some sort. Usually, that mark is not important to the rest of the program, but the text recovered from knowing its position is invaluable. In this case, grouping is used to capture the important data, while leaving the rest of the regex to be forgotten as soon as it's finished evaluating. Perl "remembers" the results of the groups in special variables $1, $2, $3, and so forth, based on the position of the opening parenthesis.

my $line = 'First Name:     Bob';
$line =~ /^First Name:\s+(\S+)/;
my $first_name = $1;

$first_name will now contain "Bob".
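
A slightly longer sketch of the same idea, with two groups and a check that the match actually succeeded before $1 and $2 are read. The line format and the variable names $host and $address are assumptions made for this example, not taken from any real log format.

```perl
use strict;
use warnings;

# Hypothetical log-style line, made up for this example.
my $line = 'Host: web01   Address: 192.168.1.20';

my ($host, $address);

# Testing the match first guards against reading stale $1/$2 values
# left over from an earlier successful match.
if ($line =~ /^Host:\s+(\S+)\s+Address:\s+(\S+)/) {
    # $1 belongs to the first opening parenthesis, $2 to the second.
    ($host, $address) = ($1, $2);
    print "host=$host address=$address\n";
}
else {
    print "no match\n";
}
```

If the pattern fails, the capture variables keep whatever an earlier successful match left in them, which is why the assignment sits inside the if block.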

By using a group to link together pieces of a pattern, it's possible to quantify that group as a whole. This is the simplest use of grouping and is incredibly powerful at the same time.

/^(ab)+$/

This regex will match 'ab', as well as 'abab', and 'abababababababababababab'.
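
To see where the quantified group stops matching, the sketch below runs the same regex against a few strings invented for the purpose, including some that should fail.

```perl
use strict;
use warnings;

# Candidate strings invented for this illustration.
my @candidates = ('ab', 'abab', 'aba', 'abba', '');

my %verdict;
for my $candidate (@candidates) {
    # The '+' applies to the whole group, so only whole "ab" pairs count.
    $verdict{$candidate} = $candidate =~ /^(ab)+$/ ? 'match' : 'no match';
    printf "[%s] %s\n", $candidate, $verdict{$candidate};
}
```

The empty string fails because '+' requires at least one occurrence of the group; 'aba' and 'abba' fail because the anchors leave no room for a partial or reordered pair.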

Character Classes

A character class is a set of characters, any one of which may occupy a given position. Suppose a line begins with a number. Combining the "beginning of string" meta-character '^' with a character class that represents any numeric character matches that line:

/^[0-9]/

It may be more desirable to match lines with one or more numeric characters at the beginning:

/^[0-9]+/

If more precision is required, it's possible to specify the number of digits that will satisfy our match by using a more specific quantifier:

/^[0-9]{1,6}/

This will match a line that begins with one to six digits. Surprisingly, this regex will still match lines that begin with seven, eight, nine, or more digits, because it is satisfied by the first six and ignores whatever follows. To match only lines that start with one to six digits, we need to tell the regex engine that the next character can't be a digit.

/^[0-9]{1,6}[^0-9]/

By using the Inverted Character Class, we can be explicit and avoid any confusion between our interpretation and the regex's matching.
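
The difference between the two patterns can be checked with a short sketch. The input lines are made up; the third one is included to show a side effect worth knowing about: because [^0-9] must match an actual character, a line consisting only of digits no longer matches.

```perl
use strict;
use warnings;

# Made-up input lines: six digits, seven digits, and digits only.
my @lines = ('123456 widgets', '1234567 widgets', '123456');

my %result;
for my $line (@lines) {
    my $loose  = $line =~ /^[0-9]{1,6}/       ? 'yes' : 'no';
    my $strict = $line =~ /^[0-9]{1,6}[^0-9]/ ? 'yes' : 'no';
    $result{$line} = "$loose/$strict";
    printf "%-18s loose=%-3s strict=%s\n", $line, $loose, $strict;
}
```

The loose pattern accepts all three lines, while the strict pattern accepts only the first.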

The caret in the Character Class serves to invert the class. Between the Character Class delimiters ( [ ] ), only three characters act as meta-characters:

^ (as the first character only) Invert the character class
\ Escape the next character
- Range modifier, translates a-d to abcd

A simple example:

/[ab]/

Matches a or b

/[^ab]/

Matches any character that's NOT a or b
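
The three in-class meta-characters can be exercised together in a small sketch. The sample characters and the labels used as hash keys are arbitrary choices for this illustration.

```perl
use strict;
use warnings;

# Sample characters picked for illustration.
my @chars = ('c', 'x', '-', '^', ']');

# Each class below exercises one of the in-class meta-characters.
my %class = (
    'range [a-d]'     => qr/^[a-d]$/,    # '-' builds the range a, b, c, d
    'escaped [a\-d]'  => qr/^[a\-d]$/,   # '\-' is a literal hyphen
    'inverted [^a-d]' => qr/^[^a-d]$/,   # a leading '^' inverts the class
    'literal [a^]'    => qr/^[a^]$/,     # '^' is literal when it is not first
);

my %hits;
for my $name (sort keys %class) {
    # Collect the sample characters that this class accepts.
    $hits{$name} = join ' ', grep { $_ =~ $class{$name} } @chars;
    print "$name => $hits{$name}\n";
}
```

Only the position of the caret and the presence of the backslash change what each class accepts.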

Breaking down the more complex regex from earlier, /^[0-9]{1,6}[^0-9]/, the engine reads it as:

^

Anchor, tells the engine to start the match at the beginning of the string

followed by ..

[0-9]{1,6}

Match 1 to 6 characters that are any of the following: 0,1,2,3,4,5,6,7,8,9

followed by ..

[^0-9]

any character that is not one of the following: 0,1,2,3,4,5,6,7,8,9

Perl provides aliases to commonly used character classes to save typing and reduce some of the complexity of regular expression authoring.

Alias  Meaning                                                Equivalent Character Class
\d     Matches a digit                                        [0-9]
\D     Matches a non-digit                                    [^0-9]
\w     Matches a word character (alphanumeric or underscore)  [a-zA-Z0-9_]
\W     Matches a non-word character                           [^a-zA-Z0-9_]
\s     Matches a whitespace character                         [ \t\r\n\f]
\S     Matches a non-whitespace character                     [^ \t\r\n\f]

By using these aliases, it's possible to rewrite the previous example as:

/^\d{1,6}\D/
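
As a quick check that the alias form behaves like the explicit one, the sketch below runs both against a few made-up lines. It assumes plain ASCII input; on Unicode strings \d can also match digits from other scripts, so the two forms are not strictly identical there.

```perl
use strict;
use warnings;

# Made-up input lines for illustration.
my @lines = ('42 packets dropped', '1234567 packets dropped', 'no digits here');

my %matched;
my $agree = 1;
for my $line (@lines) {
    my $long  = $line =~ /^[0-9]{1,6}[^0-9]/ ? 1 : 0;   # explicit classes
    my $short = $line =~ /^\d{1,6}\D/        ? 1 : 0;   # alias form
    $agree = 0 if $long != $short;
    $matched{$line} = $short;
    printf "%-24s explicit=%d alias=%d\n", $line, $long, $short;
}
print $agree ? "both forms agree\n" : "forms disagree\n";
```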






