October 22, 2014

Regular Expressions Primer

  • March 24, 2004
  • By Brad Lhotsky

Common Character Class Gotcha

In an attempt to match an IP address, which consists of four numbers ranging from 0 to 255 separated by periods, programmers often try something along the lines of the following:

/([0-255]\.){3}[0-255]/

Even on close examination, it can be difficult to see why this regex does not match what the programmer intended. When using a character class, the key word is character. In this context, the regex engine is not concerned with numbers, but with characters. What this regex optimizes to is:

/([0125]\.){3}[0125]/

This is most assuredly not what was intended. The range operator inside a character class reads "0-255" as the range "0-2" followed by the characters "5" and "5". The result is equivalent to "0125", because duplicate entries in a character class are optimized out. A regex that properly matches an IP address is very complicated and beyond the scope of this article. Assuming no one is attempting to enter IPs in the 888.888.888.0/24 range, a programmer might construct this regex:

/(\d{1,3}\.?){4}/

Stay tuned for an in-depth discussion on this regex.
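The character-class behavior described above can be checked directly. What follows is a minimal sketch in Python; the language, the helper name chars_matched_by, and the test strings are choices made here for illustration and are not part of the original article. It lists which single digits [0-255] accepts and then tries the looser \d{1,3} pattern against a few inputs.

```python
import re

def chars_matched_by(char_class, candidates="0123456789"):
    """Return the candidate characters that the given character class matches."""
    pattern = re.compile(char_class)
    return "".join(ch for ch in candidates if pattern.fullmatch(ch))

# [0-255] is the range 0-2 plus the literal 5 (listed twice), i.e. [0125].
digits_naive = chars_matched_by(r"[0-255]")
digits_equivalent = chars_matched_by(r"[0125]")
print(digits_naive, digits_equivalent)

# The looser pattern from the article. fullmatch is used here (an assumption
# of this sketch) so that the entire string has to fit the pattern.
loose_ip = re.compile(r"(\d{1,3}\.?){4}")
for candidate in ("192.168.1.1", "888.888.888.0", "1234"):
    print(candidate, bool(loose_ip.fullmatch(candidate)))
```

All three candidates are accepted. Because every period is optional, the loose pattern is satisfied even by a bare run of four digits, which hints at why the article promises a closer look at it.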

Quantifiers

Quantifiers allow a programmer to specify how many consecutive instances of a group or character a pattern should match, either as a fixed count or as an open-ended range. There are four quantifiers:

? Matches 0 or 1 consecutive instances of the previous group or character
* Matches 0 or more consecutive instances of the previous group or character
+ Matches 1 or more consecutive instances of the previous group or character
{a,z}    Range quantifier; specifies a minimum (a) and a maximum (z) number of consecutive instances to match

Zero or More (*)

The '*' quantifier is almost always misused. Luckily, in most cases the damage is negligible, but there can still be some unexpected results if a programmer slips. Given the lines:

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a*/

Surprisingly, all three lines will match. The regex engine will always be able to find "zero or more a's" in any line of text you send it, because zero a's is satisfied by the empty string.

While this may not seem to be incredibly useful, it actually is. There are times when a programmer needs to match some text if it's there, or just have an empty string or null if that text isn't found. This is where the "zero or more" quantifier earns its keep.
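To see what "zero or more" actually captures, here is a minimal sketch in Python. The helper name matched_text is hypothetical, and Python's re module is assumed to stand in for whatever regex engine the reader uses. It returns the text of the first match for each of the three lines.

```python
import re

LINES = ["a dog runs", "the dog jumps", "aaa is a car club"]

def matched_text(pattern, line):
    """Return the text of the first match, or None if the pattern does not match."""
    m = re.search(pattern, line)
    return m.group() if m else None

# a* always succeeds; when the line does not start with an 'a',
# the match is simply the empty string at position 0.
star_results = [matched_text(r"a*", line) for line in LINES]
print(star_results)
```

Line 2 still counts as a match, but the matched text is empty. That empty result is the "empty string if the text isn't found" behavior described above.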

One or More (+)

The '+' quantifier is almost always what a programmer means when they use the '*' quantifier. Given the same lines as in the previous example:

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a+/

In this case, only lines 1 and 3 will successfully match, because line 2 does not contain an 'a' at all. Often, this is what was intended when a '*' was used.
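The difference between the two quantifiers can be compared side by side. The following is a small sketch in Python; the function name matching_line_numbers and the dictionary layout are assumptions made for this illustration.

```python
import re

# The three sample lines, keyed by their line number in the article.
LINES = {1: "a dog runs", 2: "the dog jumps", 3: "aaa is a car club"}

def matching_line_numbers(pattern, lines):
    """Return the numbers of the lines in which the pattern matches somewhere."""
    return [number for number, text in lines.items() if re.search(pattern, text)]

plus_matches = matching_line_numbers(r"a+", LINES)  # needs at least one 'a'
star_matches = matching_line_numbers(r"a*", LINES)  # zero a's is enough
print(plus_matches, star_matches)
```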

Range Quantifier ({a,z})

The range quantifier gives the programmer finer-grained control over the number of consecutive instances to match. Given the same lines:

  1. a dog runs
  2. the dog jumps
  3. aaa is a car club

Which lines will successfully match the following regex?

/a{2,5}/

In this case, only line 3 will successfully match. The regex engine is looking for 2–5 consecutive a's.
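A short Python sketch of the same check follows. The extra seven-character input is an assumption added here to show what happens when a run is longer than the maximum; it is not from the article.

```python
import re

lines = ["a dog runs", "the dog jumps", "aaa is a car club"]
range_pattern = re.compile(r"a{2,5}")

# findall lists every non-overlapping run of 2 to 5 consecutive a's per line.
runs_per_line = [range_pattern.findall(line) for line in lines]
print(runs_per_line)

# A run longer than the maximum is not rejected; the unanchored pattern
# just consumes it in pieces of at most five characters.
long_run = range_pattern.findall("a" * 7)
print(long_run)
```

Since the pattern is not anchored, the upper bound limits the length of each match rather than the length of the run in the text.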

Greedy versus Non-Greedy

Quantifiers come in two flavors, "Greedy" and "Non-Greedy." The only difference between the two is their relative ambition to match. Most regex bottlenecks are a direct result of poorly written Greedy or Non-Greedy matches.

The regex engine wants to match every pattern it's passed, and it will do everything in its power to match that regex. This is why a regex can slow down a program so easily. Misunderstanding how the regex engine behaves could result in a large regex being evaluated millions of times over a large text sample.
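As a rough illustration of the two flavors, the sketch below uses Python's re module, where appending ? to a quantifier makes it non-greedy (the same suffix is used by Perl-compatible engines). The sample string and variable names are assumptions made for this example and are not taken from the article.

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, then backs off only far enough
# to find a closing '>', which here is the very last one in the string.
greedy = re.search(r"<.+>", sample).group()

# Non-greedy: .+? grabs as little as it can, stopping at the first '>'.
non_greedy = re.search(r"<.+?>", sample).group()

print(greedy)
print(non_greedy)
```

The greedy form returns the whole sample, while the non-greedy form returns only the first tag. Both patterns succeed; they differ in how much text they claim and in how much backtracking the engine performs to get there.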




