http://www.developer.com/

Using Regular Expressions to Parse for Email Addresses


April 12, 2005

This final installment in my series on using the .NET regular expressions classes from Managed C++ applies much of what the previous installments taught to create production-quality regular expression patterns for validating email addresses and parsing bodies of text for all email addresses. The first section begins with a basic pattern that, while not all-encompassing, causes the regular expressions parser to match the majority of email addresses in a supplied input string. The remainder of the column presents two more complex patterns that catch almost any email address format, and it fully explains the components of each pattern.

Basic Email Pattern

First, examine a generic function, GetEmailAddresses, that takes an input string as its only argument and returns an array of the email addresses found in it. The email regular expression pattern used in this function is very basic, but it "catches" the majority of email addresses you run across:
using namespace System;
using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;
using namespace System::Collections;

...

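ArrayList* GetEmailAddresses(String* input)
{
  // Holds every email address found in the input string.
  ArrayList* al = new ArrayList();

  try
  {
    // The period is escaped (\\.) so that it matches a literal dot;
    // an unescaped period would match any single character.
    MatchCollection* mc = 
      Regex::Matches(input,
                     S"[\\w]+@[\\w]+\\.[\\w]{2,3}");

    for (int i=0; i < mc->Count; i++)
      al->Add(mc->Item[i]->Value);
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }

  // Returned on every path; the list is empty if nothing matched
  // or if an exception was reported above.
  return al;
}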

The first thing the GetEmailAddresses function does is construct an ArrayList object. This object is returned to the caller of the function, and it holds all of the email addresses located in the passed input string. From there, the static Regex::Matches method is called with the desired email pattern (which I'll cover shortly). The result of the Matches method is a MatchCollection. The MatchCollection object is then enumerated with a for loop, each "match" (representing an email address) is added to the ArrayList object, and finally the ArrayList is returned to the caller.

The GetEmailAddresses function can be used as follows where the returned ArrayList object is enumerated and each email address is written to the console:

ArrayList* addrs = 
  GetEmailAddresses(S"I can be reached at tom@archerconsultinggroup.com "
                    S"or info@archerconsultinggroup.com.");

for (int i = 0; i < addrs->Count; i++)
{
  Console::WriteLine(addrs->Item[i]);
}
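
If you don't have a Managed C++ toolchain at hand, roughly the same find-all loop can be tried in standard C++. The following is only a sketch under stated assumptions: it uses std::regex (with its default ECMAScript grammar) in place of the .NET Regex class, the name get_email_addresses is hypothetical, and the period in the pattern is escaped so that it matches a literal dot.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical standard-C++ counterpart of GetEmailAddresses.
// Assumption: std::regex stands in for the .NET Regex class.
std::vector<std::string> get_email_addresses(const std::string& input)
{
    // The period is escaped (\.) so that it matches only a literal dot.
    static const std::regex pattern(R"([\w]+@[\w]+\.[\w]{2,3})");

    std::vector<std::string> found;
    // sregex_iterator visits each non-overlapping match, left to right.
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it)
    {
        found.push_back(it->str());
    }
    return found;
}
```

Passing the sample sentence used above to this function should give back the same two addresses, in the order in which they appear.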

The pattern used in the GetEmailAddresses function correctly yields the two addresses I specified in the sample call. The following is the pattern itself, written with single backslashes (the doubled backslashes are C++ string escaping, not part of the pattern) and with the period escaped so that it matches a literal dot:

[\w]+@[\w]+\.[\w]{2,3}

If you've read the previous installments of this series, you hopefully can read this pattern. Here's a breakdown of each component of the pattern:

  • [\w]+: The \w represents any word character (A-Z, a-z, 0-9, and the underscore; by default the .NET parser also accepts Unicode letters and digits). The plus modifier specifies one or more. The brackets are optional here, because \w+ means exactly the same thing. Therefore, [\w]+ tells the parser to match on one or more word characters.
  • @: Tells the parser to look for the at-sign character (@).
  • [\w]+: As before, tells the parser to match on one or more word characters.
  • \.: Match on the period character. The backslash matters: an unescaped period matches any single character, not just a period.
  • [\w]{2,3}: Once again, the pattern specifies a search for word characters, with the difference here being that the {x,y} modifier tells the parser to search for a specific number of word characters (two or three, in this case).
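
One detail in this breakdown is easy to check by hand: the period. A minimal sketch in standard C++ follows, assuming std::regex as a stand-in for the .NET Regex class; the helper found_in and the two pattern names are hypothetical and exist only for this comparison.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches somewhere inside `text`.
bool found_in(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex(pattern));
}

// Unescaped period: "." stands for any single character.
const std::string loose_pattern  = R"([\w]+@[\w]+.[\w]{2,3})";

// Escaped period: "\." stands for a literal period only.
const std::string strict_pattern = R"([\w]+@[\w]+\.[\w]{2,3})";
```

With these definitions, found_in("tom@archercom", loose_pattern) succeeds even though the text contains no period at all, because the unescaped period is free to consume one of the letters. The same call with strict_pattern fails, while both patterns accept tom@archer.com.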



Figure 1. Example of Parsing a Body of Text for Email Address & Domain Information

Advanced Email Regular Expressions Pattern

While the previous email pattern would catch most of the email addresses, it is far from complete. This section illustrates a step at a time how to build a much more robust email pattern that will catch just about every valid email address format. To begin with, the following pattern catches "exact matches". In other words, you shouldn't use it to parse a document, but rather to validate a single email address:
^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$
Personally, I find it easier to read a pattern by dissecting it into components and then attempting to understand each of the components as they relate to the overall pattern. Having said that, this pattern breaks down to the following parts:
  • ^: The caret character at the beginning of a pattern tells the parser to match from the beginning of the input, since the focus of this pattern is to validate a single email address.
  • [^@]+: When the caret character is the first character within brackets, it tells the parser to search for everything that is not one of the specified characters. Therefore, here the pattern specifies a search to locate all text that is not an at-sign (@) character. (The plus sign tells the parser to find one or more of these non at-sign characters leading up to the next component of the expression.)
  • @: Match on the at-sign literal.
  • ([-\w]+\.)+: This part of the pattern matches everything from the @ up to the top-level domain (e.g., .com, .edu, etc.). It is repeated because the domain frequently has more than one label, as in an address of the form tom@mail.archerconsultinggroup.com. The first part, [-\w]+, tells the parser to find one or more word characters or dashes. The \. matches the literal period that follows those characters. Finally, all of that is placed within parentheses and modified with the plus operator to specify one or more instances of the entire group.
  • [A-Za-z]{2,4}$: Matches the terminating part of the expression, the top-level domain. At this point, reading this part of the pattern should be pretty easy. It simply requires between two and four letters. The $ character tells the parser that these letters must be the end of the input string. (In other words, $ denotes end of input, just as ^ denotes beginning of input.)
To test for "exact matches", you need only a very simple function like the following:
using namespace System::Text::RegularExpressions;
...
bool ValidateEmailAddressFormat(String* email)
{
  Regex* rex = 
    new Regex(S"^[^@]+@([-\\w]+\\.)+[A-Za-z]{2,4}$");
  return rex->IsMatch(email);
}
You then can call this function like this:
bool b;

// SUCCESS
b = ValidateEmailAddressFormat("tom.archer@archerconsultinggroup.com");

// FAILURE!! (the at-sign is missing)
b = ValidateEmailAddressFormat("tom.archerarcherconsultinggroup.com");
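
The same validation can be exercised in standard C++ to see which inputs the pattern accepts and rejects. This is a sketch under assumptions rather than the .NET implementation: std::regex replaces the Regex class, and the name validate_email_address_format is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical standard-C++ counterpart of ValidateEmailAddressFormat.
// Assumption: std::regex stands in for the .NET Regex class.
bool validate_email_address_format(const std::string& email)
{
    // regex_match succeeds only if the pattern consumes the whole string,
    // so the ^ and $ anchors are kept here just to mirror the article.
    static const std::regex pattern(R"(^[^@]+@([-\w]+\.)+[A-Za-z]{2,4}$)");
    return std::regex_match(email, pattern);
}
```

Trying a few inputs shows both the strengths and the limits of the pattern. An address with several domain labels, such as tom@mail.archerconsultinggroup.com, is accepted, and a string with no at-sign is rejected. On the other hand, "a b@x.com" is accepted because [^@]+ allows spaces, and tom@archer.museum is rejected because {2,4} caps the top-level domain at four letters.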
Now, let's tweak the pattern so that it can be used to parse a document for all of its contained email addresses:
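([-.\w]+@(?:[-\w]+\.)+[A-Za-z]{2,4})+

The main differences between this pattern and the previous one are the following:

  • I removed the beginning-of-line metacharacter (^), since the pattern will be used to search through an entire string for all email addresses (instead of being used to validate the entire string for a single email address).
  • I replaced [^@]+ with the character class [-.\w]+, so that the part before the at-sign is limited to hyphens, periods, and word characters and a match cannot swallow the text that precedes an address. (Inside brackets, a caret negates the class only when it is the first character; anywhere else it is a literal caret.)
  • I used the ?: capture inhibitor operator so that I don't capture unneeded submatches.
  • As with the beginning-of-line metacharacter, I also removed the end-of-line metacharacter ($).
  • I wrapped the whole expression in an additional group, which captures each complete address. (Locating every address in the input string is the job of the Regex::Matches method, so the trailing plus on that group is not strictly required.)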
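
The effect of the ?: operator and of the outer group can be seen in a short standard C++ sketch. Several things here are assumptions: std::regex stands in for the .NET Regex class, the names scan_pattern, capturing_pattern, and scan_for_addresses are hypothetical, and the part before the at-sign uses the class [-.\w] (hyphens, periods, and word characters only).

```cpp
#include <regex>
#include <string>
#include <vector>

// Document-scanning pattern. Assumption of this sketch: the local part
// is limited to hyphens, periods, and word characters.
const std::regex scan_pattern(
    R"(([-.\w]+@(?:[-\w]+\.)+[A-Za-z]{2,4})+)");

// The same pattern without ?:, so each domain label is captured as well.
const std::regex capturing_pattern(
    R"(([-.\w]+@([-\w]+\.)+[A-Za-z]{2,4})+)");

// Collects the text captured by group 1 (the full address) for every match.
std::vector<std::string> scan_for_addresses(const std::string& text)
{
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), scan_pattern), end;
         it != end; ++it)
    {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

Calling mark_count() on the two regex objects reports one capture group for scan_pattern and two for capturing_pattern, which is the only difference the ?: operator makes. Both patterns find the same addresses.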

So the natural question at this point would be "Is this pattern guaranteed to find every single valid email address?" After doing quite a bit of research on this issue, I found that an all-encompassing email regular expression pattern is almost 6,000 bytes in length! However, that pattern is necessary only to catch the minuscule percentage of email addresses that the patterns illustrated in this article miss. The two advanced patterns that I've covered will catch 99 percent of all email addresses.

Regular Expressions: A Lot of Ground to Cover

My original intention for a series on using the .NET regular expressions classes from Managed C++ was to simply cover some basic patterns and usages. However, the more I wrote, the more I realized needed to be covered. So it turned out to be a much-longer-than-planned series. It covered splitting strings, finding matches within a string, using regular expression metacharacters, grouping, creating named groups, working with captures, performing advanced search-and-replace functions, and finally writing a complex email pattern.

Hopefully along the way, those of you who are new to regular expressions saw just how powerful they can be. Just think of how much manual text parsing code would be necessary to parse a block of text for (almost) every conceivable email address. Compare that with the single line of code it takes with regular expressions! For those who wish to learn still more about working with the .NET regular expressions classes, my book, Extending MFC Applications with the .NET Framework, provides a full 50-page chapter on the subject and introduces half a dozen demo applications with code that you can easily plug into your own production code.

Acknowledgements

I would like to thank Don J. Plaistow, a Perl and Regular Expressions guru who helped me tremendously when I first started learning regular expressions. Don's help was especially valuable with regard to the email patterns in this article.

About the Author

Tom Archer owns his own training company, Archer Consulting Group, which specializes in educating and mentoring .NET programmers and providing project management consulting. If you would like to find out how the Archer Consulting Group can help you reduce development costs, get your software to market faster, and increase product revenue, contact Tom through his Web site.
