Using Regular Expressions to Parse for Email Addresses

  • April 12, 2005
  • By Tom Archer
This final installment in my series on using the .NET regular expression classes from Managed C++ builds on the previous installments to create production-quality regular expression patterns for validating email addresses and for parsing bodies of text for every email address they contain. The first section begins with a basic pattern that, while not all-encompassing, matches the majority of email addresses in a supplied input string. The remainder of the column presents two more complex patterns that catch almost any email address format, and it fully explains the components of each pattern.

Basic Email Pattern

First, examine a generic function, GetEmailAddresses, that takes an input string as its only argument and returns an array of the email addresses it finds. The regular expression pattern used in this function is very basic, but it "catches" the majority of email addresses you run across:
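using namespace System;
using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;
using namespace System::Collections;

...

ArrayList* GetEmailAddresses(String* input)
{
  // Holds every address found; returned (possibly empty) on all paths
  ArrayList* al = new ArrayList();

  try
  {
    // Find every substring of the input that matches the basic pattern
    MatchCollection* mc = 
      Regex::Matches(input,
                     S"[\\w]+@[\\w]+.[\\w]{2,3}");

    // Copy the text of each match into the result list
    for (int i=0; i < mc->Count; i++)
      al->Add(mc->Item[i]->Value);
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }

  return al;
}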

The first thing the GetEmailAddresses function does is construct an ArrayList object. This object is returned to the caller of the function, and it holds all of the email addresses located in the passed input string. From there, the static Regex::Matches method is called with the desired email pattern (which I'll cover shortly). The result of the Matches method is a MatchCollection. The MatchCollection object is then enumerated with a for loop, each "match" (representing an email address) is added to the ArrayList object, and finally the ArrayList is returned to the caller.
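
For readers who want to try the same flow outside of .NET, the following is a minimal sketch in standard C++11 that swaps the .NET Regex class for std::regex. The function name get_email_addresses and the use of std::vector in place of ArrayList are assumptions made for illustration; they are not part of the original sample.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical standard C++ counterpart of GetEmailAddresses. It uses
// std::regex (default ECMAScript grammar) instead of the .NET Regex class.
std::vector<std::string> get_email_addresses(const std::string& input)
{
    // Same basic pattern as the article, with the backslashes doubled
    // because this is an ordinary C++ string literal.
    static const std::regex pattern("[\\w]+@[\\w]+.[\\w]{2,3}");

    std::vector<std::string> found;

    // Visit each non-overlapping match in order, much as the for loop
    // over the MatchCollection does in the Managed C++ version.
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it)
    {
        found.push_back(it->str());
    }

    return found;
}
```

The iterator pair plays the role of the MatchCollection: each dereference yields one match, and str() returns the matched text.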

The GetEmailAddresses function can be used as follows where the returned ArrayList object is enumerated and each email address is written to the console:
ArrayList* addrs = 
  GetEmailAddresses(S"I can be reached at tom@archerconsultinggroup.com "
                    S"or info@archerconsultinggroup.com.");

for (int i = 0; i < addrs->Count; i++)
{
  Console::WriteLine(addrs->Item[i]);
}

The pattern used in the GetEmailAddresses function correctly yields the two addresses I specified in the sample call. The following is the pattern itself (with the double backslashes replaced by single backslashes, as that's specific to C++ and not really part of the actual pattern):

[\w]+@[\w]+.[\w]{2,3}
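
The point about doubled backslashes can be illustrated in standard C++11, which (unlike the Managed C++ used in this column) offers raw string literals. This is only a sketch; the variable and function names are hypothetical.

```cpp
#include <string>

// Ordinary literal: every backslash in the pattern must be doubled,
// because the compiler consumes one level of escaping.
const std::string escaped_literal = "[\\w]+@[\\w]+.[\\w]{2,3}";

// Raw literal (C++11): the pattern is written exactly as it reads above.
const std::string raw_literal = R"([\w]+@[\w]+.[\w]{2,3})";

// Both spellings produce the same characters at run time.
bool same_pattern()
{
    return escaped_literal == raw_literal;
}
```

Either way, the regular expression engine receives single backslashes; the doubling exists only in the source code.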

If you've read the previous installments of this series, you hopefully can read this pattern. Here's a breakdown of each component of the pattern:

  • [\w]+: The \w represents any word character (a letter, a digit, or the underscore). The brackets place it in a character class, and the plus modifier specifies one or more. Therefore, [\w]+ tells the parser to match on one or more word characters. (The brackets are optional here; \w+ is equivalent.)
  • @: Tells the parser to look for the at-sign (@) character.
  • [\w]+: As before, tells the parser to match on one or more word characters.
  • .: Intended to match the period character. Note that an unescaped period actually matches any single character except a newline; to match only a literal period, write it as \. in the pattern (\\. in a C++ string literal).
  • [\w]{2,3}: Once again, the pattern specifies a search for word characters, with the difference here being that the {x,y} modifier tells the parser to search for a specific number of word characters: two or three, in this case.
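
The components above can be tried out one at a time. The following standard C++11 sketch uses std::regex as a stand-in for the .NET classes; first_match, basic_pattern, and escaped_period are hypothetical names introduced only for this experiment. It compares the pattern as written with a variant in which the period is escaped, since an unescaped period matches any single character rather than only a literal period.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` matched by `pattern`, or an
// empty string when nothing matches. Hypothetical helper for experiments.
std::string first_match(const std::string& text, const std::string& pattern)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        return m.str();
    return "";
}

// The pattern from the breakdown above, exactly as written.
const std::string basic_pattern = R"([\w]+@[\w]+.[\w]{2,3})";

// A variant in which the period is escaped so that it matches only
// a literal '.' character.
const std::string escaped_period = R"([\w]+@[\w]+\.[\w]{2,3})";
```

With these definitions, first_match("tom@example-com", basic_pattern) returns the whole input because the unescaped period accepts the hyphen, while the escaped_period variant finds no match. Both patterns return only "last@example.com" for the input "first.last@example.com", because \w does not match the period in the local part; this is one of the cases the basic pattern does not cover.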



Figure 1. Example of Parsing a Body of Text for Email Address & Domain Information