http://www.developer.com/

Using Regular Expressions Groups to Isolate Sub-Matches


March 16, 2005

My most recent articles introduced several basic functions that you can perform with regular expressions, such as string splitting and using metacharacters for matching. However, before you can really get into the more powerful tasks that can be performed with regular expressions, you need to understand groups and their purpose. Grouping has two purposes:
  • Grouping several parts of a pattern together so that the entire group can be acted on via modifiers, operators, or quantifiers
  • Denoting "sub-matches" where the pattern dictates that a part of the entire match must be isolated. (An example—which I'll show shortly—is a North American telephone number where the pattern would help locate the phone number and the grouping syntax would enable you to extract just the area code.)
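
The two purposes can be seen side by side in a minimal sketch. It uses the standard C++ `<regex>` library rather than the .NET Regex class this series is about, so treat it as an analogue under that assumption; the helper names MatchesWhole and FirstGroup are hypothetical.

```cpp
#include <regex>
#include <string>

// Purpose 1: a quantifier placed after a group applies to the whole group.
// "ab+" means 'a' followed by one or more 'b'; "(ab)+" means one or more "ab".
// Returns true only if the ENTIRE input matches the pattern.
bool MatchesWhole(const std::string& input, const std::string& pattern)
{
    return std::regex_match(input, std::regex(pattern));
}

// Purpose 2: a group isolates a sub-match.
// Returns the text captured by group 1 of the first match, or an empty
// string if the input contains no match or the pattern defines no group.
std::string FirstGroup(const std::string& input, const std::string& pattern)
{
    const std::regex rex(pattern);
    std::smatch match;
    if (std::regex_search(input, match, rex) && match.size() > 1)
        return match[1].str();
    return std::string();
}
```

With these helpers, "abab" matches "(ab)+" in full but not "ab+", and FirstGroup applied to a phone number with the pattern "(\\d{3})-\\d{3}-\\d{4}" yields only the area code.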

In my two previous articles, I introduced the MatchCollection and Match classes. Let's look at how those classes and the classes for groups and captures fit into the .NET regular expression class hierarchy:

  • Each Regex object can produce a MatchCollection object (which contains Match objects) through its Matches method.
  • Each Match object has a GroupCollection object (which contains Group objects).
  • Each Group object has a CaptureCollection object (which contains Capture objects).

The following figure (modified from my book, Extending MFC Applications with the .NET Framework) graphically illustrates the relationship between the various .NET regular expression classes.



Figure 1. .NET Regular Expression Class Hierarchy

Defining and Enumerating Groups

Groups are denoted simply by placing parentheses around the desired part of the pattern. Take a simple North American telephone number pattern as an example:
\d{3}-\d{3}-\d{4}
The \d represents any decimal digit (0-9), and the {n} quantifier tells the parser to match exactly n occurrences of the preceding element. To isolate the area code, you simply put parentheses around that part of the pattern:
(\d{3})-\d{3}-\d{4}
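
To see what the added parentheses change, the following sketch counts the groups in each version of the pattern. It is written against the standard C++ `<regex>` library as a stand-in for the .NET classes, and the helper names GroupCount and SubMatchCount are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Number of capture groups that the pattern itself defines.
unsigned GroupCount(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}

// Number of entries in the first match's result set: the entire match plus
// one entry per capture group. Returns 0 when the input contains no match.
std::size_t SubMatchCount(const std::string& input, const std::string& pattern)
{
    const std::regex rex(pattern);
    std::smatch match;
    return std::regex_search(input, match, rex) ? match.size() : 0;
}
```

The pattern without parentheses defines no groups, so a match carries a single entry (the whole match). The pattern with parentheses around the area code defines one group, so a match carries two entries.
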
Using the hierarchy shown in Figure 1, you can surmise that a simple loop will display all the groups from a match. The following generic function takes as its parameters an input string to parse and a pattern to use:
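// Namespaces used by the functions in this tip
using namespace System;
using namespace System::Text;
using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;

// Lists every match of pattern in input, followed by that match's groups
void DisplayGroups(String* input, String* pattern)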
{
  try
  {
    StringBuilder* results = new StringBuilder();

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input); 
         match->Success; 
         match = match->NextMatch())
    {
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup {0} at {1}\r\n",
                              (groups->Item[i]->Value),
                              __box(groups->Item[i]->Index));

      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

Once the regular expression object has been created, the function enumerates the matches and, for each match, enumerates its groups. To test this code, call the function as follows, passing it a text string containing some phone numbers and the pattern to use (remember to define the area code group with parentheses, and note that each backslash is doubled inside the C++ string literal):

DisplayGroups(S"My phone numbers are 770-555-1212 and 404-555-1212", 
              S"(\\d{3})-\\d{3}-\\d{4}");

Figure 2 shows the results of using this example with the DisplayGroups function.


Figure 2. You Can Easily Enumerate the Groups of a Regular Expression Match
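
For readers who want the same enumeration in text form, here is a rough equivalent of DisplayGroups. It is a sketch that assumes the standard C++ `<regex>` library instead of the .NET Regex classes, and it returns the report as a string rather than showing a message box. The name DescribeGroups is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>

// Builds the same kind of report as DisplayGroups: one line per match,
// followed by one indented line per group of that match.
std::string DescribeGroups(const std::string& input, const std::string& pattern)
{
    const std::regex rex(pattern);
    std::ostringstream results;

    // for all the matches
    for (std::sregex_iterator it(input.begin(), input.end(), rex), end;
         it != end; ++it)
    {
        const std::smatch& match = *it;
        results << "Match " << match.str()
                << " at " << match.position() << "\n";

        // for all of THIS match's groups (index 0 is the entire match)
        for (std::size_t i = 0; i < match.size(); ++i)
        {
            results << "\tGroup " << match[i].str()
                    << " at " << match.position(i) << "\n";
        }
    }
    return results.str();
}
```

Called with the sample sentence and pattern used above, this sketch reports the match 770-555-1212 at offset 21 with groups 770-555-1212 and 770, then the match 404-555-1212 at offset 38 with groups 404-555-1212 and 404.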

Extracting Specific Groups

Note from Figure 2 that the first group object in each group collection is the entire match. Therefore, any groups you define (by placing parentheses in the pattern) start at the second group object, index 1, in the group collection. Since the DisplayGroups function does most of what you need, you can modify it slightly to create a function, ExtractAreaCodes, that is specific to extracting area codes from a text value:
void ExtractAreaCodes(String* input)
{
  try
  {
    StringBuilder* results = new StringBuilder();
    results->AppendFormat(S"The Area Codes for '{0}' are:\r\n\r\n", input);

    String* pattern = S"(\\d{3})-\\d{3}-\\d{4}";

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input); 
         match->Success; 
         match = match->NextMatch())
    {
      results->AppendFormat(S"\t{0}\r\n", match->Groups->Item[1]->Value);
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

As you can see, there are only two major changes to the function. First, the pattern is hard-coded, because this function is dedicated to area codes. Second, the AppendFormat call on the results object now takes the following argument, which extracts the second group (index 1) from the match's group collection:

match->Groups->Item[1]->Value

Now you can test the ExtractAreaCodes function like this:

ExtractAreaCodes(S"My phone numbers are 770-555-1212 and 404-555-1212");

Doing so yields the expected results shown in Figure 3.


Figure 3. You Can Use Standard Collection Notation to Retrieve Specific Groups from a Match
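
The same extraction can be sketched outside of .NET. The version below assumes the standard C++ `<regex>` library and returns the area codes in a vector instead of displaying them. CollectAreaCodes is a hypothetical name, not part of the article's code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the area code (group 1) of every phone number found in input,
// in the order the numbers appear.
std::vector<std::string> CollectAreaCodes(const std::string& input)
{
    // Same pattern as ExtractAreaCodes: only the area code is grouped.
    static const std::regex phone("(\\d{3})-\\d{3}-\\d{4}");

    std::vector<std::string> codes;
    for (std::sregex_iterator it(input.begin(), input.end(), phone), end;
         it != end; ++it)
    {
        // Index 0 is the entire match, so the first defined group is index 1.
        codes.push_back((*it)[1].str());
    }
    return codes;
}
```

For the sample sentence used throughout this tip, the returned vector holds "770" followed by "404". Returning a collection rather than formatting the text inside the function keeps the extraction logic separate from how the result is displayed.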

Looking Ahead

You've learned the basics of how to define groups or sub-matches within a regular expression pattern and how to enumerate all the groups of a match as well as extract a specific group. At this point, you should be able to modify the code in this tip for situations where you need to both locate a particular pattern in a string using regular expressions and then extract specific sub-matches.

One weakness of the ExtractAreaCodes function is that it is hard-coded to retrieve the second object from the group collection. What if the pattern changes such that another group appears before the area code? The programmer would need to change the ExtractAreaCodes function, as well as any other functions that depend on the specific order of groups within the group collection. Therefore, the next tip will cover how to name groups (in order to avoid this code-maintenance hassle) and explain how to define "non-capturing" groups.
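
The index problem is easy to reproduce. The sketch below again assumes the standard C++ `<regex>` library as a stand-in for .NET. The helper GroupAt and the revised pattern with a leading country-code group are both hypothetical illustrations, not code from this series.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the text of the group at the given index for the first match,
// or an empty string if there is no match or no such group.
std::string GroupAt(const std::string& input,
                    const std::string& pattern,
                    std::size_t index)
{
    const std::regex rex(pattern);
    std::smatch match;
    if (std::regex_search(input, match, rex) && index < match.size())
        return match[index].str();
    return std::string();
}
```

With the original pattern "(\\d{3})-\\d{3}-\\d{4}", index 1 is the area code. If the pattern is later revised to "(\\d)-(\\d{3})-\\d{3}-\\d{4}" so that it also captures a leading country code, index 1 becomes the country code and the area code moves to index 2. Any caller that still asks for index 1 silently receives the wrong value.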

About the Author

Tom Archer owns his own training company, Archer Consulting Group, which specializes in educating and mentoring .NET programmers and providing project management consulting. If you would like to find out how the Archer Consulting Group can help you reduce development costs, get your software to market faster, and increase product revenue, contact Tom through his Web site.
