Using Regular Expression Groups to Isolate Sub-Matches

  • March 16, 2005
  • By Tom Archer
My most recent articles introduced several basic functions that you can perform with regular expressions, such as string splitting and using metacharacters for matching. However, before you can really get into the more powerful tasks that can be performed with regular expressions, you need to understand groups and their purpose. Grouping has two purposes:
  • Grouping several parts of a pattern together so that the entire group can be acted on via modifiers, operators, or quantifiers
  • Denoting "sub-matches," where a part of the entire match must be isolated. (An example, which I'll show shortly, is a North American telephone number: the pattern locates the phone number, and the grouping syntax lets you extract just the area code.)
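
To make the first purpose concrete, here is a minimal sketch of how a group changes what a quantifier applies to. It uses standard C++ `std::regex` rather than the .NET classes this article covers, so treat it as an analogue and not as the article's API; the helper name `matchesWhole` is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the ENTIRE input matches the pattern.
bool matchesWhole(const std::string& input, const std::string& pattern)
{
    return std::regex_match(input, std::regex(pattern));
}
```

With this helper, `matchesWhole("ababab", "(ab)+")` is true because the `+` applies to the whole group `ab`, while `matchesWhole("ababab", "ab+")` is false because without the parentheses the `+` applies only to the `b`.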

In my two previous articles, I introduced the MatchCollection and Match classes. Let's look at how those classes and the classes for groups and captures fit into the .NET regular expression class hierarchy:

  • Each Regex object can produce a MatchCollection object (which contains Match objects).
  • Each Match object has a GroupCollection object (which contains Group objects).
  • Each Group object has a CaptureCollection object (which contains Capture objects).

The following figure (modified from my book, Extending MFC Applications with the .NET Framework) graphically illustrates the relationship between the various .NET regular expression classes.



Figure 1. .NET Regular Expression Class Hierarchy

Defining and Enumerating Groups

Groups are denoted simply by placing parentheses around the desired part of the pattern. Take a simple North American telephone number pattern as an example:
\d{3}-\d{3}-\d{4}
The \d represents any decimal digit (0-9), and the {n} quantifier tells the parser to match exactly n occurrences of the preceding item. To isolate the area code, you simply put parentheses around that part of the pattern:
(\d{3})-\d{3}-\d{4}
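
As a rough illustration of what the parentheses buy you, the following sketch pulls the area code out of the first phone number in a string. It is written in standard C++ with `std::regex` as a stand-in for the .NET classes, and the function name `firstAreaCode` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the area code of the first phone number
// found in `text`, or an empty string if no number is present.
std::string firstAreaCode(const std::string& text)
{
    // Same pattern as above; the raw string literal avoids doubling
    // the backslashes.
    static const std::regex phone(R"((\d{3})-\d{3}-\d{4})");

    std::smatch m;
    if (std::regex_search(text, m, phone))
        return m[1].str();   // m[0] is the whole match, m[1] the first group
    return std::string();
}
```

For example, `firstAreaCode("Call 770-555-1212 today")` yields `"770"`, while a string with no phone number yields an empty string.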
Using the hierarchy shown in Figure 1, you can surmise that a simple loop will display all the groups from a match. The following generic function takes as its parameters an input string to parse and a pattern to use:
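// Namespaces this snippet relies on (Managed C++)
using namespace System;
using namespace System::Text;
using namespace System::Text::RegularExpressions;
using namespace System::Windows::Forms;

void DisplayGroups(String* input, String* pattern)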
{
  try
  {
    StringBuilder* results = new StringBuilder();

    Regex* rex = new Regex(pattern);

    // for all the matches
    for (Match* match = rex->Match(input); 
         match->Success; 
         match = match->NextMatch())
    {
      results->AppendFormat(S"Match {0} at {1}\r\n",
                            match->Value,
                            __box(match->Index));

      // for all of THIS match's groups (group 0 is the entire match)
      GroupCollection* groups = match->Groups;
      for (int i = 0; i < groups->Count; i++)
      {
        results->AppendFormat(S"\tGroup {0} at {1}\r\n",
                              (groups->Item[i]->Value),
                              __box(groups->Item[i]->Index));

      }
    }
    MessageBox::Show(results->ToString());
  }
  catch(Exception* pe)
  {
    MessageBox::Show(pe->Message);
  }
}

Once the regular expression object has been created, the function enumerates the matches and, for each match, enumerates its groups. Note that the first group (index 0) always represents the entire match; the groups you define with parentheses start at index 1. To test this code, make the following call, passing a text string containing some phone numbers (remember to define the area code group with parentheses):

DisplayGroups(S"My phone numbers are 770-555-1212 and 404-555-1212", 
              S"(\\d{3})-\\d{3}-\\d{4}");

Figure 2 shows the results of using this example with the DisplayGroups function.


Figure 2. You Can Easily Enumerate the Groups of a Regular Expression Match
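
For readers without a Managed C++ toolchain, the same enumeration can be sketched in standard C++. This is only an analogue under stated assumptions: it uses `std::regex` instead of the .NET classes, returns the report as a string instead of showing a message box, and the name `formatGroups` is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical standard-C++ analogue of DisplayGroups: builds the same
// kind of report, but returns it as a string.
std::string formatGroups(const std::string& input, const std::string& pattern)
{
    std::ostringstream results;
    const std::regex rex(pattern);

    // for all the matches
    for (std::sregex_iterator it(input.begin(), input.end(), rex), end;
         it != end; ++it)
    {
        const std::smatch& match = *it;
        results << "Match " << match.str()
                << " at " << match.position() << "\n";

        // for all of this match's groups (index 0 is the entire match)
        for (std::size_t i = 0; i < match.size(); ++i)
        {
            results << "\tGroup " << match[i].str()
                    << " at " << match.position(i) << "\n";
        }
    }
    return results.str();
}
```

Called with the sample input and pattern shown earlier, this sketch produces:

```
Match 770-555-1212 at 21
	Group 770-555-1212 at 21
	Group 770 at 21
Match 404-555-1212 at 38
	Group 404-555-1212 at 38
	Group 404 at 38
```

Each match lists two groups: the entire match at index 0, followed by the area code group defined by the parentheses.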




