
Using Regular Expressions to Search and Replace Text

  • April 7, 2005
  • By Tom Archer
A common task when dealing with user input or text files is searching through that input and replacing literals, special characters (such as carriage-return/line feed pairs in files), or patterns (such as phone numbers, contractions, etc.). In fact, I recently finished working on a chatterbot (an artificial-intelligence application that verbally responds to voice or keyboard input) where this very task was needed in order to "smooth out" the user's input into something that the bot could more readily understand and respond to.

As a result, I wrote this column about performing basic search and replace tasks on user input via the .NET regular expressions classes.

Replacing Literals

The simplest type of search-and-replace functionality is to replace literals—that is, to instruct the regular expressions engine to parse an input string for a given substring and replace it with another. For this purpose, the Regex class defines several overloaded instance and static methods. Let's look at a couple of examples to see how easy this is.

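In the following code, the function (ReplaceSimple) performs several simple replacements: it collapses multiple spaces into a single space and, in the lines shown commented out, reverses pronouns, to-be verbs, and possessive pronouns. (The chatterbot I worked on always reverses the sentence to make its answers more logical.)

#using <mscorlib.dll>
#using <System.dll>   // the Regex class lives in System.dll

using namespace System;
using namespace System::Text::RegularExpressions;

String* ReplaceSimple(String* input)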
{
  String* result = input;

  try
  {
    result = result->ToUpper();

    // collapse runs of two or more whitespace characters into one space
    result = Regex::Replace(result, S"\\s{2,}", S" ");

    // reverse pronouns
    // result = Regex::Replace(result, S"\\sI\\s", S" YOU ");

    // reverse to-be verbs
    // result = Regex::Replace(result, S"\\sAM\\s", S" ARE ");

    // reverse possessive pronouns
    // result = Regex::Replace(result, S"\\sMY\\s", S" YOUR ");
  }
  catch(Exception* ex)
  {
    Console::WriteLine(ex->Message);
  }

  return result;
}

Figure 1 shows an example of running this code snippet.



Figure 1. Example of Performing Simple Literal Replacement Using the Regex Class

As you can see, this is extremely easy. In fact, for the purely literal replacements I could just as easily have used the String::Replace method, whose syntax is almost identical (the difference being that the input string is not passed, because String::Replace is an instance method). Whitespace patterns such as \s{2,}, however, are not literals and do require the Regex class.
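
For readers working outside .NET, the same idea can be sketched in standard C++ with std::regex_replace. This is only an illustration under a few assumptions: the function name replace_simple is hypothetical, std::regex stands in for the .NET Regex class, and \b word boundaries are used in place of the \s delimiters shown above so that a word at the very start or end of the input is also matched.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical standard-C++ counterpart of ReplaceSimple (not the .NET API).
std::string replace_simple(std::string input)
{
    // Upper-case the input, mirroring the ToUpper() call in the original.
    std::transform(input.begin(), input.end(), input.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    // Collapse runs of two or more whitespace characters into one space.
    std::string result = std::regex_replace(input, std::regex(R"(\s{2,})"), " ");

    // Reverse pronouns, to-be verbs, and possessive pronouns.
    // \b (an assumption of this sketch) matches at word boundaries,
    // so "I" is found even when it is the first word of the input.
    result = std::regex_replace(result, std::regex(R"(\bI\b)"), "YOU");
    result = std::regex_replace(result, std::regex(R"(\bAM\b)"), "ARE");
    result = std::regex_replace(result, std::regex(R"(\bMY\b)"), "YOUR");

    return result;
}
```

With this sketch, the input "I  am   happy with my  bot" comes back as "YOU ARE HAPPY WITH YOUR BOT".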

Now, let's look at a text-replacement task that specifically takes advantage of regular expressions—using groups and substitution patterns.

Using Groups and Substitution Patterns

Previous columns discussed how to define groups during the parsing of an expression. One of the most powerful aspects of regular expressions is the ability to define a named group and then use that group in a search-and-replace scenario. For example, say that you need to parse a document, locate all text formatted a certain way, and then reformat it. Obviously, that's more involved than simply replacing the found text with a literal. It involves using the found text in more of a dynamic way. With regular expressions, you can accomplish this via substitution patterns.

Substitution patterns are essentially special characters that tell the parser how you wish to replace the found text. Table 1 lists the most commonly used substitution patterns.

Table 1. Commonly Used Substitution Patterns

Pattern     Meaning
${group}    Substitutes the text captured by the named group
$n          Substitutes the text captured by the group at index n
$$          Substitutes a literal dollar sign (needed because $ is the substitution pattern prefix)
$&          Substitutes the entire match
$`          Substitutes all the input text preceding the match
$'          Substitutes all the input text following the match
$+          Substitutes the last group captured
$_          Substitutes the entire input string
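
To make a few of these patterns concrete, here is a minimal sketch in standard C++ rather than .NET. It assumes std::regex with its default ECMAScript-style format rules, which handle $n, $$, $&, $` and $' but do not provide the .NET-specific ${group}, $+ and $_ forms. The helper name substitute is hypothetical; it applies a replacement string to the first run of digits in the input.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces the first run of digits in `input`
// using the given substitution pattern (std::regex format rules).
std::string substitute(const std::string& input, const std::string& replacement)
{
    // Group 1 captures the digits so that $1 can refer to them.
    static const std::regex digits(R"((\d+))");

    // format_first_only limits the replacement to the first match.
    return std::regex_replace(input, digits, replacement,
                              std::regex_constants::format_first_only);
}
```

For the input "cost: 25 USD", the replacement "[$&]" gives "cost: [25] USD" and "$$$1" gives "cost: $25 USD".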

While all of these patterns are useful to varying degrees, the two that you'll find yourself using the most when performing search-and-replace tasks are the first two. They allow you to capture a group (named or numbered) during the match and then use the captured text in the replacement. To illustrate this, consider a real-life scenario where you need to reformat dates. Using regular expressions and the ${group} substitution pattern, the following function (ConvertDateFormat) converts between U.S. and European date formats:

String* ConvertDateFormat(String* input, bool USInput)
{
  String* result;

  if (USInput)
  {
    String* regexp1 = S"(?<month>\\d{1,2})-"
                      S"(?<day>\\d{1,2})-"
                      S"(?<year>\\d{2,4})";
    String* regexp2 = S"${day}-${month}-${year}";
    result = Regex::Replace(input, regexp1, regexp2);
  }
  else
  {
    String* regexp1 = S"(?<day>\\d{1,2})-"
                      S"(?<month>\\d{1,2})-"
                      S"(?<year>\\d{2,4})";
    String* regexp2 = S"${month}-${day}-${year}";
    result = Regex::Replace(input, regexp1, regexp2);
  }

  return result;
}

While far from being an all-encompassing function, the ConvertDateFormat function should show you how easy it is to use groups as replacement text. As you can see, the first regular expression built—regexp1—is the following:

(?<month>\d{1,2})-(?<day>\d{1,2})-(?<year>\d{2,4})
This causes the parser to create three distinct groups: month, day, and year. The month and day groups each match one or two digits, the year group matches two to four digits, and the groups are separated by hyphens. The second expression (regexp2) then uses the groups defined in the first expression to reorder the date's components. Finally, the Regex::Replace method is called and passed the input string (the unconverted date) and the two expressions. Assuming you passed a date such as "8-11-64", the returned result would be the expected transposed value of "11-8-64".
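
The grouping described above can be seen in isolation with a small standard C++ sketch. std::regex has no named groups, so indices 1 to 3 stand in for the three named groups here; the DateParts struct and the find_date function are hypothetical names introduced only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the three captured groups.
struct DateParts
{
    std::string first;   // month or day, depending on the input format
    std::string second;  // day or month
    std::string year;
};

// Returns true and fills `parts` if a hyphen-separated date occurs in `input`.
bool find_date(const std::string& input, DateParts& parts)
{
    // Same pattern as regexp1, with unnamed (numbered) groups.
    static const std::regex date(R"((\d{1,2})-(\d{1,2})-(\d{2,4}))");

    std::smatch m;
    if (!std::regex_search(input, m, date))
        return false;

    // m[0] is the whole match; m[1]..m[3] are the captured groups.
    parts = { m[1].str(), m[2].str(), m[3].str() };
    return true;
}
```

Because the pattern is not anchored, the date is found anywhere in the input, for example in "born 8-11-64".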

You might also note that technically the function doesn't need a boolean value because, regardless of what you pass as the second parameter, the first two sets of digits (whether they represent month and day or day and month) are going to be swapped. However, I coded it like this simply to make the processing logic more obvious. Having said that, look at how you could change the code to use the $n substitution pattern and drop the conditional logic:

String* ConvertDateFormat2(String* input)
{
  String* regexp1 = S"(?<first>\\d{1,2})-"
                    S"(?<second>\\d{1,2})-"
                    S"(?<year>\\d{2,4})";
  String* regexp2 = S"$2-$1-${year}";

  return Regex::Replace(input, regexp1, regexp2);
}
In the ConvertDateFormat2 function, I've used the more generic group names first and second, as I don't know which group represents month and which represents day. The regexp2 variable then specifies a substitution pattern of $2-$1-${year}, which tells the parser to replace the found text with the second group, a hyphen, the first group, another hyphen, and then the group named year. (Named groups are also assigned index numbers; here first is group 1 and second is group 2.) Obviously, I could have used the group names again (first and second), but I wanted to show you how to use the group index value.
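
The same index-based swap can be sketched in standard C++, assuming std::regex in place of the .NET Regex class (the function name convert_date_format is hypothetical). Because std::regex does not support named groups, the year is referenced as $3 rather than ${year}.

```cpp
#include <regex>
#include <string>

// Hypothetical standard-C++ counterpart of ConvertDateFormat2.
std::string convert_date_format(const std::string& input)
{
    // Three numbered groups: first, second, and year.
    static const std::regex date(R"((\d{1,2})-(\d{1,2})-(\d{2,4}))");

    // Swap the first two groups and keep the year where it is.
    return std::regex_replace(input, date, "$2-$1-$3");
}
```

Since the pattern is not anchored, every date in a body of text is converted: "From 8-11-64 to 12-25-2004." becomes "From 11-8-64 to 25-12-2004.". Applying the function twice returns the original text.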

Looking Ahead

While intentionally simple, the examples presented in this column, along with substitution patterns listed in Table 1, should show you how you can easily introduce powerful search-and-replace functionality in your application. In the next—and final—column on using regular expressions, you'll see an extremely complex—and frequently requested—regular expression that enables you to parse a body of text for virtually any email address format.

About the Author

Tom Archer owns his own training company, Archer Consulting Group, which specializes in educating and mentoring .NET programmers and providing project management consulting. If you would like to find out how the Archer Consulting Group can help you reduce development costs, get your software to market faster, and increase product revenue, contact Tom through his Web site.





