Understanding Basic Regular Expression Patterns
When I began writing about the .NET implementation of regular expressions, I intended to focus solely on the .NET classes and not on regular expressions themselves. To that end, I started with a couple of very basic tasks that can be performed with regular expressions: splitting strings with the Regex::Split method and using the Match and MatchCollection classes to enumerate found literals or patterns.
Many readers wrote in to explain that they are new to regular expressions and want at least an article or two that explains some basic patterns, instead of having all the examples search for literals or unexplained patterns. Therefore, this article presents some very basic regular expression patterns to get people who are new to this area off and running.
Searching for Letters and Words
The first thing to note is that regular expression patterns consist of metacharacters, which are simply characters that represent other characters and tell the regular expressions parser how to scan the input string and what to search for.
The following sections illustrate some basic patterns that utilize some of the more commonly used metacharacters. The end of the article presents a table that you can use as a metacharacter quick-reference when creating your own patterns.
Here's a basic string that all the example patterns in this article use:
John and Harry are members of the Borbon club.
Now, suppose you want to locate all the proper nouns in this sentence. Logically, you would know that you need to locate every capitalized word. In terms of a regular expression pattern, that would look like the following:
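[A-Z][a-z]+[ ]*
If you're new to regular expressions, this definitely will look a bit strange at first. The following breakdown explains each of the pattern's components:
[A-Z] matches a single capital letter in the range A through Z.
[a-z]+ matches one or more lowercase letters. The plus sign means "one or more of the preceding item."
[ ]* matches zero or more spaces. The asterisk means "zero or more of the preceding item."
So there you have it. The pattern simply states "Find a capital letter, followed by one or more lowercase letters, followed by any number of spaces (including none)."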
Using this pattern results in the following list of matches:
John Harry Borbon
However, the pattern has two problems. First, it will not catch capitalized abbreviations or acronyms, because it requires exactly one capital letter followed by at least one lowercase letter. Therefore, the current pattern used on the following input value will not yield a match for IBM:
John and Harry are members of the Borbon club at IBM
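The behavior described above can be checked with a short script. This is a minimal sketch that assumes Python's re module rather than the .NET Regex class; for simple character classes like these the two engines should behave the same way. The helper name find_words is hypothetical, and it strips the trailing spaces that the [ ]* part of the pattern captures.

```python
import re

# The article's first pattern: one capital letter, one or more lowercase
# letters, then zero or more spaces.
PATTERN = re.compile(r"[A-Z][a-z]+[ ]*")


def find_words(text):
    """Return every match, with the trailing spaces removed (hypothetical helper)."""
    return [m.group().rstrip() for m in PATTERN.finditer(text)]


first = find_words("John and Harry are members of the Borbon club.")
second = find_words("John and Harry are members of the Borbon club at IBM")

print(first)   # ['John', 'Harry', 'Borbon']
print(second)  # ['John', 'Harry', 'Borbon'] (IBM is not matched)
```

IBM is skipped because each of its capital letters would have to be followed by at least one lowercase letter for the pattern to match.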
To fix this problem, you need to modify the pattern as follows:
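[A-Z][A-Za-z]+[ ]*
What I've inserted into this pattern is a second A-Z range, placed inside the second set of brackets. Ranges listed side by side within brackets are already alternatives, so no separator is needed. The vertical bar (|) acts as an "or" operator only outside brackets; inside them it would match a literal | character. Therefore, the pattern now states: "Find a single capital letter, followed by one or more uppercase or lowercase letters, followed by any number of spaces." The pattern will now yield the following: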
John Harry Borbon IBM
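As a rough check of this result, here is a minimal sketch, again assuming Python's re module rather than the .NET Regex class. It writes the second character class as [A-Za-z], because a vertical bar inside brackets is treated as a literal character; the last few lines illustrate that point. The variable names are illustrative only.

```python
import re

# One capital letter, then one or more letters of either case, then spaces.
PATTERN = re.compile(r"[A-Z][A-Za-z]+[ ]*")

text = "John and Harry are members of the Borbon club at IBM"
matches = [m.group().rstrip() for m in PATTERN.finditer(text)]
print(matches)  # ['John', 'Harry', 'Borbon', 'IBM']

# Inside brackets, a vertical bar is an ordinary character, not an "or":
# the class [A-Z|a-z] accepts the string "|" itself.
bar_is_literal = re.fullmatch(r"[A-Z|a-z]", "|") is not None
print(bar_is_literal)  # True
```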
The second problem is that the pattern doesn't handle single-letter capitalized words, such as the pronoun "I", because it requires at least one more letter after the capital. This is the easiest problem to solve. All you need to do is replace the plus sign in the pattern with an asterisk, so that the parser knows that there may not be a sequence of letters following the capital letter. Using an input value of: "John, Harry and I are members of the Borbon club at IBM.", the pattern would be as follows:
[A-Z][A-Za-z]*[ ]*
It would yield:
John Harry I Borbon IBM
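The effect of swapping the plus sign for an asterisk can be seen by running both variants side by side. This is a minimal sketch assuming Python's re module rather than the .NET Regex class; the names with_plus and with_star are hypothetical.

```python
import re

text = "John, Harry and I are members of the Borbon club at IBM."

# Plus sign: at least one letter must follow the capital, so "I" is skipped.
with_plus = [m.group().rstrip()
             for m in re.finditer(r"[A-Z][A-Za-z]+[ ]*", text)]

# Asterisk: zero or more letters may follow, so a lone capital also matches.
with_star = [m.group().rstrip()
             for m in re.finditer(r"[A-Z][A-Za-z]*[ ]*", text)]

print(with_plus)  # ['John', 'Harry', 'Borbon', 'IBM']
print(with_star)  # ['John', 'Harry', 'I', 'Borbon', 'IBM']
```

Note that the asterisk version matches any lone capital letter, so it would also pick up initials or a capitalized "A" at the start of a sentence.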