A regular expression is a sequence of characters that defines a template for matching a specific pattern of characters. Pattern matching is handy for validating that a string conforms to a given format. A regular expression replaces the tedious job of matching substrings with nested loops and compacts the logic in favour of readability and code clarity. The value of this unobtrusive mechanism is best appreciated through use. In view of these advantages, most modern programming languages have integrated some form of regular expression support into their tool suite. This article focuses on the java.util.regex package, which provides support for pattern matching operations through Java regular expressions.
Structure of Java Regular Expressions
A regular expression is built from normal characters, special constructs, quantifiers, boundary matchers, character classes, and so forth. A normal character sequence, say “earth,” matches as is, whereas “e.+h” matches any sequence that begins with e, continues with one or more of any character (the wildcard dot (.) stands for any character, and the plus (+) after it means one or more), and ends with h.
Symbol | Description |
. | Any character |
+ | One or more |
* | Zero or more |
? | Zero or one |
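As a rough illustration of the symbols in the table, the sketch below checks a few inputs against “e.+h” and two variants of it. The class name QuantifierSketch and the helper fullMatch are hypothetical names chosen for this example; Pattern.matches is the standard library call that tests whether an entire input matches a regular expression.

```java
import java.util.regex.Pattern;

public class QuantifierSketch {
    // Hypothetical helper: true only when the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("e.+h", "earth")); // true: "art" satisfies .+
        System.out.println(fullMatch("e.+h", "eh"));    // false: + needs at least one character
        System.out.println(fullMatch("e.*h", "eh"));    // true: * also accepts zero characters
        System.out.println(fullMatch("e.?h", "ech"));   // true: ? accepts zero or one character
        System.out.println(fullMatch("e.?h", "earth")); // false: "art" is more than one character
    }
}
```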
To match any one of a set of characters, we can use a character class such as [abcd] (within square brackets) to match either a, b, c, or d. This can be inverted: [^abcd] matches any character other than a, b, c, or d. Similarly, to match any vowel, uppercase or lowercase, we may write [aAeEiIoOuU]. To match any lowercase letter, we may write [a-z]; to match any digit 0 through 9, we may write [0-9]. There can be numerous such combinations. The Java documentation provides comprehensive details of the many regular expression constructs.
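The character classes above can be tried out with a small counting sketch. The class CharClassSketch, the helper countMatches, and the sample text "Java 8" are assumptions made for illustration only; the helper simply counts how many times a class matches a single character somewhere in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassSketch {
    // Hypothetical helper: counts the matches of the regex found in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Java 8";
        System.out.println(countMatches("[aAeEiIoOuU]", text)); // 2: the two a's
        System.out.println(countMatches("[a-z]", text));        // 3: a, v, a
        System.out.println(countMatches("[^a-z]", text));       // 3: J, the space, 8
        System.out.println(countMatches("[0-9]", text));        // 1: the digit 8
    }
}
```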
Using java.util.regex
Java provides a simple way to integrate regular expressions into a program. The java.util.regex package contains two classes, Pattern and Matcher; one interface, MatchResult; and one exception, PatternSyntaxException.
Pattern
A Pattern instance is the compiled form of a regular expression string, and it can be reused against more than one input string. This class does not have a public constructor; instead, it has two static methods that return an object of this class.
static Pattern compile(String regex)
static Pattern compile(String regex, int flags)
These factory methods take a string representing a regular expression and compile it into a Pattern object. The flags argument is a bit mask that adjusts how the pattern is compiled, for example by enabling case-insensitive matching, Unicode-aware case folding, Unicode versions of the predefined and POSIX character classes, and so on. Once the Pattern object is created, it is used to call another factory method, matcher(CharSequence input), to create a Matcher object. CharSequence is an interface that represents a readable sequence of char values. Because the String class implements this interface, the matcher method accepts a String object as an argument.
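A minimal sketch of both compile overloads might look like the following. The class CompileFlagsSketch and the helper contains are hypothetical; Pattern.CASE_INSENSITIVE is one of the standard flag constants, and a StringBuilder is passed in once to show that matcher accepts any CharSequence, not only a String.

```java
import java.util.regex.Pattern;

public class CompileFlagsSketch {
    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean contains(Pattern pattern, CharSequence input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        Pattern strict = Pattern.compile("monkey");
        Pattern relaxed = Pattern.compile("monkey", Pattern.CASE_INSENSITIVE);

        String input = "A wise MONKEY";
        System.out.println(contains(strict, input));  // false: case differs
        System.out.println(contains(relaxed, input)); // true: the flag ignores case

        // A compiled Pattern can be reused on other inputs, including non-String ones.
        System.out.println(contains(relaxed, new StringBuilder("Monkey business"))); // true
    }
}
```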
Matcher
A Matcher instance takes a single Pattern and applies it to a target string. This class also has no public constructor; it is created through the matcher(CharSequence input) factory method of the Pattern class. The Matcher class provides numerous methods for a variety of matching operations, such as finding a pattern, replacing matched text with a replacement string, getting the offsets of matches, and the like.
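To give a feel for how the matching operations differ, here is a hedged sketch comparing three standard Matcher methods: matches() requires the whole input to match, lookingAt() requires a match at the start of the input, and find() accepts a match anywhere. The class MatcherModesSketch and the helper describe are hypothetical; a fresh Matcher is created for each call so that one operation does not affect the next.

```java
import java.util.regex.Pattern;

public class MatcherModesSketch {
    // Hypothetical helper: reports the result of the three matching modes.
    static String describe(Pattern p, String input) {
        boolean whole = p.matcher(input).matches();     // entire input must match
        boolean prefix = p.matcher(input).lookingAt();  // match must begin at index 0
        boolean anywhere = p.matcher(input).find();     // match may occur anywhere
        return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("monkey");
        System.out.println(describe(p, "monkey"));          // matches=true lookingAt=true find=true
        System.out.println(describe(p, "monkey business")); // matches=false lookingAt=true find=true
        System.out.println(describe(p, "a wise monkey"));   // matches=false lookingAt=false find=true
    }
}
```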
MatchResult
MatchResult encapsulates the outcome of a successful match. Because Matcher implements the MatchResult interface, the result data is readily available through the Matcher instance itself. The MatchResult interface is also useful for saving the outcome of intermediate successful matches: the toMatchResult() method of Matcher returns a snapshot that is unaffected by later operations on the matcher.
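One way to save intermediate results, sketched under the assumption that we want every match kept for later inspection, is to call toMatchResult() after each successful find(). The class MatchResultSketch, the helper snapshots, and the sample input are hypothetical names and data used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchResultSketch {
    // Hypothetical helper: collects one snapshot per successful match.
    static List<MatchResult> snapshots(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<MatchResult> results = new ArrayList<>();
        while (m.find()) {
            // The snapshot keeps its values even after the matcher moves on.
            results.add(m.toMatchResult());
        }
        return results;
    }

    public static void main(String[] args) {
        for (MatchResult r : snapshots("monkey", "monkey see, monkey do")) {
            System.out.println(r.group() + " [" + r.start() + ", " + r.end() + ")");
        }
        // Prints:
        // monkey [0, 6)
        // monkey [12, 18)
    }
}
```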
PatternSyntaxException
PatternSyntaxException is an unchecked exception thrown when the regular expression is not formed properly.
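Because the exception is unchecked, the compiler does not force us to handle it, but catching it is a reasonable way to validate a regular expression that comes from outside the program. The following is only a sketch; the class SyntaxCheckSketch and the helper isValidRegex are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxCheckSketch {
    // Hypothetical helper: true if the string compiles as a regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getMessage() describes the error and where it occurred in the pattern.
            System.out.println("Bad pattern: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[abcd]")); // true
        System.out.println(isValidRegex("[abcd"));  // false: the character class is never closed
    }
}
```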
To put it in very simple words,
- The Pattern instance forms the regular expression pattern
- The Matcher instance takes the form and applies it to a target string
- PatternSyntaxException is the yeller, on occasion
Though not a comprehensive guide, a few code examples give an idea of how this is used in Java. There are many methods to meet the needs of the developer. Here we’ll try a few of the frequently used ones.
A Quick Code Sample and Snippets
package org.mano.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RGDemo1 {
    public static void main(String[] args) {
        String str = "A wise monkey is a monkey that doesn't monkey with another monkey's monkey.";
        String regex = "monkey";
        // Compile the regular expression, then bind it to the input
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        if (matcher.find())
            System.out.println("found a match");
        else
            System.out.println("a match not found");
    }
}
The idea of the preceding code is to compile the regular expression into a pattern and then create a matcher for the input sequence; the matcher determines whether or not the pattern occurs in the sequence. The find() method scans the input sequence for the next subsequence that matches the pattern and returns true if one is found and false otherwise.
If we want to find all the matches with start index and end offset, we may replace if with the while statement, as follows:
while (matcher.find()) {
    System.out.println(matcher.group() + " is matched from "
        + matcher.start() + " to " + matcher.end());
}
Wildcards and Quantifiers
The real power of regular expressions comes from using wildcards and quantifiers. Suppose we want to match “doesn’t” in the above example. With the help of wildcards and quantifiers, we may write the regular expression in any of the ways shown in the next code snippet.
Each of these patterns matches “doesn’t” in the preceding input, but none is limited to that string only. Give a little thought to the patterns below to find out what other subsequences they match.
// Any one of these alternatives may be used
String regex = "d.+e.+'t";
regex = "d.+n't";
regex = "do.+?t";
regex = "do.*?t";
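To explore what these patterns match, a small sketch such as the following can help. The class GreedyLazySketch and the helper firstMatch are hypothetical. The patterns "d.+t" and "d.+?t" and the input "dot" do not come from the snippet above; they are added here only to contrast a greedy quantifier, which takes as much as it can, with its reluctant form (a trailing ?), which takes as little as it can.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazySketch {
    // Hypothetical helper: returns the first matched text, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String str = "A wise monkey is a monkey that doesn't monkey with another monkey's monkey.";

        // All four patterns from the article find the same text in this input.
        System.out.println(firstMatch("d.+e.+'t", str)); // doesn't
        System.out.println(firstMatch("d.+n't", str));   // doesn't
        System.out.println(firstMatch("do.+?t", str));   // doesn't
        System.out.println(firstMatch("do.*?t", str));   // doesn't

        // Greedy .+ runs on to the last t; reluctant .+? stops at the first one.
        System.out.println(firstMatch("d.+t", str));     // doesn't monkey with anot
        System.out.println(firstMatch("d.+?t", str));    // doesn't

        // The patterns are not interchangeable on other inputs.
        System.out.println(firstMatch("do.*?t", "dot")); // dot
        System.out.println(firstMatch("do.+?t", "dot")); // null: + needs a character before t
    }
}
```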
Replacing a Pattern
If we want to replace every occurrence of a pattern with another string, we may do so with the help of the replaceAll() method, as follows.
package org.mano.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RGDemo1 {
    public static void main(String[] args) {
        String str = "A wise monkey is a monkey that doesn't monkey with another monkey's monkey.";
        String regex = "monkey";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        // Replace every match of the pattern in the input
        System.out.println(matcher.replaceAll("donkey"));
    }
}
Output
A wise donkey is a donkey that doesn't donkey with another donkey's donkey.
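Matcher also offers replaceFirst(), which replaces only the first match and leaves the rest untouched. The sketch below contrasts the two; the class ReplaceSketch, its two helpers, and the shorter sample input are hypothetical and chosen only to keep the output easy to read.

```java
import java.util.regex.Pattern;

public class ReplaceSketch {
    // Hypothetical helper: replaces only the first match of the regex.
    static String replaceFirstOnly(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    // Hypothetical helper: replaces every match of the regex.
    static String replaceEvery(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "monkey see, monkey do";
        System.out.println(replaceFirstOnly("monkey", input, "donkey")); // donkey see, monkey do
        System.out.println(replaceEvery("monkey", input, "donkey"));     // donkey see, donkey do
    }
}
```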
Conclusion
Java’s regular expression support is simple, with straightforward classes to implement it. The key to mastering regular expressions is the syntax of regular expressions itself; Java as a language is just a means of implementation. The idea is similar in other languages such as JavaScript, Perl, C#, and so forth. The implementation may vary slightly from one programming language to another, but the basic concept remains the same. What distinguishes Java’s integration of this mechanism into its tool suite is its simplicity, which has eased the life of the Java programmer on many occasions. It is a tool worth learning for every Java programmer.