Regular expressions are not new; they were introduced years ago through tools such as grep or awk. Perl boosted their popularity by offering direct access to regular expressions from a scripting language.
Until recently, Java programmers had to turn to third-party libraries such as IBM’s regex for Java. The newest JDK, JDK 1.4, makes regular expressions into first-class citizens.
What Is a Regular Expression?
Regular expressions are ideal for text manipulation. In a nutshell, regular expressions describe the format of strings. In its simplest form, a regular expression is the text that you need to match. For example, the regex “ABC” will match the string ABC but not the string DEF.
You can also use metacharacters, such as the star (*), to match more strings. The star means "zero or more occurrences of the preceding element." For example, the regular expression “A*B” will match the strings B, AB, AAB, AAAB, or any other string made of zero or more As followed by a single B.
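As a minimal sketch of how the star behaves, the following program tests a few strings against “A*B” with Pattern.matches(), which succeeds only when the entire string fits the pattern. The class name StarQuantifierDemo and the helper matchesAStarB() are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK.
class StarQuantifierDemo {
    // Returns true when the whole input matches "A*B":
    // zero or more As followed by exactly one B.
    static boolean matchesAStarB(String input) {
        return Pattern.matches("A*B", input);
    }

    public static void main(String[] args) {
        String[] samples = { "B", "AB", "AAAB", "ABB", "ABC" };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + matchesAStarB(samples[i]));
        }
    }
}
```

The first three samples match. ABB and ABC do not, because Pattern.matches() requires the pattern to cover the whole string and nothing may follow the single B.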
Regular expressions have two key applications: validating input and parsing it. In the first case, you use regular expressions to test whether a string matches a given pattern. For example, you can validate that a phone number is of the form (000) 000-0000, where each “0” stands for a digit.
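The phone number check could be sketched as follows. The regular expression is one possible way to express the (000) 000-0000 format, and the class name PhoneValidator and the method isValidPhone() are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical validator, shown only to illustrate the idea.
class PhoneValidator {
    // \( and \) match literal parentheses; \d{3} matches exactly three digits.
    private static final Pattern PHONE =
        Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}");

    // Returns true only if the entire string has the form (000) 000-0000.
    static boolean isValidPhone(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("(555) 123-4567")); // well formed
        System.out.println(isValidPhone("555-123-4567"));   // no parentheses
        System.out.println(isValidPhone("(555) 123-456"));  // one digit short
    }
}
```

Compiling the pattern once into a constant avoids recompiling it on every call, which matters if you validate many strings.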
Regular expressions are also useful to parse input into its constituents. For example, the regular expression “(.*):(.*)” matches the key/value pairs that are so common in Internet headers. The parentheses capture the text on each side of the colon, so you can use the regular expression to retrieve the values from e-mail or HTTP messages.
Regular Expressions and Java
In Java, regular expressions appear in a new package, java.util.regex. It features a simple and very clean interface. You only need to learn about two new classes (Pattern and Matcher) and one interface (CharSequence). Pattern holds the compiled form of a regular expression, and its matcher() method creates a Matcher for a given input. The latter applies the regular expression to strings or, more correctly, CharSequences.
Pattern has no public constructor. You have to call the static compile() method, as demonstrated in the following listing:
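import java.util.regex.*;

public class SampleRegex {
    public static void main(String[] params) {
        if (params.length == 0) {
            System.out.println("Usage: java SampleRegex \"key: value\"");
            return;
        }
        // Pattern has no public constructor; compile() builds the pattern.
        Pattern pattern = Pattern.compile("(.*):(.*)");
        // matches() succeeds only if the whole argument fits the pattern.
        Matcher matcher = pattern.matcher(params[0]);
        if (matcher.matches()) {
            // Groups 1 and 2 hold the text captured by the two pairs of parentheses.
            System.out.println("Key: " + matcher.group(1).trim());
            System.out.println("Value: " + matcher.group(2).trim());
        } else {
            System.out.println("No match");
        }
    }
}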
The listing uses the regular expression introduced previously. For example, calling the application as:
java SampleRegex "domain: developer.com"
will print:
Key: domain
Value: developer.com
This illustrates using a regular expression to parse the input. However, if it is called with:
java SampleRegex "developer.com"
it will print:
No match
This shows that regular expressions are useful for validation, too.
CharSequence
I mentioned CharSequence in the previous section. CharSequence is a new interface, defined in the java.lang package, that represents a readable sequence of characters. String and StringBuffer have been updated to implement CharSequence.
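Because Pattern.matcher() accepts any CharSequence, the same code can be applied to different kinds of character data. The sketch below assumes the key/value pattern from the earlier listing; the class name CharSequenceDemo and the helper valueOf() are made up for this illustration.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK.
class CharSequenceDemo {
    private static final Pattern HEADER = Pattern.compile("(.*):(.*)");

    // Accepts any CharSequence and returns the trimmed value part,
    // or null when the input is not a key/value pair.
    static String valueOf(CharSequence header) {
        Matcher matcher = HEADER.matcher(header);
        return matcher.matches() ? matcher.group(2).trim() : null;
    }

    public static void main(String[] args) {
        String text = "domain: developer.com";
        StringBuffer buffer = new StringBuffer("domain: ").append("developer.com");
        CharBuffer chars = CharBuffer.wrap(text);

        // All three inputs go through the same method unchanged.
        System.out.println(valueOf(text));
        System.out.println(valueOf(buffer));
        System.out.println(valueOf(chars));
    }
}
```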
You also can wrap files as CharSequence through the new I/O package (java.nio), also introduced in JDK 1.4; its CharBuffer class implements CharSequence. Unfortunately, the regular expression API does not work directly with the traditional I/O API: a Matcher cannot read from an InputStream or a Reader. In practice, that’s a real limitation because most Java applications depend on traditional Java I/O. The trick, if you have to parse InputStreams, is to read the input line by line into strings and parse each string.
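The line-by-line trick might look like the following sketch, written for a current JDK. It assumes the headers end at the first empty line, as in e-mail and HTTP messages, and it reads from an in-memory stream so that it runs on its own. The class name HeaderStreamParser and the method parseHeaders() are invented for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, shown only to illustrate the technique.
class HeaderStreamParser {
    private static final Pattern HEADER = Pattern.compile("(.*):(.*)");

    // Reads the stream one line at a time and applies the pattern to each
    // line. Stops at the first empty line (assumed end of the headers).
    static Map<String, String> parseHeaders(InputStream in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.length() == 0) {
                break; // blank line: the message body follows
            }
            Matcher matcher = HEADER.matcher(line);
            if (matcher.matches()) {
                headers.put(matcher.group(1).trim(), matcher.group(2).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String message = "Host: developer.com\r\n"
                       + "Content-Type: text/plain\r\n"
                       + "\r\n"
                       + "note: this line belongs to the body\r\n";
        InputStream in =
            new ByteArrayInputStream(message.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseHeaders(in));
    }
}
```

Because the pattern is greedy, a line with several colons is split at the last one; a production parser would likely need a stricter expression for the key.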
Conclusion
Java programmers should no longer be jealous of their Perl colleagues. With the new JDK, regular expressions are an integral part of the Java platform. Furthermore, the new API is surprisingly simple and clean.
About the Author
Benoît Marchal is a Belgian developer and writer. He is the author of XML by Example (two editions), Applied XML Solutions, and Java Web Services. There’s more on this topic at marchal.com.