December 4, 2016

Exploring the Java String Tokenizer

  • June 27, 2016
  • By Manoj Debnath

String tokenization is the process of breaking a string into several parts, each of which is called a token. For example, in the string "I am going", the discrete parts "I", "am", and "going" are the tokens. Java provides ready-made classes and methods for tokenization. They are handy when each part of a string carries its own meaning in context and must be processed individually, as is common in text processing. In short, tokenization is useful whenever you need to decompose a string into individual parts and work with both the parts and the whole. This article covers the background concepts and their implementation in Java.

String Tokenization with StringTokenizer

What counts as a token can be decided at extraction time; that is, we define the semantics of a token when pulling discrete elements out of a string. For example, in the string "Hi! I am good. How about you?", we may sometimes need to treat each word as a token and at other times treat a group of words collectively as a token. A token, then, is a flexible term and is not necessarily an atomic part, although it may be atomic in a given context. For example, the keywords of a language are atomic in the lexical analysis of that language, but they may be non-atomic and convey a different meaning in another context.

Note: Parsing is an important part of language processing. It resolves a language statement into several parts (tokens) and describes their roles. Tokenization is the process that breaks a statement into tokens, but parsing has a wider scope: it brings out the meaning of the tokens through the syntactic and semantic roles defined by the grammar of the language. A complete description is beyond the scope of this article. Parsing has several uses; one of them is in compiler design.
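
The following example splits the sentence using the default delimiters:

package org.mano.example;

import java.util.StringTokenizer;

public class Main {

   public static void main(String[] args) {

      // With no delimiter argument, whitespace separates the tokens
      StringTokenizer st1 =
         new StringTokenizer("Hi! I am good. How about you?");

      for (int i = 1; st1.hasMoreTokens(); i++)
         System.out.println("Token " + i + ": " + st1.nextToken());

   }
}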

The tokens are:

Token 1: Hi!
Token 2: I
Token 3: am
Token 4: good.
Token 5: How
Token 6: about
Token 7: you?

Now, if we change the code to the following:

StringTokenizer st1 =
   new StringTokenizer("Hi! I am good. How about you?", ".");

The tokens are:

Token 1: Hi! I am good
Token 2:  How about you?

The StringTokenizer class has three constructors (refer to the Java API documentation):

  • StringTokenizer(String str)
  • StringTokenizer(String str, String delim)
  • StringTokenizer(String str, String delim, boolean returnDelims)

When we create a StringTokenizer object with the second constructor, we can define the delimiters that separate the tokens. If we do not provide any, whitespace is taken as the default delimiter. In the preceding example, we used "." (the dot or full stop character) as the delimiter. Note that a delimiter character is not itself returned as a token; it only separates tokens. This can be seen in the output above, where "." is not printed.

To control whether delimiter characters are also returned as tokens, we may use the third constructor. Its boolean argument, returnDelims, enables or disables this: when it is true, each delimiter is returned as a separate token of length one; when it is false, delimiters are skipped. We can also supply a new set of delimiters later, while extracting tokens, with the nextToken(String delim) method.

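The delimiter string may contain several characters, each of which acts as a delimiter. For example, " \n\r\t\f" stands for the space, newline, carriage return, tab, and form feed characters, respectively; the default delimiter set consists of these same characters.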
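
As a minimal sketch of the third constructor and of nextToken(String delim), consider the following. The class name DelimiterDemo, the helper methods collect and labelThenItems, and the sample strings are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterDemo {

   // Collects every token the tokenizer yields, in order.
   static List<String> collect(StringTokenizer st) {
      List<String> tokens = new ArrayList<>();
      while (st.hasMoreTokens())
         tokens.add(st.nextToken());
      return tokens;
   }

   // Reads a label delimited by ':' and then switches the delimiter set.
   static List<String> labelThenItems(String text) {
      List<String> tokens = new ArrayList<>();
      StringTokenizer st = new StringTokenizer(text, ":");
      tokens.add(st.nextToken());
      // The new set keeps ':' so that the separator that ended the
      // label is skipped, and adds ',' to separate the items.
      tokens.add(st.nextToken(":,"));
      // The new delimiter set stays in effect for the remaining calls.
      while (st.hasMoreTokens())
         tokens.add(st.nextToken());
      return tokens;
   }

   public static void main(String[] args) {
      // returnDelims = false: the semicolons are dropped.
      System.out.println(collect(new StringTokenizer("a;b;c", ";", false)));
      // prints [a, b, c]

      // returnDelims = true: each semicolon comes back as its own token.
      System.out.println(collect(new StringTokenizer("a;b;c", ";", true)));
      // prints [a, ;, b, ;, c]

      System.out.println(labelThenItems("colors:red,green,blue"));
      // prints [colors, red, green, blue]
   }
}
```

Note that after a token is returned, the tokenizer is positioned at the delimiter that ended it. If the new delimiter set passed to nextToken(String delim) does not contain that character, it becomes part of the next token.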

Accessing individual tokens is straightforward. StringTokenizer provides six methods for working with tokens.

Method Description
int countTokens()
  Calculates the number of times that this tokenizer's nextToken method can be called before it generates an exception.
boolean hasMoreElements()
  Returns the same value as the hasMoreTokens method.
boolean hasMoreTokens()
  Tests if there are more tokens available from this tokenizer's string.
Object nextElement()
  Returns the same value as the nextToken method, except that its declared return value is an Object rather than a String.
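String nextToken()
  Returns the next token from this string tokenizer.
String nextToken(String delim)
  Changes the delimiter set to the characters in delim, then returns the next token in this string tokenizer's string. The new delimiter set stays in effect for later calls.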

They are quite simple. Refer to the Java API Documentation for details about each of them.
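
The methods listed above can be tried out with a short sketch such as the following. The class name TokenizerMethodsDemo and the helper toArray are hypothetical; only the StringTokenizer calls come from the Java API.

```java
import java.util.StringTokenizer;

class TokenizerMethodsDemo {

   // Copies all remaining tokens into an array sized with countTokens().
   static String[] toArray(StringTokenizer st) {
      String[] result = new String[st.countTokens()];
      for (int i = 0; i < result.length; i++)
         result[i] = st.nextToken();
      return result;
   }

   public static void main(String[] args) {
      StringTokenizer st = new StringTokenizer("Hi! I am good.");

      // countTokens() reports the remaining tokens without consuming any.
      System.out.println("Remaining: " + st.countTokens()); // Remaining: 4
      st.nextToken();
      System.out.println("Remaining: " + st.countTokens()); // Remaining: 3

      // hasMoreElements() and nextElement() mirror hasMoreTokens() and
      // nextToken(), but nextElement() is declared to return Object.
      while (st.hasMoreElements()) {
         Object element = st.nextElement();
         System.out.println(element);
      }
   }
}
```

Because countTokens() does not advance the tokenizer, it is safe to call it first to size an array, as toArray does here.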

String Tokenization with the split Method

The split method defined in the String class is more versatile for tokenization. Here, we can use a regular expression to break a string into tokens.

According to the Java API Documentation:

"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

The preceding example with StringTokenizer can be rewritten with the string split method as follows:

package org.mano.example;

public class Main {

   public static void main(String[] args) {

      // "\\s" is the regular expression for a single whitespace character
      String[] tokens =
         "Hi! I am good. How about you?".split("\\s");

      for (int i = 0; i < tokens.length; i++)
         System.out.println("Token " + (i + 1) + ": " + tokens[i]);

   }
}

Output:

Token 1: Hi!
Token 2: I
Token 3: am
Token 4: good.
Token 5: How
Token 6: about
Token 7: you?
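
Splitting on the period instead gives the same two tokens as the earlier StringTokenizer example:

package org.mano.example;

public class Main {

   public static void main(String[] args) {

      // The dot is escaped because an unescaped "." matches any character
      String[] tokens =
         "Hi! I am good. How about you?".split("\\.");

      for (int i = 0; i < tokens.length; i++)
         System.out.println("Token " + (i + 1) + ": " + tokens[i]);

   }
}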

Output:

Token 1: Hi! I am good
Token 2:  How about you?

To extract the numeric values from the string below, we can change the regular expression so that every run of non-digit characters acts as a delimiter:

String[] tokens =
   "Hi! I am good. 24234y45 64 How 64565 645 about you?"
      .split("[^0-9]+");
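
One detail worth knowing here: because the string begins with non-digit characters, split returns an empty string as the first element of the array. A small sketch that filters it out might look like the following; the class name NumberTokens and the method numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NumberTokens {

   // Splits on runs of non-digit characters and drops empty tokens,
   // such as the one produced when the text starts with a non-digit.
   static List<String> numbers(String text) {
      List<String> result = new ArrayList<>();
      for (String token : text.split("[^0-9]+"))
         if (!token.isEmpty())
            result.add(token);
      return result;
   }

   public static void main(String[] args) {
      String s = "Hi! I am good. 24234y45 64 How 64565 645 about you?";
      System.out.println(numbers(s));
      // prints [24234, 45, 64, 64565, 645]
   }
}
```

Trailing empty strings are removed by split itself, so only the leading one needs this check.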

String Tokenization with Regular Expression

As we can see, the strength of the split method of the String class lies in its ability to use regular expressions. We can use wildcards and quantifiers in a regular expression to match a particular pattern, which then serves as the delimiter for token extraction.

Java has a dedicated package, called java.util.regex, to deal with Regular Expression. This package consists of two classes, Matcher and Pattern, an interface MatchResult, and an exception called PatternSyntaxException.

Regular expressions are an extensive topic in themselves. Let's not deal with it here; instead, let's focus only on the basics of tokenization through the Matcher and Pattern classes. These classes provide great flexibility in tokenization, at the cost of a complexity that deserves an article of its own. A Pattern object represents a compiled regular expression that is used by a Matcher object to perform three kinds of match operations:

  • Match the entire input string against the pattern (the matches method).
  • Match the input string against the pattern, starting at the beginning (the lookingAt method).
  • Scan the input for the next subsequence that matches the pattern (the find method).
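
The three operations correspond to the Matcher methods matches, lookingAt, and find. The following sketch contrasts them on one input; the class name MatcherOperationsDemo, the WORD pattern, and the helper methods are hypothetical and serve only as an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherOperationsDemo {

   // A run of one or more ASCII letters.
   static final Pattern WORD = Pattern.compile("[A-Za-z]+");

   // True only if the whole input is a single word.
   static boolean wholeInputMatches(String s) {
      return WORD.matcher(s).matches();
   }

   // True if the input starts with a word, whatever follows it.
   static boolean prefixMatches(String s) {
      return WORD.matcher(s).lookingAt();
   }

   // Counts the words found by scanning the input with find().
   static int countOccurrences(String s) {
      Matcher m = WORD.matcher(s);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args) {
      String s = "Hi! I am good.";
      System.out.println(wholeInputMatches(s)); // false
      System.out.println(prefixMatches(s));     // true
      System.out.println(countOccurrences(s));  // 4
   }
}
```

For tokenization, find is the operation of interest, because each call advances to the next matching subsequence.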

For tokenization, the Matcher and Pattern classes may be used as follows:

package org.mano.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

   public static void main(String[] args) {
      String s = "Hi! I am good. How about you?";

      // A word, optionally followed by one punctuation character
      Pattern pattern = Pattern.compile("\\w+\\p{Punct}?");
      Matcher matcher = pattern.matcher(s);

      // Each successful find() yields the next token
      for (int i = 1; matcher.find(); i++)
         System.out.println("Token " + i + ": " + matcher.group());

   }
}

Output:

Token 1: Hi!
Token 2: I
Token 3: am
Token 4: good.
Token 5: How
Token 6: about
Token 7: you?

To treat each sentence as a token, the pattern can be changed as follows:

package org.mano.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

   public static void main(String[] args) {
      String s = "Hi! I am good. How about you?";

      // An uppercase letter, then any characters other than '.' or '?',
      // then a closing '.', '!' or '?'
      Pattern pattern = Pattern.compile("([A-Z][^\\.?]*[\\.!?])");
      Matcher matcher = pattern.matcher(s);

      for (int i = 1; matcher.find(); i++)
         System.out.println("Token " + i + ": " + matcher.group());

   }
}

Output:

Token 1: Hi! I am good.
Token 2: How about you?

Conclusion

String tokenization is a way to break a string into several parts. StringTokenizer is a utility class for extracting tokens from a string. However, the Java API documentation discourages its use in new code and recommends the split method of the String class or the java.util.regex package instead. The split method accepts a regular expression. The java.util.regex package provides two classes dedicated to regular expressions, Pattern and Matcher. Although split itself uses regular expressions, the Pattern and Matcher classes are more convenient for complex expressions; in simple circumstances, the split method is quite sufficient.


Tags: Java, parsing, Java APIs, tokenization, Regular Expression, tokenizer, method, string class, delimiter



