Processing Speech with Java

  • September 26, 2002
  • By Sams Publishing

This is Chapter 12: Processing Speech with Java from the book Java 2 Unleashed, Sixth Edition (ISBN:0-672-32394-X) written by Stephen Potts, Alex Pestrikov, and Mike Kopack, published by Sams Publishing.

In This Chapter

  • Understanding Java Speech

  • Creating and Allocating the Speech Engine

  • Speech Synthesis

  • Speech Recognition

In the 1990s, engineers and programmers got a new dose of our favorite show in Star Trek: The Next Generation. When Captain Picard sat in his chair on the bridge and spoke to the starship's computers, we saw the vision of what the world would be like if voice-driven systems were to become a reality.

Ever since 2001: A Space Odyssey premiered with its talking computer, HAL, the public has been waiting for voice-driven systems to become a reality. Who can forget the computer in WarGames that asked, "Shall we play a game?" Now, after nearly 40 years of experimentation and the expenditure of uncountable sums of money, we are still waiting. That future is still possible, and the good news is that voice-driven systems are becoming more common. Most of us have encountered a voice-driven system that asks us to "press or say one to speak to the appointment desk." The material covered in this chapter will teach you how to write systems that can respond to the spoken word as these systems do.

In this chapter, we will learn about getting computers to accept sounds as inputs and provide them to us as outputs. To do this, we will first learn how to get a computer to speak to us. We will also learn how a computer can be made to understand our words and react to them.

Understanding Java Speech

Speech is such a common subject that whenever we bring it up as a topic of conversation, our friends look at us as if we are a little strange. We all speak, and none of us can remember a time when we didn't.

There is a lot to know about phonetics and language, and we need to understand more than a little of it if we are going to become good speech programmers. Although it is true that the software engineers at the tool vendors do much of the hard work associated with programming speech, we will not be able to take advantage of the tools they provide unless we understand the subject.

Computerized speech can be divided into two categories: speech recognition and speech synthesis. Speech recognition is the art of turning analog sound waves captured by a microphone into words. These words are either commands to be acted on or data to be stored, displayed, manipulated, and so on.

Speech synthesis is the art of taking the written word and transforming it into analog waveforms that can be heard over a speaker. When looked at in this light, the problem of teaching a computer how to listen and talk seems a little daunting.

Take yourself back mentally to the mid-1970s. Imagine for a moment that you have been given the task of taking one of the mainframe computers in the data center and teaching it to read out loud. You sit down at the computer terminal, and what do you type? What language do you write this system in? What speakers will you use? Where will you plug them in? What kind of pronunciation will you use? How will you generate the waveforms for each syllable? How will you time the output? How will punctuation be handled?

When we think about these issues, we are glad that we are living now. In the 1970s, this entire subject was in the hands of the researchers. Even now, while some commercial applications of speech synthesis have been written, it is still a fertile subject for Ph.D. and graduate students.

Our job as application programmers is much easier than it would have been in the 1970s because of two developments. The first is the creation and marketing of commercial speech products. The second is the creation of the Java Speech API by Sun, in conjunction with a number of other companies interested in this subject.

The Java Speech API is a set of abstract classes and interfaces that represent a Java programmer's view of a speech engine, but it makes no assumptions about the underlying implementation of the engine. This engine can be either a hardware or a software solution (or a hybrid of the two). It can be implemented locally or on a server. It can be written in Java or in any other language that can be called from Java. In fact, different compliant engines can even have different capabilities. One of them might have the capability to learn your speech patterns, whereas another engine might choose not to implement this. It is the engine's responsibility to handle any situations that it does not support in a graceful manner.

The Java Speech API is slightly different from other Java Extensions in that Sun Microsystems doesn't provide a reference implementation for it. Instead, Sun provides a list of third-party vendors who have products that provide a Java Speech API interface. The official Java Speech Web site (http://java.sun.com/products/java-media/speech) lists the following companies as providers of Java-compatible speech products:

  • FreeTTS—This is an open source speech synthesizer written entirely in Java.

  • IBM's Speech for Java—This implementation is based on the IBM ViaVoice product. You must purchase a copy of ViaVoice for this product to work.

  • The Cloud Garden—This implementation will work with any speech engine that is based on Microsoft Speech API (SAPI) version 5.

  • Lernout and Hauspie's TTS for Java Speech API—This package runs on Sun and provides a number of advanced features.

  • Conversa Web 3.0—This product is a speech-enabled Web browser.

  • Festival—This product comes from Scotland and is Unix based. It supports a number of programming interfaces in addition to Java.

Note - This list is subject to change as new products are introduced. You should consult the Web site for the latest version.

The examples in this chapter will be created using IBM's Speech for Java, which runs on top of IBM's ViaVoice product. This product was selected for this book because it is widely available both in retail stores and over the Web. Careful evaluation of the preceding products should be undertaken before choosing your production speech engine vendor. Each of them offers a different feature set, platform support, and pricing structure. Figure 12.1 shows the architecture of the Java Speech API layered on top of the IBM products.

Figure 12.1
The Java Speech architecture.

The Java Speech API is really just a layer that sits atop the speech processing software provided by the vendor, which is IBM in our case. (IBM's ViaVoice is a commercial software product that allows users to dictate letters and figures using products such as Microsoft Office.) ViaVoice, in turn, communicates with the sound card via drivers provided by the sound card manufacturer. It receives input from the microphone port on the sound card, and it provides output via the speaker port on the same sound card.

The Java Speech API matters because of the power it provides. It enables Java applications to make calls to speech processing objects as easily as they make calls to an RMI server. This brings the portability of Java into play and shortens the learning curve.

The magic of the application is really in the ViaVoice layer. In this layer, the heavy lifting is performed. We will look at what this magic entails in the sections that follow.

Creating and Allocating the Speech Engine

Let's work on an example to help us understand how to get speech to work on your machine. The first step that we need to take is to install one of the commercial products listed previously and get it running. For these examples, we installed ViaVoice and IBM Speech for Java.

Follow these steps if you are using ViaVoice on a PC. If you are on another platform, or if you are using a product other than ViaVoice, follow the vendor's directions on how to install it.

  1. Install ViaVoice according to the instructions that ship with the product.

  2. Download and install IBM's Speech for Java. It is free from the IBM Web site at http://www.alphaworks.ibm.com/tech/speech.

  3. Follow the directions in Speech for Java about setting the classpath and path properly.

  4. Run the test programs that ship with the IBM products to make sure that the setup is correct. If they work, you are ready to run an example.
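
Before running the vendor's test programs, one quick way to confirm that the classpath from step 3 took effect is to ask the JVM to locate one of the Java Speech classes by name. The following is a minimal sketch that uses only the standard library. The ClasspathCheck class and its isOnClasspath() helper are invented for illustration; javax.speech.Central is the factory class used later in this chapter.

```java
/**
 * Illustrative helper (not part of the Java Speech API): reports whether
 * a class can be located on the current classpath.
 */
final class ClasspathCheck {

    /** Returns true if the named class can be found, false otherwise. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "javax.speech.Central";
        if (isOnClasspath(name)) {
            System.out.println(name + " was found; the classpath looks correct.");
        } else {
            System.out.println(name + " was not found; recheck the classpath setting.");
        }
    }
}
```

A positive result only shows that the API classes are visible to Java. It says nothing about whether the underlying speech engine is installed and working, so the vendor's test programs are still worth running.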

The first step in writing an example is the creation and allocation of the Speech Engine itself. The example shown in Listing 12.1 does just that.

Listing 12.1 The HelloUnleashedReader Class

/*
 * HelloUnleashedReader.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloUnleashedReader
{
  public static void main(String args[])
  {
    try
    {
      // Create a synthesizer for English
      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

      // Gather the engine's resources and start it running
      synth.allocate();
      synth.resume();

      // Speak the "Hello, Unleashed Reader" string
      synth.speakPlainText("Hello, Unleashed Reader!", null);
      System.out.println(
        "You should be hearing Hello, Unleashed Reader now");

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

      // release the resources
      synth.deallocate();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

First, we need to look at the packages that are used to drive the speech engine:

import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

The javax.speech package contains the Engine interface. The interfaces that extend it, such as Synthesizer and Recognizer, are the primary speech processing types in Java.

The javax.speech.synthesis package contains the interfaces and classes that pertain to "reading" speech aloud.

The java.util.Locale class deals with internationalization. Because spoken language is one of the most locale-dependent activities that humans participate in, very little can be done in this area without a locale specified. If you do not set a locale in your program, the default locale for your machine will be used. See Chapter 25, "Java Internationalization," for details on this class.
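
To make the locale discussion concrete, here is a small sketch that uses only java.util.Locale. The LocaleDemo class and its effectiveLocale() helper are invented for illustration; they simply show the "use what was requested, otherwise use the machine default" rule described above.

```java
import java.util.Locale;

/** Illustrative only: shows a requested locale next to the machine default. */
final class LocaleDemo {

    /** Returns the given locale, or the machine default when none is supplied. */
    static Locale effectiveLocale(Locale requested) {
        return (requested != null) ? requested : Locale.getDefault();
    }

    public static void main(String[] args) {
        // Locale.ENGLISH names a language only, with no country
        Locale english = effectiveLocale(Locale.ENGLISH);
        System.out.println("Language code: " + english.getLanguage());
        System.out.println("Country is empty: " + english.getCountry().isEmpty());

        // Locale.US adds a country to the language
        System.out.println("US locale: " + Locale.US);

        // With no locale requested, the machine default is used
        System.out.println("Default locale: " + effectiveLocale(null));
    }
}
```

The last line of output depends on the machine that runs it, which is exactly why a speech program that needs a particular language should name its locale explicitly.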

Before we can do anything, we need to create an object of type Synthesizer. Synthesizer is an interface, so we will be handed an object that implements it. Because the Java program has no way of knowing what the class of the actual object will be, it assigns this object to an interface handle:

      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

The Central class is one of those magic Factory classes that works behind the scenes to create an object for you that fits your specification. See Chapter 21, "Design Patterns in Java," for an explanation of the Factory pattern.

In this case, we are asking the Central class to create a Synthesizer that speaks English and to give us a handle to it. It is possible to create a synthesizer without specifying a locale. In that case, if an engine is already running on this computer, it will be selected as the default. Otherwise, a behind-the-scenes call is made to java.util.Locale.getDefault() to obtain the locale for the computer running the engine. The best engine for this locale will be created.
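
The Factory idea and the fallback to the default locale can be pictured with a small stand-in. This is a sketch of the pattern only, not of how Central is actually implemented: the Voice interface, the VoiceFactory class, and the two registered languages are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for an engine handle; callers see only this interface. */
interface Voice {
    String describe();
}

/** Hypothetical factory, loosely modeled on the role that Central plays. */
final class VoiceFactory {
    private static final Map<String, Voice> BY_LANGUAGE = new LinkedHashMap<>();

    static {
        // Two made-up implementations, registered by language code
        BY_LANGUAGE.put("en", () -> "English voice");
        BY_LANGUAGE.put("fr", () -> "French voice");
    }

    /**
     * Returns a voice for the requested locale. A null request falls back
     * to the machine's default locale. The result is empty if nothing matches.
     */
    static Optional<Voice> createVoice(Locale requested) {
        Locale locale = (requested != null) ? requested : Locale.getDefault();
        return Optional.ofNullable(BY_LANGUAGE.get(locale.getLanguage()));
    }

    public static void main(String[] args) {
        // The caller never names a concrete class, only the interface
        Optional<Voice> voice = createVoice(Locale.ENGLISH);
        System.out.println(voice.map(Voice::describe).orElse("no matching voice"));
    }
}
```

The point of the design is that the calling code depends only on the interface, so a different implementation can be substituted without changing the caller.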

Armed with a handle, we are ready to synthesize. Before we can use our new toy, however, we have to allocate it. The allocate() method gathers all the resources needed to get the synthesizer running. Engines are not automatically allocated when they are created, for two reasons. The first is that allocating an engine is a very expensive activity (in processing terms). To improve performance, you have the opportunity to allocate the engine in a separate thread while your program does other work in the foreground. The second is that the engine needs exclusive access to certain resources, such as the microphone (for recognizer applications). It is wise to let the program allocate and deallocate when it chooses in order to avoid contention:

      synth.allocate();
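
Allocating in a separate thread, as mentioned above, might look something like the following. This is a sketch under stated assumptions: SlowEngine is an invented stand-in whose allocate() method just sleeps, and BackgroundAllocation is a hypothetical helper. With a real engine, the worker thread would call the synthesizer's allocate() method instead.

```java
/** Hypothetical stand-in for an engine whose allocate() call is slow. */
final class SlowEngine {
    private volatile boolean allocated = false;

    void allocate() {
        try {
            Thread.sleep(200); // simulate expensive resource gathering
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        allocated = true;
    }

    boolean isAllocated() {
        return allocated;
    }
}

/** Illustrative helper: runs the slow allocation off the main thread. */
final class BackgroundAllocation {

    /** Starts allocation on a worker thread and returns that thread. */
    static Thread allocateInBackground(SlowEngine engine) {
        Thread worker = new Thread(engine::allocate, "engine-allocator");
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        SlowEngine engine = new SlowEngine();
        Thread worker = allocateInBackground(engine);

        // The foreground is free to do other start-up work here
        System.out.println("Doing other work while the engine allocates...");

        // Block only at the point where the engine is actually needed
        worker.join();
        System.out.println("Engine allocated: " + engine.isAllocated());
    }
}
```

The foreground thread pays the allocation cost only if it reaches the join() call before the worker has finished.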


The resume() method is the complement of the pause() method. Because we are ready to play the message now, we issue the resume() command:

      synth.resume();


The speakPlainText() command tells the synthesizer to say something. In this case, we tell it to say "Hello, Unleashed Reader". We will discuss the specifics of the synthesis commands when we deal with speech in a later section of this chapter.

      // Speak the "Hello, Unleashed Reader" string
      synth.speakPlainText("Hello, Unleashed Reader!", null);

Next, we have to tell the synthesizer to wait until the queue is empty before running the rest of our program. This ensures that the resources will be held long enough to finish.

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
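
Why the wait is needed becomes clearer with a toy model. In the sketch below, speak requests return immediately and a worker thread processes them later, so the caller must block until the queue drains or the program could exit before the text has been spoken. SpeechQueue and all of its methods are invented for illustration and are not part of the Java Speech API.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for a synthesizer's queue of pending text. */
final class SpeechQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final StringBuilder spoken = new StringBuilder();

    /** Creates a queue and starts the worker thread that drains it. */
    static SpeechQueue start() {
        SpeechQueue queue = new SpeechQueue();
        Thread worker = new Thread(queue::drain, "speech-worker");
        worker.setDaemon(true); // do not keep the JVM alive on its own
        worker.start();
        return queue;
    }

    /** Queues text and returns immediately, like a speak request. */
    synchronized void speak(String text) {
        pending.add(text);
        notifyAll();
    }

    /** Blocks the caller until every queued item has been processed. */
    synchronized void waitUntilEmpty() throws InterruptedException {
        while (!pending.isEmpty()) {
            wait();
        }
    }

    /** Returns everything that has been "spoken" so far. */
    synchronized String spokenSoFar() {
        return spoken.toString();
    }

    /** Worker loop: processes one item at a time until interrupted. */
    private void drain() {
        try {
            while (true) {
                String next;
                synchronized (this) {
                    while (pending.isEmpty()) {
                        wait();
                    }
                    next = pending.peek(); // stays queued while being spoken
                }
                Thread.sleep(50); // simulate the time taken to speak
                synchronized (this) {
                    spoken.append(next);
                    pending.remove();
                    notifyAll(); // wake anyone waiting for an empty queue
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SpeechQueue queue = SpeechQueue.start();
        queue.speak("Hello, ");
        queue.speak("Unleashed Reader!");
        queue.waitUntilEmpty(); // without this, main could finish first
        System.out.println(queue.spokenSoFar());
    }
}
```

An item remains in the queue while it is being processed, so the queue only counts as empty once the last item has finished.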

The deallocate() method releases all the resources reserved for this program's use when the allocate() method was called:

      synth.deallocate();


It is wise to deallocate the resources and allocate them again if you might not use the synthesizer for an extended period of time or if some of the exclusively held resources might be needed elsewhere.
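
One way to make sure that exclusively held resources are always given back is to pair each allocation with a release in a finally block. The sketch below shows the pattern with an invented stand-in: TrackedEngine and ReleaseDemo are hypothetical names, and the stand-in merely records whether it is currently allocated.

```java
/** Hypothetical stand-in that tracks whether its resources are held. */
final class TrackedEngine {
    private boolean allocated;

    void allocate() {
        allocated = true;
    }

    void deallocate() {
        allocated = false;
    }

    boolean isAllocated() {
        return allocated;
    }

    void speak(String text) {
        if (!allocated) {
            throw new IllegalStateException("engine is not allocated");
        }
        if (text == null) {
            throw new IllegalArgumentException("nothing to speak");
        }
        System.out.println("Speaking: " + text);
    }
}

/** Illustrative helper: allocate, use, and always release. */
final class ReleaseDemo {

    /** Allocates, speaks, and releases, even if speaking fails. */
    static void speakOnce(TrackedEngine engine, String text) {
        engine.allocate();
        try {
            engine.speak(text);
        } finally {
            engine.deallocate(); // runs on both the normal and the error path
        }
    }

    public static void main(String[] args) {
        TrackedEngine engine = new TrackedEngine();
        speakOnce(engine, "Hello, Unleashed Reader!");
        System.out.println("Still allocated? " + engine.isAllocated());
    }
}
```

Because allocation is expensive, this allocate-per-call style suits programs that speak only occasionally. A program that speaks often would more likely allocate once and deallocate when it is finished.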
