
Processing Speech with Java


September 26, 2002


This is Chapter 12: Processing Speech with Java from the book Java 2 Unleashed, Sixth Edition (ISBN:0-672-32394-X) written by Stephen Potts, Alex Pestrikov, and Mike Kopack, published by Sams Publishing.


In This Chapter

  • Understanding Java Speech

  • Creating and Allocating the Speech Engine

  • Speech Synthesis

  • Speech Recognition

In the 1990s, engineers and programmers got a new dose of our favorite show in Star Trek: The Next Generation. When Captain Picard sat in his chair on the bridge and spoke to the starship's computers, we glimpsed what the world might be like if voice-driven systems became a reality.

Ever since 2001: A Space Odyssey premiered with its talking computer, Hal, the public has been waiting for voice-driven systems to become a reality. Who can forget the computer in War Games that said "Shall we play a game?" Now, after nearly 40 years of experimentation and the expenditure of uncountable sums of money, we are still waiting. That future is still possible, but the good news is that voice-driven systems are becoming more common. Most of us have encountered a voice-driven system that asks us to "press or say one to speak to the appointment desk." The material covered in this chapter will teach you how to write systems that can respond to the spoken word as these systems do.

In this chapter, we will learn about getting computers to accept sounds as inputs and provide them to us as outputs. To do this, we will first learn how to get a computer to speak to us. We will also learn how a computer can be made to understand our words and react to them.

Understanding Java Speech

Speech is such a common subject that whenever we bring it up as a topic of conversation, our friends look at us as if we are a little strange. We all speak, and none of us can remember a time when we didn't.

There is a lot to know about phonetics and language, and we need to understand more than a little of it if we are going to become good speech programmers. Although it is true that the software engineers at the tool vendors do much of the hard work associated with programming speech, we will not be able to take advantage of the tools they provide unless we understand the subject.

Computerized speech can be divided into two categories: speech recognition and speech synthesis. Speech recognition is the art of turning analog sound waves captured by a microphone into words. These words are either commands to be acted on or data to be stored, displayed, manipulated, and so on.

Speech synthesis is the art of taking the written word and transforming it into analog waveforms that can be heard over a speaker. When looked at in this light, the problem of teaching a computer how to listen and talk seems a little daunting.

Take yourself back mentally to the mid 1970s. Imagine for a moment that you have been given the task of taking one of the mainframe computers in the data center and teaching it to read out loud. You sit down at the computer terminal, and what do you type? What language do you write this system in? What speakers will you use? Where will you plug them in? What kind of pronunciation will you use? How will you generate the waveforms for each syllable? How will you time the output? How will punctuation be handled?

When we think about these issues, we are glad that we are living now. In the 1970s, this entire subject was in the hands of the researchers. Even now, while some commercial applications of speech synthesis have been written, it is still a fertile subject for Ph.D. and graduate students.

Our job as application programmers is much easier than it would have been in the 1970s because of two developments. The first is the creation and marketing of commercial speech products. The second is the creation of the Java Speech API by Sun, in conjunction with a number of other companies interested in this subject.

The Java Speech API is a set of abstract classes and interfaces that represent a Java programmer's view of a speech engine, but it makes no assumptions about the underlying implementation of the engine. This engine can be either a hardware or a software solution (or a hybrid of the two). It can be implemented locally or on a server. It can be written in Java or in any other language that can be called from Java. In fact, different compliant engines can even have different capabilities. One of them might have the capability to learn your speech patterns, whereas another engine might choose not to implement this. It is the engine's responsibility to handle any situations that it does not support in a graceful manner.
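Later in the chapter we will create engines through the javax.speech.Central class. As a preview, the following is a minimal sketch (based on the Central.availableSynthesizers() query described in the Java Speech API documentation; the exact output depends entirely on which engines are installed on your machine) that lists the synthesizer modes available for English:

import javax.speech.Central;
import javax.speech.EngineList;
import javax.speech.EngineModeDesc;
import javax.speech.synthesis.SynthesizerModeDesc;
import java.util.Locale;

public class ListEngines
{
  public static void main(String args[])
  {
    // Ask Central for every installed synthesizer mode that matches
    // the descriptor (here, any English-speaking mode)
    EngineList list = Central.availableSynthesizers(
        new SynthesizerModeDesc(Locale.ENGLISH));

    for (int i = 0; i < list.size(); i++)
    {
      EngineModeDesc desc = (EngineModeDesc) list.elementAt(i);
      System.out.println("Engine: " + desc.getEngineName()
          + ", mode: " + desc.getModeName()
          + ", locale: " + desc.getLocale());
    }
  }
}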

The Java Speech API is slightly different from other Java Extensions in that Sun Microsystems doesn't provide a reference implementation for it. Instead, Sun provides a list of third-party vendors who have products that provide a Java Speech API interface. The official Java Speech Web site (http://java.sun.com/products/java-media/speech) lists the following companies as providers of Java-compatible speech products:

  • FreeTTS—This is an open source speech synthesizer written entirely in Java.

  • IBM's Speech for Java—This implementation is based on the IBM ViaVoice product. You must purchase a copy of ViaVoice for this product to work.

  • The Cloud Garden—This implementation will work with any speech engine that is based on Microsoft Speech API (SAPI) version 5.

  • Lernout and Hauspie's TTS for Java Speech API—This package runs on Sun and provides a number of advanced features.

  • Conversa Web 3.0—This product is a speech-enabled Web browser.

  • Festival—This product comes from Scotland and is Unix based. It supports a number of programming interfaces in addition to Java.


Note - This list is subject to change as new products are introduced. You should consult the Web site for the latest version.


The examples in this chapter will be created using IBM's Speech for Java, which runs on top of IBM's ViaVoice product. This product was selected for this book because it is widely available both in retail stores and over the Web. Careful evaluation of the preceding products should be undertaken before choosing your production speech engine vendor. Each of them offers a different feature set, platform support, and pricing structure. Figure 12.1 shows the architecture of the Java Speech API layered on top of the IBM products.


Figure 12.1
The Java Speech architecture.

The Java Speech API is really just a layer that sits atop the speech processing software provided by the vendor, who is IBM in our case. (IBM's ViaVoice is a commercial software product that allows users to dictate letters and figures using products such as Microsoft Office.) ViaVoice, in turn, communicates with the sound card via drivers provided by the sound card manufacturer. It receives input from the microphone port on the sound card, and it provides output via the speaker port on the same sound card.

The Java Speech API is important because of the power it provides: it enables a Java application to make calls to speech processing objects as easily as it makes calls to an RMI server. This brings the portability of Java into play, as well as a shortened learning curve.

The magic of the application really lies in the ViaVoice layer, where the heavy lifting is performed. We will look at what this magic entails in the sections that follow.

Creating and Allocating the Speech Engine

Let's work on an example to help us understand how to get speech to work on your machine. The first step that we need to take is to install one of the commercial products listed previously and get it running. For these examples, we installed ViaVoice and IBM Speech for Java.

Follow these steps if you are using ViaVoice on a PC. If you are on another platform, or if you are using a product other than ViaVoice, follow the vendor's directions on how to install it.

  1. Install ViaVoice according to the instructions that ship with the product.

  2. Download and install IBM's Speech for Java. It is free from the IBM Web site at http://www.alphaworks.ibm.com/tech/speech.

  3. Follow the directions in Speech for Java about setting the classpath and path properly.

  4. Run the test programs that ship with the IBM products to make sure that the setup is correct. If they work, you are ready to run an example.

The first step in writing an example is the creation and allocation of the Speech Engine itself. The example shown in Listing 12.1 does just that.

Listing 12.1 The HelloUnleashedReader Class

/*
 * HelloUnleashedReader.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloUnleashedReader
{

  public static void main(String args[])
  {
    try
    {
      // Create a synthesizer for English
      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));
      synth.allocate();
      synth.resume();

      // Speak the "Hello, Unleashed Reader" string
      synth.speakPlainText("Hello, Unleashed Reader!", null);
      System.out.println(
        "You should be hearing Hello, Unleashed Reader now");

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

      // release the resources
      synth.deallocate();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

First, we need to look at the packages that are used to drive the speech engine:

import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

The javax.speech package contains the Engine interface. The interfaces that extend it, and the classes that implement them, are the primary speech processing types in Java.

The javax.speech.synthesis package contains the interfaces and classes that pertain to "reading" speech aloud.

The java.util.Locale class deals with internationalization. Because spoken language is one of the most locale-dependent activities that humans participate in, very little can be done in this area without a locale specified. If you do not set a locale in your program, the default locale for your machine will be used. See Chapter 25, "Java Internationalization," for details on this package.

Before we can do anything, we need an object of type Synthesizer. Synthesizer is an interface, so we will be handed an object of some class that implements it. Because our program has no way of knowing what that class will be, it assigns the object to an interface handle:

      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

The Central class is one of those magic Factory classes that works behind the scenes to create an object for you that fits your specification. See Chapter 21, "Design Patterns in Java," for an explanation of the Factory pattern.

In this case, we are asking the Central class to create a Synthesizer that speaks English and to give us a handle to it. It is possible to create a synthesizer without specifying a locale. If you do, and an engine is already running on this computer, it will be selected as the default. Otherwise, a behind-the-scenes call is made to java.util.Locale.getDefault() to obtain the locale for the computer running the engine, and the best engine for that locale will be created.
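A minimal sketch of that default selection looks like the following (the spoken text is our own; passing null in place of a SynthesizerModeDesc asks Central for the best engine it can find for the default locale):

      // Let Central pick the best engine for this machine's default locale
      Synthesizer defaultSynth = Central.createSynthesizer(null);
      defaultSynth.allocate();
      defaultSynth.resume();
      defaultSynth.speakPlainText("Using the default synthesizer.", null);
      defaultSynth.waitEngineState(Synthesizer.QUEUE_EMPTY);
      defaultSynth.deallocate();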

Armed with a handle, we are ready to synthesize. Before we can use our new toy, however, we have to allocate it. The allocate() method gathers all the resources needed to get the synthesizer running. Engines are not automatically allocated when they are created for two reasons. The first reason is that the creation of the engine is a very expensive activity (in processing terms). To improve performance, you have an opportunity to allocate the engine in a separate thread while your program does other work in the foreground. The second reason is that the engine needs exclusive access to certain resources, such as the microphone (for recognizer applications). Letting the program decide when to allocate and deallocate helps it avoid contention for those resources:

      synth.allocate();

The resume() method is the complement of the pause() method. Because we are ready to play the message now, we issue the resume() command:

      synth.resume();

The speakPlainText() command tells the synthesizer to say something. In this case, we tell it to say "Hello, Unleashed Reader". We will discuss the specifics of the synthesis commands when we deal with speech in a later section of this chapter.

      // Speak the "Hello, Unleashed Reader" string
      synth.speakPlainText("Hello, Unleashed Reader!", null);

Next, we have to tell the synthesizer to wait until the queue is empty before running the rest of our program. This ensures that the resources will be held long enough to finish.

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

The deallocate() method releases all the resources reserved for this program's use when the allocate() method was called:

      synth.deallocate();

It is wise to deallocate the resources and allocate them again if you might not use the synthesizer for an extended period of time or if some of the exclusively held resources might be needed elsewhere.

This architecture reminds us of the way that Java itself is organized. For our programs to be portable, they must not have any concrete connection to the underlying hardware. This is the reason that abstraction layers such as the Java Speech APIs are so valuable. The same programming interface can support a number of underlying products in a very similar and sometimes identical manner.

The output of this program is audible. You should hear the phrase "Hello, Unleashed Reader" through your speakers or headphones. In addition, you will see the following line on the console for this application:

You should be hearing Hello, Unleashed Reader now.

Engine States

The engine itself moves through certain states. These states keep the engine behaving properly by allowing it to change its behavior when conditions change, as well as in response to user actions.

Four different allocation states exist that the engine passes through while doing its work: DEALLOCATED, ALLOCATED, ALLOCATING_RESOURCES, and DEALLOCATING_RESOURCES. The ALLOCATED state is the only one that allows any voice synthesis to be performed.

While the engine is in the allocated state, it can also have several substates. One substate is PAUSED, and its converse is RESUMED.

Another pair of substates, independent of the paused/resumed pair, is QUEUE_EMPTY and QUEUE_NOT_EMPTY. While there is data to be processed on the queue, the state is QUEUE_NOT_EMPTY; otherwise, the state is QUEUE_EMPTY. These two states apply to speech synthesis but not to recognition. Listing 12.2 shows a simple speech-synthesizing example with the states printed out.

Listing 12.2 The HelloUnleashedStates Class

/*
 * HelloUnleashedStates.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloUnleashedStates
{
  private static void printState(Synthesizer synth)
  {
    System.out.println("The current States are:");
    if(synth.testEngineState(Synthesizer.QUEUE_EMPTY))
      System.out.println("State = QUEUE_EMPTY");
    if(synth.testEngineState(Synthesizer.QUEUE_NOT_EMPTY))
      System.out.println("State = QUEUE_NOT_EMPTY");
    if(synth.testEngineState(Engine.ALLOCATED))
      System.out.println("State = ALLOCATED");
    if(synth.testEngineState(Engine.DEALLOCATED))
      System.out.println("State = DEALLOCATED");
    if(synth.testEngineState(Engine.ALLOCATING_RESOURCES))
      System.out.println("State = ALLOCATING_RESOURCES");
    if(synth.testEngineState(Engine.DEALLOCATING_RESOURCES))
      System.out.println("State = DEALLOCATING_RESOURCES");



    if(synth.testEngineState(Engine.RESUMED))
      System.out.println("State = RESUMED");
    if(synth.testEngineState(Engine.PAUSED))
      System.out.println("State = PAUSED");

  }

  public static void main(String args[])
  {
    try
    {
      // Create a synthesizer for English
      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));
      printState(synth);
      synth.allocate();
      printState(synth);
      synth.resume();
      printState(synth);

      // Speak the "Hello, Unleashed States" string
      synth.speakPlainText("Hello, Unleashed States!", null);
      printState(synth);

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
      printState(synth);

      // release the resources
      synth.deallocate();
      printState(synth);
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

The output from running this example is shown in the following. We can see the progression of states as the synthesizer allocates the resources, reads in the phrase, places it on the queue, processes it off the queue, and deallocates the resources:

The current States are:
State = QUEUE_EMPTY
State = DEALLOCATED
State = RESUMED
The current States are:
State = QUEUE_EMPTY
State = ALLOCATED
State = RESUMED
The current States are:
State = QUEUE_EMPTY
State = ALLOCATED
State = RESUMED
The current States are:
State = QUEUE_NOT_EMPTY
State = ALLOCATED
State = RESUMED
The current States are:
State = QUEUE_EMPTY
State = ALLOCATED
State = RESUMED
The current States are:
State = QUEUE_EMPTY
State = DEALLOCATED
State = RESUMED

In addition to the previously printed lines, the phrase "Hello, Unleashed States" is audible. From this example, it is easy to see how the different states change independently of one another.

Allocating in a Thread

Speech programs require quite a bit of processing power, which can hurt the perceived performance of your program. One way to address this is to allocate the engine in a thread separate from the main program. All you have to do is declare an anonymous inner class that runs the allocate() method, trap any exceptions, and execute the start() method as shown in Listing 12.3.

Listing 12.3 The HelloThread Example

/*
 * HelloThread.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloThread
{

  public static void main(String args[])
  {
    Engine eng;
    try
    {
      // Create a synthesizer for English
      final Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

      new Thread(new Runnable()
      {
        public void run()
        {
          try
          {
            synth.allocate();
          }catch (Exception e)
          {
            System.out.println("Exception " + e);
          }
        }
      }).start();

      //Add the rest of your initialization code here


      //wait for the engine to get ready
      synth.waitEngineState(Engine.ALLOCATED);
      synth.resume();

      // Speak the "Hello, Thread" string
      synth.speakPlainText("Hello, Thread!", null);
      System.out.println(
      "You should be hearing Hello, Thread now.");


      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

      // release the resources
      synth.deallocate();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

There is no need to name this inner class, so we just begin the statement by declaring it with the keyword new:

      new Thread(new Runnable()
      {
        public void run()

We perform the allocate on the synthesizer handle just as we did before, only this is done inside the new thread:

            synth.allocate();

Because we are accessing synth from within the anonymous inner class, we have to declare it final. This is okay because we never need to reassign its value:

      final Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

We must coordinate the timing of the two threads for the example to work. Specifically, we must tell the main thread to wait until there is an allocated synthesizer to make calls to:

      synth.waitEngineState(Engine.ALLOCATED);

The output from this application is an audible voice in your speaker or headphone, along with the following line on the console:

You should be hearing Hello, Thread now.

In an example this trivial, a separate thread is obviously overkill. In many applications though, a complex GUI could be created in the time it takes to allocate the resources needed to support the speech engine.

Speech Synthesis

Now that we are familiar with the speech engine, we can use it to do some work for our programs. The example in Listing 12.3 provided the correct audible output for a very simple sentence. As sentences become more complex, it becomes necessary for the programmer to provide direction to the synthesizer on the pronunciation of certain words and phrases. The mechanism for providing this input is called the Java Speech Markup Language (JSML). To illustrate this, run the code in Listing 12.4 and listen to the result.

Listing 12.4 The HelloShares Example

/*
 * HelloShares.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class HelloShares
{

  public static void main(String args[])
  {
    try
    {
      // Create a synthesizer for English
      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));
      synth.allocate();
      synth.resume();

      // Speak the "Hello," string
      synth.speak("I own 1999 shares of stock", null);


      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

      // release the resources
      synth.deallocate();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

Caution - There is considerable difference between the operations of different synthesizers. A phrase that is rendered correctly without JSML in one synthesizer might not be rendered correctly in another.


The primary functions of a synthesizer are to manage the queue of text to be output, to produce events as that processing proceeds, and, of course, to speak the text:

      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

The basic method used to speak is called speak(). It honors the JSML commands in the text and produces output. Consider this case:

      synth.speak("I own 1999 shares of stock", null);

The second parameter lets us pass a listener object if we want to be notified when events occur.
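If you do want those notifications, one approach is to pass an object that extends the SpeakableAdapter convenience class from javax.speech.synthesis, overriding only the callbacks you care about. The following is a minimal sketch (the messages printed are our own; a real listener could just as easily update a GUI or release a resource):

      SpeakableListener listener = new SpeakableAdapter()
      {
        public void speakableStarted(SpeakableEvent e)
        {
          System.out.println("Started speaking: " + e.getSource());
        }

        public void speakableEnded(SpeakableEvent e)
        {
          System.out.println("Finished speaking: " + e.getSource());
        }
      };

      synth.speak("I own 1999 shares of stock", listener);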

When you run this program, you will probably hear the phrase, "I own nineteen ninety-nine shares of stock." This sounds like the year 1999 instead of the number 1,999. We understand what this means, but it is not the natural way to say it. We need a way to provide instructions to the speech synthesizer about how a word or number should be pronounced. The Java Speech Markup Language gives us that way.

The Java Speech Markup Language (JSML)

Now that we have succeeded in getting a Java program to say a few words, we drill down to the next goal, which is to get a program to say words as naturally as possible. As with most programming topics, the concept is easy, but the details make it hard. These pesky details include the following:

  • Ambiguity—Consider the sentence, "Team U.S.A. has advanced to the finals." How does the speech synthesizer know that the sentence didn't end after the A? How do we teach it the difference between the noun object and the verb object?

  • Abbreviations—Consider the sentence, "The St. Louis Rams have won the Super Bowl." How does the synthesizer know that St. is pronounced Saint?

  • Acronyms—Abbreviations such as IBM are never spoken as a word (that is, we say I-B-M rather than ibbumm), whereas acronyms such as UNICEF are almost never spelled out (that is, we say unicef rather than U-N-I-C-E-F). How does it know which to do?

  • Dates and numbers—If the synthesizer sees the string 1999, does it pronounce it as the year 1999 or the number one thousand nine hundred and ninety-nine?

  • Computerese—How do we teach a computer to say A-P-I instead of Appi?

  • Foreign words—How does it handle foreign words such as mahimahi or tsunami?

  • Jargon—How well can the synthesizer handle words such as instantiation or objectify?

One way of addressing this problem is provided in the Java Speech Markup Language (JSML). This XML-based language allows the programmer or system manager to add markup (special tags) to the text that provides pronunciation hints to the synthesizer. This can greatly improve the accuracy of the "reading," but it requires some additional labor.
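For example, a fragment like the following (a sketch only; the class="literal" hint is drawn from the JSML specification, and support for particular sayas classes varies from one synthesizer to another) asks the synthesizer to spell out an acronym rather than pronounce it as a word:

<jsml>
  Call the <sayas class="literal">API</sayas> before you exit.
</jsml>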

JSML is an XML-based language, and it must, therefore, conform to the requirement that its documents be well formed. It is not required that they also be validated, but it is a good idea in production applications.

The fact that a speech synthesizer can accept JSML makes it an XML application. The synthesizer contains an XML parser that is responsible for finding the XML elements in the JSML file or string. The parser extracts the values from these elements and hands them to the synthesizer. The synthesizer is responsible for handling or ignoring them. This parser is nonvalidating, so a DTD is not required. If a tag is encountered that is not understood, the parser ignores it.

You can think of the synthesizer as a kind of audible browser. A graphical browser receives an HTML document that is a combination of data and instructions. The browser parses the file, removes the instructions, and displays the data in accordance with the instructions. The synthesizer receives a JSML that is a combination of data and instructions. The synthesizer parses the file, removes the instructions, and plays the data portion in accordance with the instructions. Table 12.1 shows the element types for JSML.

Table 12.1 JSML Elements

Element      Description

jsml         Root element for JSML documents
div          Marks sentences and paragraphs
voice        Speaking voice for contained text
sayas        Specifies how to say the text
phoneme      Specifies pronunciation
emphasis     Emphasizes the text
break        Specifies a break in the speech
prosody      Indicates the pitch, rate, and volume
marker       Request for notification
engine       Native instructions to a synthesizer

Let's work an example that illustrates how to use JSML to improve speech output. First, we need to learn how to remove the hard-coded text from our program so that it can play different JSML documents. The easiest way to do this is by creating a new class that implements the javax.speech.synthesis.Speakable interface. This interface requires one method, getJSMLText(). It returns a string that contains JSML text. Listing 12.5 shows us an example of this class.

Listing 12.5 The SpeakableDate Class

/*
 * SpeakableDate.java
 *
 * Created on March 12, 2002, 11:54 AM
 */

package unleashed.ch12;

import javax.speech.synthesis.Speakable;
import java.util.Date;

/**
 *
 * @author Stephen Potts
 * @version
 */
public class SpeakableDate implements Speakable
{

  /** Creates new SpeakableDate */
  public SpeakableDate()
  {
  }

  /** getJSMLText is the only method of Speakable */
  public String getJSMLText()
  {
    StringBuffer buf = new StringBuffer();
    String todayString = "3/12/2002";

    // Say the date twice: as plain text and then marked up as a date
    buf.append("<jsml>");
    buf.append("Today is " + todayString );
    buf.append("Today is <sayas class=\"date\">"+ todayString +" </sayas>");
    buf.append("</jsml>");

    return buf.toString();
  }
}

This class exists to provide a JSML document to a synthesizer. Some programmers consider it more convenient to create their XML documents in a class such as this one instead of in a file. This class uses StringBuffer while the string is being built up because it performs better than the String class when it is changed frequently.

    buf.append("Today is " + todayString );
    buf.append("Today is <sayas class=\"date\">"+ todayString +" </sayas>");

Two separate strings are added to the JSML document. The first string simply adds the string literal. The second identifies it as a date. Listing 12.6 shows us the synthesizer class that runs this JSML.

Listing 12.6 The JSMLSpeaker Class

/*
 * JSMLSpeaker.java
 *
 * Created on March 5, 2002, 3:32 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */
import javax.speech.*;
import javax.speech.synthesis.*;
import java.util.Locale;

public class JSMLSpeaker
{

  public static void main(String args[])
  {
    Speakable sAble = new SpeakableDate();
    try
    {
      // Create a synthesizer for English
      Synthesizer synth = Central.createSynthesizer(
          new SynthesizerModeDesc(Locale.ENGLISH));
      synth.allocate();
      synth.resume();

      // Speak the string
      synth.speak(sAble, null);
      System.out.println("You are hearing the JSML output now.");

      // Wait till speaking is done
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

      // release the resources
      synth.deallocate();
    } catch (Exception e)
    {
      e.printStackTrace();
    }
  }
}

This class resembles the synthesizer classes that we mentioned earlier in that it allocates and deallocates the synthesizer. It is different, though, because it expects to process a Speakable class instead of a string literal. The Speakable class is called SpeakableDate, and it is shown earlier in Listing 12.5.

    Speakable sAble = new SpeakableDate();

Conceptually, this is a class version of a JSML document. We can pass a handle to that class to the synthesizer's speak() method. The method treats it much like an XML document: the content is parsed and interpreted as speech.

The output from running this program is audible and visible. The audible portion reads the date as characters with the slashes pronounced:

Today is three slash twelve slash two thousand two

The second time, the date is pronounced more like a date with the slashes omitted:

Today is three twelve two thousand two

In addition, the following visual message is written to the console:

You are hearing the JSML output now.

Listing 12.7 illustrates another example that shows some interesting JSML elements.

Listing 12.7 The SpeakableSlow Class

/*
 * SpeakableSlow.java
 *
 * Created on March 12, 2002, 11:54 AM
 */

package unleashed.ch12;

import javax.speech.synthesis.Speakable;

/**
 *
 * @author Stephen Potts
 * @version
 */
public class SpeakableSlow implements Speakable
{

  /** Creates new SpeakableSlow */
  public SpeakableSlow()
  {
  }

  /** getJSMLText is the only method of Speakable */
  public String getJSMLText()
  {
    StringBuffer buf = new StringBuffer();

    // Say the sentence several ways, slowing down the number of shares
    buf.append("<jsml>");
    buf.append("I own 1500 shares");
    buf.append("I own 1500 shares.");
    buf.append("<div>I own <PROS RATE=\"-20%\">1,500</PROS> shares</div> ");
    buf.append("I own <PROS RATE=\"-50%\">1500</PROS> shares");
    buf.append("</jsml>");

    return buf.toString();
  }
}

You can play this document by altering the following line in the JSMLSpeaker class shown in Listing 12.6:

    Speakable sAble = new SpeakableSlow();

Note - The code file for this modified example is available in the source code for this chapter and is named JSMLSpeaker2.java.


Several features are illustrated in this example. The first string is read at a normal speed. When its end is reached, the synthesizer runs the first and second strings together and tries to pronounce "sharesI." The first attempt to cure this problem was to add a period at the end of the second string, but this causes "dot" to be uttered:

    buf.append("I own 1500 shares");
    buf.append("I own 1500 shares.");

Finally, the elements <div> and </div> identified the text between them as a division, and the synthesizer stopped slurring the words together. A space will work in some instances, but the explicit <div> tag is normally used.

Next, we added some special elements called prosody elements to slow down the speaking of the numbers. Prosody elements provide details on the rate of speech that is desired:

    buf.append("<div>I own <PROS RATE=\"-20%\">1,500</PROS> shares</div> ");

The rate dropped by 20% so that the most important part of the sentence, the number of shares, would be easier to comprehend. Notice the difference between the way that the number is pronounced, depending on the presence or absence of the comma in the numerals. If the comma is present, the number is pronounced one thousand five hundred. Without the comma, it is pronounced like the year 1500.

You can slow the rate down even more if you choose, as we did in the last sentence:

    buf.append("I own <PROS RATE=\"-50%\">1500</PROS> shares");

This will make the output sound as if it needs to take some vitamins.

The output from running this will be the phrase repeated four times with the previously mentioned variations. At the same time, the following phrase will appear on your console to prove that the program ran:

You are hearing the JSML output now.

Finally, we will create a JSML version of the HelloShares example in Listing 12.4. As you recall, the "1999" phrase was pronounced "nineteen ninety-nine" instead of "one thousand nine hundred ninety nine." The JSML version of the program is shown in Listing 12.8.

Listing 12.8 The SpeakableShares Class

/*
 * SpeakableShares.java
 *
 * Created on March 12, 2002, 11:54 AM
 */

package unleashed.ch12;

import javax.speech.synthesis.Speakable;
import java.util.Date;

/**
 *
 * @author Stephen Potts
 * @version
 */
public class SpeakableShares implements Speakable
{
  
  /** Creates new SpeakableShares */
  public SpeakableShares()
  {
  }
  
  /** getJSMLText is the only method of Speakable */
  public String getJSMLText()
  {
    StringBuffer buf = new StringBuffer();
    String shareString = "1,999";
    
    // Speak the shareString
    buf.append("<jsml>");
    buf.append("I own 1999 shares of stock");
    buf.append("I own <sayas class=\"number\">"+ shareString 
               +" </sayas>" + " shares of stock");
    buf.append("</jsml>");
    
    return buf.toString();
  }
}

Notice that the shareString contains a comma between the first and second digits:

    String shareString = "1,999";

That, in combination with the class="number" attribute in the <sayas> tag, provides an unambiguous specification for this pronunciation.

You can run this code by running the program called unleashed.ch12.JSMLSpeaker3. This class is identical to the previous one except that it instantiates the SpeakableShares class. It is also included with the code download for this chapter.

The result of running this example is both audible and visual. First you will hear 1999 pronounced "nineteen ninety-nine." Next, you will hear it pronounced "one thousand nine hundred ninety nine." You will see the usual message on the console:

You are hearing the JSML output now.

Note - Because the Java Speech API is always running on top of a third-party speech product, the behavior that you observe might be different. For example, some products might recognize a period as a sentence ending instead of as a "dot" as we encountered in this chapter.


JSML is a very good example of how to use XML to improve your products. By embedding these controls in the JSML file, it is possible to place the specification of the pronunciation details outside the program and into the data.
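As a final illustration of that point, the markup does not have to be built up inside a Speakable class at all. The following is a minimal sketch (the filename is hypothetical, and the stream handling assumes java.io.BufferedReader and FileReader are imported) that reads a JSML document from an external file and hands the resulting string to speak():

      // Read a JSML document from an external file and speak it
      BufferedReader in = new BufferedReader(
          new FileReader("c:/unleashed/ch12/shares.jsml"));
      StringBuffer jsml = new StringBuffer();
      String line;
      while ((line = in.readLine()) != null)
      {
        jsml.append(line);
      }
      in.close();

      synth.speak(jsml.toString(), null);  // speak(String) interprets the JSML tags
      synth.waitEngineState(Synthesizer.QUEUE_EMPTY);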

Speech Recognition

The other side of Java Speech is speech recognition. As you might have predicted, the state of the art in recognition is not nearly as advanced as it is in speech synthesis. The reason for this is simple: it is a harder problem.

If you are an English speaker, your ear might be finely tuned to the nuances of the language as native speakers from your region of the United States or Canada pronounce them. If you relocate to another part of your country, or to another English-speaking country such as Australia, your ability to understand the language is diminished for a while. Over time, your brain learns the subtleties of the new dialect, and you once again become a fluent listener.

For a computer, the problem is similar. Recognizers receive information electronically via microphones. They then must try to determine what set of syllables to create from the set of phonemes (sounds) just received. These syllables must then be combined into words.

Recognition Grammars

A grammar simplifies the job of the speech recognizer by limiting the number of possible words and phrases that it has to consider when trying to determine what a speaker has said. There are two kinds of grammars: rule grammars and dictation grammars.

Rule grammars are composed of tokens and rules. When a user speaks, the input is compared to the rules and tokens in the grammar to determine the identity of the word or phrase. An application provides a rule grammar to a recognizer, normally during initialization.
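Rules can also be built from other rules, which keeps larger grammars readable. The following fragment is only a sketch in the same JSGF syntax used later in Listing 12.9; the rule names and tokens are our own invention:

grammar unleashed.ch12.commands;

public <command> = <action> <object>;
<action> = open | close | print;
<object> = the report | the spreadsheet | my mail;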

Dictation grammars are built into the recognizer itself. They define thousands of words that can be spoken in a free-form fashion. Dictation grammars come closer to our ultimate goal of unrestricted speech, but, at present, they are slower than rule grammars and more prone to errors.
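If your recognizer supports dictation, the grammar is obtained from the recognizer rather than loaded from a file. The following is a minimal sketch (it assumes an allocated recognizer, as in Listing 12.10, and an engine that actually provides a dictation grammar; not all of them do):

      // Enable free-form dictation, if the engine supports it.
      // Passing null asks for the default dictation grammar.
      DictationGrammar dictation = recognizer.getDictationGrammar(null);
      dictation.setEnabled(true);
      recognizer.commitChanges();
      recognizer.requestFocus();
      recognizer.resume();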


Note - There are four basic error types that recognizers suffer from regardless of the grammar employed:

  • Failure to recognize a valid word

  • Misinterpreting a word to be another valid word

  • Detecting a word where none was present

  • Failure to recognize that a word was spoken


Java Speech supports dynamic grammars. This means that grammars can be modified at runtime. After a change is made to the grammar, it must be committed using the commitChanges() method of the recognizer. When these changes are committed, they are committed atomically, meaning all at once. Listing 12.9 shows a simple grammar.

Listing 12.9 A Simple Grammar

grammar javax.speech.demo;

public <sentence> = Hello world |
          Hello Java Unleashed |
          Java Speech API |
          computer |
          bye |
          I program computers;

This rule grammar defines a single public rule with six alternative phrases. A recognizer that is working against this grammar will understand no other words, phrases, or parts of phrases. The reason for this is to simplify the processing and increase the likelihood that an accurate result will be obtained.

This rule grammar is written in the Java Speech Grammar Format (JSGF). Grammars formatted in JSGF can be converted logically into RuleGrammar objects and back again. (The result might look different, but it will be equivalent.)

Armed with a grammar, we need a recognizer program to process speech against it. Listing 12.10 shows a program that will serve as a recognizer for this grammar.

Listing 12.10 The HelloRecognizer Class

/*
 * HelloRecognizer.java
 *
 * Created on March 11, 2002, 9:53 PM
 */

package unleashed.ch12;

/**
 *
 * @author Stephen Potts
 * @version
 */

import javax.speech.*;
import javax.speech.recognition.*;
import java.io.FileReader;
import java.util.Locale;

public class HelloRecognizer extends ResultAdapter
{

  static Recognizer recognizer;
  String gst;

  public void resultAccepted(ResultEvent re)
  {
    try
    {
      Result res = (Result)(re.getSource());
      ResultToken tokens[] = res.getBestTokens();

      for (int i=0; i < tokens.length; i++)
      {
        gst = tokens[i].getSpokenText();
        System.out.print(gst + " ");
      }
      System.out.println();

      if(gst.equals("bye"))
      {
        System.out.println("See you later!");
        recognizer.deallocate();
        System.exit(0);
      }
    }catch(Exception ee)
    {
      System.out.println("Exception " + ee);
    }
  }

  public static void main(String args[])
  {
    try
    {
      recognizer = Central.createRecognizer(
         new EngineModeDesc(Locale.ENGLISH));
      recognizer.allocate();

      FileReader grammar1 =
       new FileReader("c:/unleashed/ch12/SimpleGrammar.txt");

      RuleGrammar rg = recognizer.loadJSGF(grammar1);
      rg.setEnabled(true);

      recognizer.addResultListener(new HelloRecognizer());

      System.out.println("Ready for Input");
      recognizer.commitChanges();

      recognizer.requestFocus();
      recognizer.resume();
    }catch (Exception e)
    {
      System.out.println("Exception " + e);
    }
  }
}

Note - The filename in the FileReader constructor must match the actual filename of the SimpleGrammar.txt file on your computer.


The creation of a recognizer is similar to the creation of a synthesizer. We use the Central class to create both:

      recognizer = Central.createRecognizer(
         new EngineModeDesc(Locale.ENGLISH));

Once again, we have chosen English as the language for this example. Once we have a recognizer, we can load the grammar:

      FileReader grammar1 =
       new FileReader("c:/unleashed/ch12/SimpleGrammar.txt");

      RuleGrammar rg = recognizer.loadJSGF(grammar1);
      rg.setEnabled(true);

We will load the grammar from Listing 12.9, which is stored in the SimpleGrammar.txt file. We create a RuleGrammar object and set it to be enabled.

The event listener will do the runtime work of the program. We set the event listener here:

      recognizer.addResultListener(new HelloRecognizer());

Next, we complete the initialization of the recognizer by committing the grammar, getting the focus, and putting the recognizer in the RESUMED state:

      recognizer.commitChanges();
      recognizer.requestFocus();
      recognizer.resume();

When a spoken pattern is recognized as part of the grammar, a result event occurs and the resultAccepted() method is called. The event object contains the information that we need to find out which phrase in the grammar was spoken:

    Result res = (Result)(re.getSource());
    ResultToken tokens[] = res.getBestTokens();

The Result interface documentation says that getBestTokens() guesses at the phrase that has been spoken. This is in reference to the inexact nature of the process of speech recognition. We then extract the string that the recognizer guesses is the correct one:

      gst = tokens[i].getSpokenText();

For all the strings except bye, the result is echoed to the console as shown here:

Hello world
Hello Java Unleashed
computer
I program computers
bye
See you later!

Once the bye string is received, we print it and exit the program, adding a See you later! as a confirmation that we have exited.
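Earlier, we noted that Java Speech grammars are dynamic. As a closing sketch (the additional grammar and the StringReader-based load are our own additions, not part of the chapter's listings), we could grow the recognizer's vocabulary at runtime and commit the change atomically:

      // A minimal sketch of a dynamic grammar update: suspend the
      // recognizer, load an additional JSGF grammar from a string,
      // and commit the change so it takes effect atomically.
      String extraGrammar =
          "grammar unleashed.ch12.extra;\n" +
          "public <more> = open the log | close the log;";

      recognizer.suspend();
      RuleGrammar extra =
          recognizer.loadJSGF(new java.io.StringReader(extraGrammar));
      extra.setEnabled(true);
      recognizer.commitChanges();
      recognizer.resume();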

Summary

This chapter covers the two primary functions of the Java Speech API—speech synthesis and speech recognition. In addition, you learned about the speech engine that provides services to both of these capabilities.

You learned how to synthesize speech using an implementation of the Synthesizer interface provided by the IBM ViaVoice product. We wrote several programs that produced speech from written input.

You also learned how to use the Java Speech Markup Language (JSML) to give instructions to the speech engine about how to pronounce the words, dates, and numbers that appear in the text. You saw examples of how you can use XML tags to communicate this information to the synthesizer.

Finally, we took a look at the art of recognizing speech with software. We created a simple grammar using the Java Speech Grammar Format (JSGF) and loaded it into a recognizer program. We then spoke into the microphone and watched as our spoken words appeared in the console. In addition, you saw how a command can be tied to a spoken word by the way the word bye was used to close this program.

The subject of speech in Java is larger than a single chapter can cover. This chapter provides enough information so that you will be able to get both a synthesizer and a recognizer working on your computer. Hopefully, you will be able to copy and paste these programs and enhance them to meet the requirements of your projects.

Authors of this Chapter

Stephen Potts is an independent consultant, author, and Java instructor in Atlanta, Georgia (United States). Steve received his computer science degree in 1982 from Georgia Tech. He has worked in a number of disciplines during his 20-year career, with manufacturing being his area of greatest expertise. His previous books include Special Edition Using Visual C++ 4 and Java 1.2 How-To. He can be reached via e-mail at stevepotts@mindspring.com.

