
Processing Speech with Java

  • By Sams Publishing

Speech Synthesis

Now that we are familiar with the speech engine, we can use it to do some work for our programs. The example in Listing 12.3 provided the correct audible output for a very simple sentence. As sentences become more complex, it becomes necessary for the programmer to provide direction to the synthesizer on the pronunciation of certain words and phrases. The mechanism for providing this input is called the Java Speech Markup Language (JSML). To illustrate this, run the code in Listing 12.4 and listen to the result.

Listing 12.4 The HelloShares Example

    /*
     * HelloShares.java
     * Created on March 5, 2002, 3:32 PM
     */
    package unleashed.ch12;

    import javax.speech.*;
    import javax.speech.synthesis.*;
    import java.util.Locale;

    /**
     * @author Stephen Potts
     * @version
     */
    public class HelloShares
    {
        public static void main(String args[])
        {
            try
            {
                // Create a synthesizer for English
                Synthesizer synth = Central.createSynthesizer(
                    new SynthesizerModeDesc(Locale.ENGLISH));

                // Get the synthesizer ready to speak
                synth.allocate();
                synth.resume();

                // Speak the string
                synth.speak("I own 1999 shares of stock", null);

                // Wait till speaking is done
                synth.waitEngineState(Synthesizer.QUEUE_EMPTY);

                // release the resources
                synth.deallocate();
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }

Caution - Synthesizers differ considerably in their behavior. A phrase that is rendered correctly without JSML by one synthesizer might not be rendered correctly by another.

The primary functions of a synthesizer are to manage the queue of text to be output, to produce events as processing proceeds, and, of course, to speak the text:

      Synthesizer synth = Central.createSynthesizer(
      new SynthesizerModeDesc(Locale.ENGLISH));

The simplest method for speaking text is called speak(). It interprets any JSML commands in the text and produces the spoken output. Consider this case:

      synth.speak("I own 1999 shares of stock", null);

The second parameter lets us pass a listener object (a SpeakableListener) if we want to be notified when speech events occur; here we pass null because no notification is needed.

When you run this program, you will probably hear the phrase, "I own nineteen ninety-nine shares of stock." This sounds like the year 1999 instead of the number 1,999. We understand what this means, but it is not the natural way to say it. We need a way to provide instructions to the speech synthesizer about how a word or number should be pronounced. The Java Speech Markup Language gives us that way.
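The fix is previewed here: the number can be wrapped in JSML markup. This is an illustrative sketch, assuming a synthesizer that honors the sayas element covered later in this section:

```xml
<!-- JSML sketch: class="number" asks the engine to read 1999 as the
     quantity 1,999 rather than the year 1999 -->
I own <sayas class="number">1999</sayas> shares of stock
```

Passing this marked-up string to speak() in place of the plain text is all that is required; the synthesizer strips out the markup before producing the audio.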

The Java Speech Markup Language (JSML)

Now that we have succeeded in getting a Java program to say a few words, we drill down to the next goal, which is to get a program to say words as naturally as possible. As with most programming topics, the concept is easy, but the details make it hard. These pesky details include the following:

  • Ambiguity—Consider the sentence, "Team U.S.A. has advanced to the finals." How does the speech synthesizer know that the sentence didn't end after the A? How do we teach it the difference between the noun "object" and the verb "object"?

  • Abbreviations—Consider the sentence, "The St. Louis Rams have won the Super Bowl." How does the synthesizer know that St. is pronounced Saint?

  • Acronyms—Abbreviations such as IBM are never spoken as a word (that is, we say I-B-M rather than ibbumm), whereas acronyms such as UNICEF are almost never spelled out (that is, we say unicef rather than U-N-I-C-E-F). How does it know which to do?

  • Dates and numbers—If the synthesizer sees the string 1999, does it pronounce it as the year 1999 or the number one thousand nine hundred and ninety-nine?

  • Computerese—How do we teach a computer to say A-P-I instead of Appi?

  • Foreign words—How does it handle foreign words such as mahimahi or tsunami?

  • Jargon—How well can the synthesizer handle words such as instantiation or objectify?

One way of addressing these problems is provided by the Java Speech Markup Language (JSML). This XML-based language allows the programmer or system manager to add markup (special tags) to the text that provides pronunciation hints to the synthesizer. This can greatly improve the accuracy of the "reading," but it requires some additional labor.

JSML is an XML-based language, and its documents must, therefore, be well formed. They are not also required to be valid, but validating them is a good idea in production applications.

The fact that a speech synthesizer can accept JSML makes it an XML application. The synthesizer contains an XML parser that is responsible for finding the XML elements in the JSML file or string. The parser extracts the values from these elements and hands them to the synthesizer. The synthesizer is responsible for handling or ignoring them. This parser is nonvalidating, so a DTD is not required. If a tag is encountered that is not understood, the parser ignores it.

You can think of the synthesizer as a kind of audible browser. A graphical browser receives an HTML document that is a combination of data and instructions. The browser parses the file, removes the instructions, and displays the data in accordance with the instructions. The synthesizer receives a JSML document that is likewise a combination of data and instructions. The synthesizer parses the file, removes the instructions, and plays the data portion in accordance with the instructions. Table 12.1 shows the element types for JSML.

Table 12.1 JSML Elements

Element     Description
jsml        Root element for JSML documents
div         Marks sentences and paragraphs
voice       Speaking voice for contained text
sayas       Specifies how to say the text
phoneme     Specifies pronunciation
emphasis    Emphasizes the text
break       Specifies a break in the speech
prosody     Indicates the pitch, rate, and volume
marker      Request for notification
engine      Native instructions to a synthesizer
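To see several of these elements working together, here is a small, well-formed JSML document. It is an illustrative sketch; attribute support, and the exact rendering, varies from synthesizer to synthesizer:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<jsml>
  <div type="paragraph">
    <div type="sentence">
      <!-- class="literal" asks for letter-by-letter reading: U-S-A -->
      Team <sayas class="literal">USA</sayas> has advanced to the finals.
    </div>
    <div type="sentence">
      <emphasis>Congratulations</emphasis> to the team!
      <break size="medium"/>
      <prosody rate="-20%">They deserve a celebration.</prosody>
    </div>
  </div>
</jsml>
```

Because the parser is nonvalidating, no DTD declaration is needed, and any element a particular synthesizer does not understand is simply ignored.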


This article was originally published on September 26, 2002
