VoiceXML Developer Series, Introduction

  • By Jonathan Eisenzopf

The goal of the VoiceXML Developer series is to provide a complete set of tutorials that gives developers the insight they need to build professional-quality VoiceXML applications. The concepts increase in difficulty, so it's recommended that you read each edition in sequence. A thorough understanding of XML basics and server-side scripting is assumed.

VoiceXML is an XML format that utilizes existing telephony technology to interact with users over the telephone through speech recognition, speech synthesis, and standard Web technologies. The first edition of the VoiceXML Developer series will provide you with a synopsis of VoiceXML and a glimpse into the technology used to develop VoiceXML applications. Subsequent editions will go into the specific details of creating VoiceXML applications.


The VoiceXML 1.0 specification was released in March 2000 by the VoiceXML Forum, which was founded by technologists from Lucent, AT&T, IBM, and Motorola. The group was formed out of the need to create a unified standard for voice dialogs rather than requiring customers to learn several XML specifications that had been developed internally within each of the members' respective research labs (starting as early as 1995). Other non-founders had also experimented with voice dialog XML formats, including HP's TalkML and Sun's Java Speech Markup Language (JSML).

All of this led up to October 2000, when the VoiceXML Forum released VoiceXML 1.0 to the Voice Browser Group (founded in 1998) of the World Wide Web Consortium (W3C), the recognized standards body for the Web. This independent body has been working on the second version of the specification and has announced that it will release a revised specification toward the end of 2001.

The nascent industry has grown rapidly since its millennium debut into a market that is expected to reach $200 million in 2001 and $24 billion by 2005. The industry has been driven in part by an existing marketplace that has used Interactive Voice Response (IVR) systems for call center automation; think "Press 1 for your account balance. Press 2 to transfer funds." You've probably used such a system to check your bank or credit card balances.

So VoiceXML fills an existing need for automation by improving upon the current technology and making it simpler to implement and integrate into the rest of the enterprise. VoiceXML also provides a new opportunity for companies that have not been able to afford the cost or complexity of an IVR system, because it uses standard telephony components and leverages their existing Web infrastructure, applications, and developer skills.


A VoiceXML system is made up of a VoiceXML gateway that accesses static or dynamic VoiceXML content on the Web. The gateway contains a VoiceXML browser (interpreter), a Text-To-Speech (TTS) engine, an Automatic Speech Recognition (ASR) engine, and the telephony hardware that connects to the Public Switched Telephone Network (PSTN) via a T1, POTS, or ISDN telephone connection. A Plain Old Telephone Service (POTS) line is the type that's installed in your home and can handle only a single connection, whereas a T1 carries 24 individual phone lines.

A voice dialog typically consists of the following steps:

  1. The caller dials the system from a fixed or mobile telephone. The telephony hardware answers the call and passes it to the VoiceXML browser.
  2. The VoiceXML gateway retrieves a VoiceXML document from the specified Web server and plays a pre-recorded or synthesized prompt.
  3. The user speaks into the telephone or presses a key on the phone keypad (which produces DTMF tones).
  4. If the input is speech, the telephony equipment passes the recorded sound to the ASR engine, which matches it against a predefined grammar contained in the VoiceXML document.
  5. The VoiceXML browser executes the commands in the document based on the ASR results (whether or not the input matched the grammar), plays another pre-recorded or synthesized prompt, and waits for the user's response.
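
The steps above can be sketched as a small loop. This is only an illustration in Python, not how any particular gateway is implemented: the DOCUMENTS table, the dialog names, and the run_dialog function are all hypothetical, and typed strings stand in for both recognized speech and DTMF key presses.

```python
# Hypothetical stand-in for VoiceXML documents fetched from a Web server.
# Each dialog has a prompt and a grammar that maps accepted input
# (a spoken word or a DTMF digit) to the name of the next dialog.
DOCUMENTS = {
    "main": {
        "prompt": "Say balance or transfer, or press 1 or 2.",
        "grammar": {"balance": "balance", "1": "balance",
                    "transfer": "transfer", "2": "transfer"},
    },
    "balance": {"prompt": "Your balance is on its way.", "grammar": {}},
    "transfer": {"prompt": "Transfers are not available yet.", "grammar": {}},
}


def run_dialog(caller_inputs, start="main"):
    """Simulate steps 2-5 and return the list of prompts that were played."""
    played = []
    inputs = iter(caller_inputs)
    name = start
    while True:
        doc = DOCUMENTS[name]             # step 2: retrieve the document
        played.append(doc["prompt"])      # step 2: play its prompt
        if not doc["grammar"]:            # nothing to listen for: dialog ends
            return played
        heard = next(inputs, None)        # step 3: speech or DTMF input
        if heard is None:                 # caller hung up
            return played
        target = doc["grammar"].get(heard.strip().lower())  # step 4: match
        if target is None:                # step 5: no match, so reprompt
            played.append("Sorry, I did not understand.")
            continue
        name = target                     # step 5: match, move to next dialog


if __name__ == "__main__":
    for line in run_dialog(["hello", "2"]):
        print(line)
```

In this sketch an unmatched input triggers an apology followed by the same prompt again, which mirrors the "match against the grammar or not" branch in step 5.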

Speech Recognition (ASR)

There are three leading products in the speech recognition world today: ViaVoice from IBM, Nuance 7 from Nuance, and OpenSpeech Recognizer from SpeechWorks. The leading non-commercial ASR is Sphinx, a project maintained by the speech group at Carnegie Mellon University. ASR works by taking recorded audio from a telephony card and using advanced algorithms to match it against a dictionary and a set of grammars. A grammar defines the sets of words and phrases that the system expects users to speak.

Let's use a stock trading example. We might want to define a grammar that recognizes the action the user wants to take (buy or sell), the number of shares to trade, and the name or stock symbol of the company to trade. So we would need to break the grammar down into the following parts:

  • Recognize whether the user wants to buy or sell stock.
  • Recognize the company name and associate it with a stock symbol
  • Recognize the number of shares to trade

We would create a grammar rule for each item above and associate each with a VoiceXML form field so that when the user says something like:

I want to sell 2000 shares of Microsoft stock.

the system will recognize that:

  • the user wants to sell rather than buy
  • the user wants to trade 2000 shares
  • the user wants to trade Microsoft stock

This information would then be returned to the VoiceXML interpreter, which copies the results into the VoiceXML field values. Those values are in turn submitted to a back-end script for processing.
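
To make the idea of one rule per field concrete, here is a rough Python sketch. It is not VoiceXML grammar syntax and not how an ASR engine works internally: it assumes the recognizer has already turned the audio into text with the share count written as digits, and the SYMBOLS table, the TRADE pattern, and the fill_fields function are hypothetical names invented for this illustration.

```python
import re

# Hypothetical company-name-to-symbol table; a real grammar would list
# every company the application is able to trade.
SYMBOLS = {"microsoft": "MSFT", "ibm": "IBM"}

# One named group per form field: action, shares, and company.
TRADE = re.compile(
    r"\b(?P<action>buy|sell)\b.*?"
    r"\b(?P<shares>\d+)\s+shares?\b.*?"
    r"\b(?P<company>" + "|".join(SYMBOLS) + r")\b",
    re.IGNORECASE,
)


def fill_fields(utterance):
    """Return the field values for a matching utterance, or None."""
    match = TRADE.search(utterance)
    if match is None:
        return None  # the utterance did not match the grammar
    return {
        "action": match.group("action").lower(),
        "shares": int(match.group("shares")),
        "symbol": SYMBOLS[match.group("company").lower()],
    }


if __name__ == "__main__":
    print(fill_fields("I want to sell 2000 shares of Microsoft stock."))
```

The returned dictionary plays the role of the filled-in field values that would be submitted to the back-end script.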

Text-To-Speech (TTS)

It is less clear who the eventual leaders in speech synthesis (TTS) will be, but the current front-runners are Speechify from SpeechWorks, Vocalizer from Nuance, and Fonix. Even the best TTS engine is still sub-par for most listeners, so limit TTS use to dynamic content that can't be pre-recorded by a professional voice talent. We may see high-quality speech synthesis based on limited-domain synthesis techniques from companies like Cepstral, but it is still unclear when this technology will reach the mainstream.

TTS engines use a number of algorithms that assemble pre-recorded speech into the sounds for words. As a starting point, the basic phonemes of the language to be spoken (English, for example) are recorded and filed away. These phonemes are then combined to form words using a lexicon that tells the TTS engine which phonemes make up a particular word. The words are combined to form sentences, and so on, until the engine has built the entire phrase, which is usually returned as a WAV file.
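
The lexicon lookup and concatenation described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the LEXICON and UNITS tables, the SILENCE pause, and the synthesize function are hypothetical, and short lists of integers stand in for recorded audio. A real engine would also smooth the joins between units and write the result out as audio.

```python
# Hypothetical lexicon: word -> phoneme labels (ARPAbet-style names).
LEXICON = {
    "sell": ["S", "EH", "L"],
    "stock": ["S", "T", "AA", "K"],
}

# Hypothetical recorded units: phoneme -> audio samples. Short integer
# lists stand in for the recorded waveforms a real engine would store.
UNITS = {
    "S": [1, 1], "EH": [2, 2, 2], "L": [3],
    "T": [4], "AA": [5, 5], "K": [6],
}

SILENCE = [0]  # placeholder pause inserted between words


def synthesize(text):
    """Concatenate the recorded units for every word in the text."""
    samples = []
    for index, word in enumerate(text.lower().split()):
        if index > 0:
            samples.extend(SILENCE)
        for phoneme in LEXICON[word]:  # raises KeyError for unknown words
            samples.extend(UNITS[phoneme])
    return samples


if __name__ == "__main__":
    print(synthesize("sell stock"))
```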

VoiceXML contains elements that control things such as volume, speed, and pitch. Unfortunately, vendors implement these features differently, so you will need to tune them for your specific platform.

Telephony Equipment

VoiceXML gateways contain one or more telephony cards that handle things such as digital signal processing, call control, and call bridging. The leading card manufacturers are Dialogic (owned by Intel), Natural MicroSystems, Brooktrout, and Aculab. For the most part, VoiceXML abstracts away the existence of this hardware, so the developer can focus completely on the VoiceXML content generated by the Web server rather than on programming telephony cards. Most of the vendors support a wide range of connection options, including T1, E1, ISDN, and POTS.

This article was originally published on October 1, 2002