The Voice of XML

  • By Michael Classen

VoiceXML, an XML vocabulary for specifying IVR (Interactive Voice Response) systems, was submitted to the W3C more than a year ago. Initially it received little attention, but now that services such as Tellme and BeVocal provide developer platforms for such applications, interest has risen dramatically in the last couple of months.

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

The language describes the human-machine interaction provided by voice response systems, which includes:

  • Output of synthesized speech (text-to-speech).
  • Output of audio files.
  • Recognition of spoken input.
  • Recognition of DTMF input (touch tone).
  • Recording of spoken input.
  • Telephony features such as call transfer and disconnect.

Not all of these capabilities are mandatory for a VoiceXML platform, but at least output of synthesized speech and recognition of touch tones are required.

The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Uniform Resource Identifiers (URIs).
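As a minimal sketch, a complete VoiceXML document that collects one spoken or keyed-in value and submits it to a server could look like this (the servlet URL is a hypothetical example):

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="ask_zip">
    <!-- Declare a field variable "zip" and collect it with a built-in grammar -->
    <field name="zip" type="digits">
      <prompt>Please say your five digit zip code.</prompt>
    </field>
    <!-- Submit the collected variable to a (hypothetical) server-side script -->
    <block>
      <submit next="/servlet/weather" namelist="zip"/>
    </block>
  </form>
</vxml>
```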

A Sample Conversation

A phone conversation with a VoiceXML system at your bank could go like this:

System: Welcome to Big Buck's Bank.
Press or Say "one" for Account Balance Inquiry, "two" to speak to an operator.
You: One
System: Please type in or spell out your account number.
You: 123456
System: Please enter or spell your PIN for your account 123456.
You: PIN
System: Please enter or spell your four digit personal identification number, PIN.
You: 1234
System: Thank you. The balance on your account is ten dollars. If you wish to establish a credit line, please answer "yes", otherwise "no".
You: No, thanks.
System: Thank you and Goodbye.

Note how the system echoes spoken or keyed-in information and uses it to look up bank data, such as the account balance, in backend systems. Built-in error handling prompts again for wrong or misunderstood input.
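This kind of error handling is expressed with event handlers such as <noinput> and <nomatch> attached to a field; a sketch:

```xml
<field name="pin" type="digits">
  <prompt>Please enter or spell your four digit personal identification number.</prompt>
  <!-- Played when the caller says nothing; <reprompt/> replays the prompt -->
  <noinput>I did not hear anything. <reprompt/></noinput>
  <!-- Played when the input does not match the field's grammar -->
  <nomatch>I did not understand that. <reprompt/></nomatch>
</field>
```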


A VoiceXML application consists of a set of documents that describe a finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs pointing to the next document and dialog to use. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
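The transition mechanics can be sketched as follows (the document and dialog names are hypothetical). A <goto> moves to another dialog, possibly in another document, while <exit> terminates the conversation:

```xml
<form id="main">
  <block>
    <!-- Transition to the dialog "balance_info" in another document -->
    <goto next="account.vxml#balance_info"/>
  </block>
</form>

<form id="goodbye">
  <block>
    Thank you and goodbye.
    <!-- Explicitly terminate the conversation -->
    <exit/>
  </block>
</form>
```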

Dialogs and Subdialogs

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of field item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance.

Fields are the major building blocks of forms. A field declares a variable and specifies the prompts, grammars, DTMF sequences, help messages, and other event handlers that are used to obtain it. Each field declares a VoiceXML field item variable in the form's dialog scope. These may be submitted once the form is filled, or copied into other variables.

<form id="balance_info">
 <block>Welcome to the account balance inquiry.</block>
 <field name="account" type="digits">
  <prompt>What account number?</prompt>
  <catch event="help">
     Please speak the account number for which you
     want the balance.
  </catch>
 </field>
 <field name="pin" type="digits">
  <prompt>Your PIN?</prompt>
 </field>
 <block>
  <submit next="/servlet/balance" namelist="account pin"/>
 </block>
</form>

Each field has its own speech and/or DTMF grammars, specified explicitly using <grammar> and <dtmf> elements, or implicitly using the type attribute. The type attribute is used for standard built-in grammars, like digits, boolean, or number. The type attribute also governs how that field's value is spoken by the speech synthesizer.

Each field can have one or more prompts. If there is one, it is repeatedly used to prompt the user for the value until one is provided. If there are many, they must be given count attributes. These determine which prompt to use on each attempt.
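Prompt tapering with count attributes could look like this sketch, where a terser prompt is replaced by a more explicit one on later attempts:

```xml
<field name="account" type="digits">
  <prompt count="1">What account number?</prompt>
  <!-- A more explicit prompt, used from the second attempt on -->
  <prompt count="2">Please say or key in your six digit account number.</prompt>
</field>
```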

A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

<menu>
  <prompt>Welcome to Big Buck's Bank. Say one of: <enumerate/></prompt>
  <choice next="/servlet/account.vxml">Account balance</choice>
  <choice next="http://www.ft.com/news.vxml">Financial news</choice>
  <choice next="/servlet/operator.vxml">Operator</choice>
  <noinput>Please say one of <enumerate/></noinput>
</menu>

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Local data, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
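A confirmation subdialog along those lines could be sketched as follows (the form names, the field, and the servlet URL are hypothetical). The calling form invokes the subdialog with <subdialog>, and the subdialog hands its result back with <return>:

```xml
<!-- Calling document: invoke a reusable confirmation dialog -->
<form id="transfer">
  <subdialog name="confirm" src="confirm.vxml#confirm_dialog">
    <filled>
      <!-- The returned variable is available as confirm.answer -->
      <if cond="confirm.answer">
        <submit next="/servlet/transfer"/>
      </if>
    </filled>
  </subdialog>
</form>

<!-- confirm.vxml: the subdialog returns its result to the caller -->
<form id="confirm_dialog">
  <field name="answer" type="boolean">
    <prompt>Are you sure?</prompt>
  </field>
  <block>
    <return namelist="answer"/>
  </block>
</form>
```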


This article was originally published on December 7, 2002
