The Voice of XML

VoiceXML, an XML vocabulary for specifying IVR (Interactive Voice Response) systems, was submitted to the W3C more than a year ago. Initially it received little attention, but now that services like Tellme and BeVocal provide developer platforms for such applications, interest has risen dramatically in the last couple of months.

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

The language describes the human-machine interaction provided by voice response systems, which includes:


  • Output of synthesized speech (text-to-speech).
  • Output of audio files.
  • Recognition of spoken input.
  • Recognition of DTMF input (touch tone).
  • Recording of spoken input.
  • Telephony features such as call transfer and disconnect.

Not all of these capabilities are mandatory for a VoiceXML platform, but at least output of synthesized speech and recognition of touch tones are required.

The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Uniform Resource Identifiers (URIs).

A Sample Conversation

A phone conversation with a VoiceXML system at your bank could go like this:

System: Welcome to Big Buck’s Bank.
Press or Say “one” for Account Balance Inquiry, “two” to speak to an operator.
You: One
System: Please type in or spell out your account number.
You: 123456
System: Please enter or spell your PIN for your account 123456.
You: PIN
System: Please enter or spell your four digit personal identification number, PIN.
You: 1234
System: Thank you. The balance on your account is ten dollars. If you wish to establish a credit line, please answer “yes”, otherwise “no”.
You: No, thanks.
System: Thank you and Goodbye.

Note how the system echoes spoken or keyed-in information and uses it to look up bank information, such as the account balance, in backend systems. Built-in error handling prompts again when input is wrong or misunderstood.

Concepts

A VoiceXML application consists of a set of documents that describe a finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs pointing to the next document and dialog to use. Execution terminates when a dialog does not specify a successor, or when it has an element that explicitly exits the conversation.
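
As a rough sketch of these transitions, a single document could chain three dialogs as shown here. The dialog ids greeting, main_menu, and goodbye are invented for illustration, and a real platform may expect additional attributes.

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- First dialog: plays a greeting, then names its successor -->
  <form id="greeting">
    <block>
      Welcome to Big Buck's Bank.
      <goto next="#main_menu"/>
    </block>
  </form>
  <!-- Second dialog: the user's choice selects the next state -->
  <menu id="main_menu">
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="/servlet/account.vxml">Account</choice>
    <choice next="#goodbye">Quit</choice>
  </menu>
  <!-- Final dialog: ends the conversation explicitly -->
  <form id="goodbye">
    <block>
      Thank you and goodbye.
      <exit/>
    </block>
  </form>
</vxml>

A fragment such as #main_menu stays within the same document, while a URI such as /servlet/account.vxml fetches a new one.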

Dialogs and Subdialogs

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of field item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance.

Fields are the major building blocks of forms. A field declares a variable and specifies the prompts, grammars, DTMF sequences, help messages, and other event handlers that are used to obtain it. Each field declares a VoiceXML field item variable in the form's dialog scope. These may be submitted once the form is filled, or copied into other variables.

<form id="balance_info">       
 <block>Welcome to the account balance inquiry
  service.</block>       
 <field name="account" type="digits">       
  <prompt>What account number?</prompt>       
  <catch event="help">       
     Please speak the account number for which you
     want the balance.       
  </catch>       
 </field>       
 <field name="pin" type="digits">       
  <prompt>Your PIN?</prompt>       
 </field>       
 <block>       
  <submit next="/servlet/balance" namelist="account pin"/>
 </block>       
</form>       

Each field has its own speech and/or DTMF grammars, specified explicitly using <grammar> and <dtmf> elements, or implicitly using the type attribute. The type attribute is used for standard built-in grammars, like digits, boolean, or number. The type attribute also governs how that field's value is spoken by the speech synthesizer.
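
For comparison, a field that spells out its grammars explicitly instead of relying on the type attribute might look like the following sketch. The field name service and the inline JSGF grammars are assumptions for illustration; which grammar formats are accepted depends on the platform.

<field name="service">
  <prompt>
    Press or say one for account balance inquiry,
    two to speak to an operator.
  </prompt>
  <!-- Spoken input: the words "one" or "two" -->
  <grammar type="application/x-jsgf"> one | two </grammar>
  <!-- Touch-tone input: the keys 1 or 2 -->
  <dtmf type="application/x-jsgf"> 1 | 2 </dtmf>
</field>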

Each field can have one or more prompts. If there is one, it is repeatedly used to prompt the user for the value until one is provided. If there are many, they must be given count attributes. These determine which prompt to use on each attempt.
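
A minimal sketch of a field with several prompts, with the wording invented for illustration:

<field name="account" type="digits">
  <!-- Used on the first attempt -->
  <prompt count="1">What account number?</prompt>
  <!-- Used on the second attempt -->
  <prompt count="2">Please say or key in your account number.</prompt>
  <!-- Used on the third and all later attempts -->
  <prompt count="3">
    Please say or key in the digits of your account number,
    one at a time.
  </prompt>
</field>

On each attempt the interpreter uses the prompt with the highest count that does not exceed the number of attempts so far, so the prompts can become more detailed as the user keeps struggling.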

A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

<menu> 
  <prompt>Welcome to Big Buck's Bank. Say one of: <enumerate/></prompt> 
  <choice next="/servlet/account.vxml"> 
     Account
  </choice> 
  <choice next="http://www.ft.com/news.vxml"> 
     News
  </choice> 
  <choice next="/servlet/operator.vxml"> 
     Operator
  </choice> 
  <noinput>Please say one of <enumerate/></noinput> 
</menu> 

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Local data, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
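
The confirmation case could be sketched roughly as follows. The dialog id confirm_account and the variable names check, account_to_confirm, and confirmed are hypothetical; only the /servlet/balance URI comes from the earlier example.

<form id="balance_info">
  <field name="account" type="digits">
    <prompt>What account number?</prompt>
  </field>
  <!-- "Call" the confirmation dialog, passing the account as a parameter -->
  <subdialog name="check" src="#confirm_account">
    <param name="account_to_confirm" expr="account"/>
    <filled>
      <if cond="check.confirmed">
        <submit next="/servlet/balance" namelist="account"/>
      <else/>
        <!-- Not confirmed: forget both values and ask again -->
        <clear namelist="account check"/>
      </if>
    </filled>
  </subdialog>
</form>

<form id="confirm_account">
  <!-- Declares the parameter passed in by the caller -->
  <var name="account_to_confirm"/>
  <field name="confirmed" type="boolean">
    <prompt>
      You said <value expr="account_to_confirm"/>. Is that right?
    </prompt>
    <filled>
      <!-- Hand the answer back to the calling form -->
      <return namelist="confirmed"/>
    </filled>
  </field>
</form>

The returned value shows up in the calling form as a property of the subdialog's variable, here check.confirmed, much like a function's return value.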

Sessions

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

Applications

An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars can also be set to remain active for the duration of the application.
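
As a sketch under assumed file names (root.vxml and the variable bank_name are invented here), a root document might declare a shared variable and a link whose grammar stays active throughout the application:

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Visible to other documents as application.bank_name -->
  <var name="bank_name" expr="'Big Buck\'s Bank'"/>
  <!-- Active in every document of the application -->
  <link next="/servlet/operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
</vxml>

Another document joins the application by naming that root in its application attribute:

<?xml version="1.0"?>
<vxml version="1.0" application="root.vxml">
  <form>
    <block>
      Welcome to <value expr="application.bank_name"/>.
    </block>
  </form>
</vxml>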

Grammars

Each dialog has one or more speech and/or DTMF grammars associated with it. In machine-directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed-initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or in another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

<link event="help"> 
  <grammar type="application/x-jsgf"> 
    [please] help [me] [please] |
    [please] I (need|want) help [please] 
  </grammar> 
</link> 
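
Flagging a dialog so that its grammar is listened for elsewhere in the document could look roughly like the sketch below. The operator dialog and its grammar are invented for illustration, and the details of grammar scoping may differ between platforms.

<!-- scope="document" keeps this form's grammar active
     while the user is in any other dialog of the document -->
<form id="operator" scope="document">
  <grammar type="application/x-jsgf">
    operator | [a] real person
  </grammar>
  <block>
    Connecting you to an operator.
    <goto next="/servlet/operator.vxml"/>
  </block>
</form>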

Events

VoiceXML provides a form-filling mechanism for handling “normal” user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.

Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn’t respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements “as if by copy.” In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.

<catch event="help">
  Please speak the account number for which you
  want the balance.
</catch>
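
The shorthand elements and the inheritance rule might be combined as in this sketch, which reuses the field names from the earlier form; the prompt wording and the operator fallback are assumptions for illustration.

<form id="balance_info">
  <!-- Form-level handlers: inherited by every field below -->
  <noinput>I did not hear anything. <reprompt/></noinput>
  <nomatch>I did not understand that. <reprompt/></nomatch>
  <field name="account" type="digits">
    <prompt>What account number?</prompt>
    <!-- Shorthand for <catch event="help"> -->
    <help>
      Please speak the account number for which you
      want the balance.
    </help>
  </field>
  <field name="pin" type="digits">
    <prompt>Your PIN?</prompt>
    <!-- Field-level handler: takes over on the third mismatch -->
    <nomatch count="3">
      Transferring you to an operator.
      <goto next="/servlet/operator.vxml"/>
    </nomatch>
  </field>
</form>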

Links

A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A <link> can be used to throw an event or to go to a destination URI.

<link next="/servlet/account.vxml"> 
  <grammar type="application/x-jsgf"> 
       account | Account balance inquiry
  </grammar> 
  <dtmf>1</dtmf> 
</link> 

Architecture

A document server (e.g., a Web server) processes requests from a client application, the VoiceXML interpreter, which are made through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

Applications

Here are a few ideas for voice applications:

Information retrieval applications: Output tends to be pre-recorded information, and voice input is often constrained to a few navigation commands and limited data entry (e.g., “previous,” “next” to control the data flow). Information retrieval applications can provide news, sports, traffic, weather, and stock information, as well as more specialized information (e.g., intranet-based company news). Voice output could be used extensively in applications, for instance to give driving directions.

Electronic commerce: Customer service applications such as account status (see our earlier example), package tracking, and call centers are well suited. Financial applications for banking, stock quotes, and trading seem feasible, too.

Telephone services: Voice dialing and telephone conference room management can be voice-enabled using VoiceXML. An organization can make available a voice Web site with company information, news, upcoming events, and an address book. The address book could be used for voice dialing people in that organization.

Unified messaging applications can leverage VoiceXML. E-mail messages can be read over the phone, outgoing e-mail can be recorded (and in the future transcribed) over the phone, and voice-oriented address information can be synchronized with personal organizers and e-mail systems. Pager messages can be originated from the phone, or routed to the phone.

Intranet applications for inventory control, supply chain management, and human resource services can be voice-enabled with VoiceXML since the security mechanisms of the Web apply there, too.

There are many other areas where voice services will be used. While all VoiceXML services will benefit visually impaired people, it may be that other VoiceXML services will be specially created for this community.

Conclusion

Voice-enabled applications will grow by leaps and bounds in the next couple of months, and any service that can be requested through an HTML form could also be made available through VoiceXML. If a clean distinction between logic and presentation exists in your scripts and servlets for Web-based applications, these might even be reusable to power voice applications, just changing the presentation layer from HTML to VoiceXML. Good application architecture pays off sometimes…
