
The Voice of XML

  • By Michael Classen


A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.


An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars can also be set to remain active for the duration of the application.
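The relationship between a root document and the documents that share it can be sketched as follows. This is a minimal illustration assuming VoiceXML 1.0 syntax; the file names (app-root.vxml, operator.vxml, balance.vxml), the variable name, and the prompt text are hypothetical and not taken from a real application.

```xml
<?xml version="1.0"?>
<!-- app-root.vxml: hypothetical application root document -->
<vxml version="1.0">
  <!-- Other documents in the application can read this as application.greeting -->
  <var name="greeting" expr="'Welcome to the example bank'"/>

  <!-- A link at the top level of the root document; its grammar stays
       active while any document of the application is loaded -->
  <link next="operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
</vxml>
```

A document joins the application by naming the root in the application attribute of its vxml element:

```xml
<?xml version="1.0"?>
<!-- balance.vxml: hypothetical document belonging to the application -->
<vxml version="1.0" application="app-root.vxml">
  <form id="welcome">
    <block>
      <!-- Reads the variable declared in the root document -->
      <prompt><value expr="application.greeting"/>.</prompt>
    </block>
  </form>
</vxml>
```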


Each dialog has one or more speech and/or DTMF grammars associated with it. In machine-directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed-initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
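The difference between the two styles can be sketched with two forms in one document. This is a minimal example assuming VoiceXML 1.0 syntax and inline JSGF grammars like those used elsewhere in this article; the form ids, city names, and prompts are made up for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Machine-directed: this field grammar is listened for only
       while the user is in this form -->
  <form id="weather">
    <field name="city">
      <prompt>For which city would you like the weather?</prompt>
      <grammar type="application/x-jsgf"> boston | chicago | denver </grammar>
      <filled>
        <prompt>Getting the weather for <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>

  <!-- scope="document" flags this form so that its form-level grammar
       is also active while the user is in other dialogs of this document -->
  <form id="goodbye" scope="document">
    <grammar type="application/x-jsgf"> goodbye | quit </grammar>
    <block>
      <prompt>Thanks for calling.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```

If the caller says "goodbye" while being asked for a city, execution should move to the goodbye form. Placing such a form in the application root document would extend its reach to the other documents of the application.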

<link event="help">
  <grammar type="application/x-jsgf">
    [please] help [me] [please] |
    [please] I (need|want) help [please]
  </grammar>
</link>


VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.
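As a rough sketch of form filling, assuming VoiceXML 1.0 syntax, the form below collects one value and confirms it. The form id, the field name acctnum, and the prompts are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="balance">
    <!-- type="digits" is a built-in field type that accepts a spoken
         or keyed string of digits -->
    <field name="acctnum" type="digits">
      <prompt>What is your account number?</prompt>
      <!-- Executed once the user's input has filled the field -->
      <filled>
        <prompt>Looking up account <value expr="acctnum"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```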

Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy." In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.

<catch event="help">
  Please speak the account number for which you
  want the balance.
</catch>
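Inheritance and the shorthand elements can be sketched together. The example below assumes VoiceXML 1.0 syntax; the form id, field name, and the wording of the first prompt are illustrative only.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level handler: inherited "as if by copy" by every
       form and field in this document -->
  <catch event="noinput nomatch">
    <prompt>Sorry, I did not get that.</prompt>
    <reprompt/>
  </catch>

  <form id="balance">
    <field name="acctnum" type="digits">
      <prompt>What is your account number?</prompt>
      <!-- <help> is the shorthand for <catch event="help">;
           it applies to this field only -->
      <help>
        Please speak the account number for which you want the balance.
      </help>
    </field>
  </form>
</vxml>
```

Here silence or unintelligible input in any field falls through to the shared document-level handler, while a request for help gets the field-specific message.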


A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A <link> can be used either to throw an event or to go to a destination URI.

<link next="/servlet/account.vxml">
  <grammar type="application/x-jsgf">
    account | Account balance inquiry
  </grammar>
</link>


A document server (e.g. a Web server) processes requests from a client application using the VoiceXML Interpreter through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.


Here are a few ideas for voice applications:

Information retrieval applications: Output tends to be pre-recorded information, and voice input is often constrained to a few navigation commands and limited data entry (e.g., "previous," "next" to control the data flow). Information retrieval applications can provide news, sports, traffic, weather, and stock information, as well as more specialized information (e.g., intranet-based company news). Voice output could be used extensively in applications, for instance to give driving directions.

Electronic commerce: Customer service applications such as account status (see our earlier example), package tracking, and call centers are well-suited. Financial applications for banking, stock quotes, and trading seem feasible, too.

Telephone services: Voice dialing and telephone conference room management can be voice-enabled using VoiceXML. An organization can make available a voice Web site with company information, news, upcoming events, and an address book. The address book could be used in voice dialing people in that organization.

Unified messaging applications can leverage VoiceXML. E-mail messages can be read over the phone, outgoing e-mail can be recorded (and in the future transcribed) over the phone, and voice-oriented address information can be synchronized with personal organizers and e-mail systems. Pager messages can be originated from the phone, or routed to the phone.

Intranet applications for inventory control, supply chain management, and human resource services can be voice-enabled with VoiceXML since the security mechanisms of the Web apply there, too.

There are many other areas where voice services will be used. While all VoiceXML services will benefit visually impaired people, it may be that other VoiceXML services will be specially created for this community.


Voice-enabled applications will grow by leaps and bounds in the next couple of months, and any service that can be requested through an HTML form could also be made available through VoiceXML. If a clean distinction between logic and presentation exists in your scripts and servlets for Web-based applications, these might even be reusable to power voice applications, just changing the presentation layer from HTML to VoiceXML. Good application architecture pays off sometimes...
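To make the parallel with an HTML form concrete, here is a hedged sketch of a VoiceXML form that hands its input to server-side logic. It assumes VoiceXML 1.0 syntax; the URI /servlet/balance and the field name acctnum are hypothetical, and the servlet is assumed to answer with a VoiceXML document rather than an HTML page.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="balance">
    <!-- Plays the role of a text input in an HTML form -->
    <field name="acctnum" type="digits">
      <prompt>What is your account number?</prompt>
    </field>
    <!-- Plays the role of the submit button: sends acctnum to the
         server as a request parameter once the field is filled -->
    <block>
      <submit next="/servlet/balance" namelist="acctnum"/>
    </block>
  </form>
</vxml>
```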



This article was originally published on December 7, 2002
