This article is the second in a three-part series that provides an introduction to VoiceXML, as well as SRGS, SSML, and SISR, for building conversational web applications. The first installment, Building VoiceXML Dialogs, discussed building VoiceXML dialogs through both menu and form elements. This part outlines how VoiceXML takes advantage of the distributed web-based application model, as well as advanced features including local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs. Finally, the third article will discuss natural vs. direct dialogue and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.
As noted in the first article, the web has primarily delivered information and services using visual interfaces, and as a result has largely bypassed customers who rely primarily on the telephone, for whom voice input and audio output are the primary means of interaction.
VoiceXML 2.0 [VXML2] is helping to change that. Building on top of the market established in 1999 by the VoiceXML Forum’s VoiceXML 1.0 specification [VXML1], VoiceXML 2.0 and several complementary standards are changing the way we interact with voice services and applications – by simplifying the way these services and applications are built.
VoiceXML Uses Web Standards
As mentioned previously, VoiceXML uses web standards such as XML and HTTP. The power of the web lies, of course, in the fact that it brings each user a worldwide array of information and services, and conversely gives each information and service provider a worldwide customer base. A distributed application model is thus fundamental to the web; VoiceXML builds on the same distributed model that has been so successful for visual web-based services.
The distributed web-based application model as used by VoiceXML services accessed by telephone is illustrated by the following diagram:
As the diagram shows, the architecture for a voice application is the same as that of an HTML application. A browser, whether VoiceXML or HTML, retrieves a page and renders it for the user, audibly or visually; the user then responds, by voice in the case of VoiceXML, or with the keyboard and mouse in the case of HTML.
In addition to the core capabilities described previously, VoiceXML provides a number of more advanced features: local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs.
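Local validation can be sketched with VoiceXML's <filled> element, which runs after a field is filled and can test and clear the value without a round trip to the server. The form and field names here, and the one-to-ten range, are illustrative only:

```xml
<form id="order">
  <field name="quantity" type="number">
    <prompt>How many tickets would you like?</prompt>
    <filled>
      <!-- Local validation: reject out-of-range values without a server round trip -->
      <if cond="quantity &lt; 1 || quantity &gt; 10">
        <prompt>Please choose between one and ten tickets.</prompt>
        <clear namelist="quantity"/>
      </if>
    </filled>
  </field>
</form>
```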
Playback of pre-recorded audio prompts is accomplished using the <audio> element from SSML. Recording of user messages is done with the <record> element.
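A minimal sketch of playback and recording together might look as follows; the audio URL, field name, and timing values are illustrative:

```xml
<form id="message">
  <block>
    <!-- Pre-recorded prompt, with inline text as a TTS fallback
         if the audio file cannot be fetched -->
    <audio src="http://www.example.com/welcome.wav">
      Welcome to the message center.
    </audio>
  </block>
  <!-- Record up to 20 seconds of user speech; a beep signals the start,
       and any DTMF key press terminates the recording -->
  <record name="msg" beep="true" maxtime="20s" dtmfterm="true">
    <prompt>Please record your message after the beep.</prompt>
  </record>
</form>
```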
Context-specific and tapered help are provided by a system of events and event handlers. VoiceXML defines a set of events corresponding, for example, to a user request for help, a failure of the user to respond within a timeout period, or a user input that doesn't match an active grammar. In any given context, such as a form or a field, the application may provide an event handler that responds appropriately to each event. Moreover, the help may be tapered: a count may be specified for each event handler so that a different handler is executed depending on how many times the event has occurred in that context. This can be used, for example, to provide increasingly detailed messages each time the user asks for help.
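Tapered handlers can be sketched as follows; the field name and prompt wording are illustrative:

```xml
<field name="city">
  <prompt>What city are you departing from?</prompt>
  <!-- First request for help -->
  <help count="1">Please say the name of a city.</help>
  <!-- More detailed guidance on the second and subsequent requests -->
  <help count="2">Say the departure city, for example, Boston or San Francisco.</help>
  <!-- No input within the timeout period -->
  <noinput>Sorry, I didn't hear you. <reprompt/></noinput>
  <!-- Input that doesn't match the active grammar -->
  <nomatch>Sorry, I didn't understand. <reprompt/></nomatch>
</field>
```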
Finally, VoiceXML provides support for subdialogs. A <subdialog> is an entire form that is executed, the result of which is used to fill in an input item in the calling form. This feature has two uses: it may be used to provide a disambiguation or confirmation dialog for an input, and it may be used to support reusable dialogs shared across several VoiceXML applications.
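A confirmation subdialog along these lines might be sketched as follows; the form ids, the boolean result variable, and the #payment target are illustrative. The called form hands its result back with <return>, and the caller reads it through the subdialog's name:

```xml
<form id="order">
  <!-- Run the confirmation form and fill this item with its result -->
  <subdialog name="confirm" src="#confirm_date">
    <filled>
      <if cond="confirm.ok">
        <goto next="#payment"/>
      </if>
    </filled>
  </subdialog>
</form>

<form id="confirm_date">
  <field name="ok" type="boolean">
    <prompt>You asked to travel on Monday. Is that correct?</prompt>
  </field>
  <block>
    <!-- Return the named variable to the calling form -->
    <return namelist="ok"/>
  </block>
</form>
```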
Comparison with HTML
While VoiceXML reuses many concepts and designs from HTML, it differs in several aspects because of differences between visual and voice interactions.
An HTML document is a single unit that is fetched from a network resource specified by a URI and presented to the user all at once; in contrast, a VoiceXML document contains a number of dialog units – menus or forms – that are presented sequentially. This difference arises because the visual medium is capable of displaying a number of items in parallel, while the voice medium is inherently sequential.
Thus while a given VoiceXML document may contain the same information as a corresponding HTML document, the VoiceXML document will be structured differently to reflect the sequential nature of the voice medium. So for example the HTML equivalent of the menu in Example 1 above might be:
Please select a service.
<a href="http://www.example.com/sports.html"> Sports scores </a>
<a href="http://www.example.com/weather.html"> Weather information </a>
<a href="#login"> Log in. </a>
In HTML there is no need to identify this menu as a unit, though for accessibility reasons it's desirable to do so – XHTML2 has an <nl> (navigation list) element for exactly this reason.
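For comparison, the VoiceXML menu of Example 1 (reconstructed here in outline, since the example appears earlier in the series) might look something like this; the .vxml URLs mirror the HTML targets above and are illustrative:

```xml
<menu>
  <prompt>
    Please select a service. <enumerate/>
  </prompt>
  <choice next="http://www.example.com/sports.vxml">Sports scores</choice>
  <choice next="http://www.example.com/weather.vxml">Weather information</choice>
  <choice next="#login">Log in</choice>
</menu>
```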
VoiceXML on the other hand requires dialog elements (menus and forms) to be identified as distinct units so that they may be presented one at a time to the user. Thus while an HTML document functions in effect as a single dialog unit, a VoiceXML document is a container of dialog units such as menus and forms, and each contains logic to sequence the interpreter to the next unit.
Another consequence of the sequential nature of the voice medium is a need for the markup to contain application logic for sequencing among dialog units. This is reflected in a tighter integration of sequential logic elements into VoiceXML than in HTML. For example, VoiceXML contains markup elements for sequence control, while in HTML such control is only available through the more cumbersome method of scripting.
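The <goto> element is one such sequence-control element: when a field is filled, it transfers the interpreter directly to another dialog unit. The form and field names here are illustrative:

```xml
<form id="login">
  <field name="pin" type="digits">
    <prompt>Please say your PIN.</prompt>
    <filled>
      <!-- Built-in sequence control: continue at the next dialog unit -->
      <goto next="#main_menu"/>
    </filled>
  </field>
</form>
```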
In summary, VoiceXML builds on the same distributed model that has been so successful for visual web-based services. In addition to its core capabilities, VoiceXML provides a number of more advanced features: local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs. Finally, because the voice medium is inherently sequential, VoiceXML differs from HTML in several aspects of its design.
The third and final part of Building Conversational Applications Using VoiceXML will discuss natural vs. direct dialogue and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.
About the Authors
Jeff Kusnitz has been with IBM since 1987 and focuses on telephony and speech recognition platforms. He is currently IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group on voice application specifications and platform and developer certifications.
Bruce Lucas has been with IBM since 1986. He was the lead designer and developer for IBM’s Speech Mark-up Language and VoiceXML browsers, and has been IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group, and co-author of and major contributor to the VoiceXML 1.0 and 2.0 and related W3C specifications.