Building VoiceXML Dialogs

  • By Jeff Kusnitz & Dr. Bruce Lucas


Until fairly recently, the web has primarily delivered information and services through visual interfaces, on computers equipped with displays, keyboards, and pointing devices. The web revolution largely bypassed the huge market of information and services represented by the worldwide installed base of telephones, for which voice input and audio output provide the primary means of interaction.

VoiceXML 2.0 [VXML2], a standard recently released by the W3C [W3C], is helping to change that. Building on the market established in 1999 by the VoiceXML Forum's VoiceXML 1.0 specification [VXML1], VoiceXML 2.0 and several complementary standards are changing the way we interact with voice services and applications - by simplifying the way these services and applications are built.

VoiceXML is an XML-based [XML] language, designed to be used on the Web. As such, it inherits several key features common to all XML languages:

  • It leverages existing Web protocols such as HTTP to access remote resources
  • Any tool that is able to read or write XML documents can read and write a VoiceXML document
  • VoiceXML documents can embed other XML documents and fragments, and can themselves be embedded in other XML documents. This is the case with SRGS and SSML, which are described later.
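For instance, SSML markup can be embedded directly inside a VoiceXML prompt. The following is a minimal sketch (the greeting text and form name are illustrative; `<break>` and `<emphasis>` are standard SSML elements):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="greeting">
        <block>
            <!-- The prompt's content is SSML embedded in the VoiceXML document -->
            <prompt>
                Welcome. <break time="500ms"/>
                <emphasis>Please listen carefully</emphasis> to the following options.
            </prompt>
        </block>
    </form>
</vxml>
```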

As mentioned above, VoiceXML 2.0 is one of a number of standards the W3C Voice Browser Working Group is defining to enable the development of conversational voice applications. The specifications making up the Speech Interface Framework are:

  • Speech Recognition Grammar Specification 1.0 [SRGS]
  • Speech Synthesis Markup Language 1.0 [SSML]
  • Semantic Interpretation for Speech Recognition 1.0 [SISR]
  • Call Control Markup Language 1.0 [CCXML]

This article is the first in a three-part series that provides an introduction to VoiceXML, as well as SRGS, SSML, and SISR for building conversational web applications. In this first installment the focus will be on building VoiceXML dialogs through both menu and form elements. The second part will outline how VoiceXML takes advantage of the distributed web-based application model as well as advanced features including: local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable sub dialogs. Finally, the third article will discuss natural vs. direct dialogue and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.

The Menu Element

Most VoiceXML dialogs are built from one of two elements. The first of these is the <menu> element. A VoiceXML menu behaves much like a collection of HTML links.

A VoiceXML menu has a <prompt>, which contains SSML content, and one or more choices, each identified by a <choice> tag. Each choice consists of a phrase indicating what the user can say, as well as a link to the next VoiceXML document to be executed.

Consider this <menu> example:

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

        <menu>
            <prompt>Say one of: <enumerate/></prompt>
            <choice next="http://www.example.com/sports.vxml">
                Sports scores
            </choice>
            <choice next="http://www.example.com/weather.vxml">
                Weather information
            </choice>
            <choice next="#login">
                Log in
            </choice>
        </menu>

    </vxml>


One possible path through this dialog would be:

   Browser: Say one of: Sports scores; Weather information; Log in.
   User: Sports scores

When the VoiceXML Browser recognizes that the user has spoken "sports scores," it fetches the VoiceXML document identified by the corresponding choice (http://www.example.com/sports.vxml) and begins executing it, presumably providing the user with sports information.
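The fetched document is itself ordinary VoiceXML. As a sketch of what sports.vxml might contain (the document's contents, including the score text and the main.vxml URL, are hypothetical and not part of the original example):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form>
        <block>
            <prompt>The home team won last night, three to one.</prompt>
            <!-- Hypothetical: return to a main menu document when playback finishes -->
            <goto next="http://www.example.com/main.vxml"/>
        </block>
    </form>
</vxml>
```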

The Form Element

The second dialog element in VoiceXML is the <form> element. A VoiceXML form is very similar to an HTML form in that it typically contains one or more input fields that a user must complete. Each input field in a form has a prompt and a specification of what a user can say to fill in the field.

A sample "login" form might look like this:

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

        <form id="login">

            <field name="phone_number" type="phone">
                <prompt>Please say your complete phone number</prompt>
            </field>

            <field name="pin_code" type="digits">
                <prompt>Please say your PIN code</prompt>
            </field>

            <block>
                <submit next="http://www.example.com/servlet/login"
                        namelist="phone_number pin_code"/>
            </block>

        </form>

    </vxml>


When this form is executed, the dialog flow would be:

   Browser: Please say your complete phone number
   User: 800-555-1212
   Browser: Please say your PIN code
   User: 1 2 3 4

As each <field> is executed, its <prompt> is played. Following the prompt, the user responds by speaking the requested information. When both fields in the form have been filled, the final <block> is executed. In this example, the block executes a <submit> tag, which sends the variables phone_number and pin_code to the "login" servlet, in much the same way that a "submit" button works on an HTML form. The servlet would then return a new document for the VoiceXML Browser to execute.
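The servlet's response is just another VoiceXML document. As a sketch of what a successful login might return (the confirmation text and the main.vxml URL are hypothetical; the real servlet response is not shown in the original):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form>
        <block>
            <prompt>Thank you. You are now logged in.</prompt>
            <!-- Hypothetical next step: continue to the application's main menu -->
            <goto next="http://www.example.com/main.vxml"/>
        </block>
    </form>
</vxml>
```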


This article was originally published on August 16, 2004
