
Building VoiceXML Dialogs



Until fairly recently, the web has primarily delivered information and services using visual interfaces, on computers equipped with displays, keyboards, and pointing devices. The web revolution had largely bypassed the huge market of customers of information and services represented by the worldwide installed base of telephones for which voice input and audio output provided the primary means of interaction.

VoiceXML 2.0 [VXML2], a standard recently released by the W3C [W3C], is helping to change that. Building on the market established in 1999 by the VoiceXML Forum’s VoiceXML 1.0 specification [VXML1], VoiceXML 2.0 and several complementary standards are changing the way we interact with voice services and applications by simplifying the way these services and applications are built.

VoiceXML is an XML-based [XML] language, designed to be used on the Web. As such, it inherits several key features common to all XML languages:

  • It leverages existing Web protocols such as HTTP to access remote resources
  • Any tool that is able to read or write XML documents can read and write a VoiceXML document
  • Other XML documents and fragments can be embedded in VoiceXML documents; similarly, VoiceXML documents and fragments can be embedded in other XML documents. This is the case with SRGS and SSML, which are described later.

As mentioned above, VoiceXML 2.0 is one of a number of standards the W3C Voice Browser Working Group is defining to enable the development of conversational voice applications. The specifications making up the Speech Interface Framework are:

  • Speech Recognition Grammar Specification 1.0 [SRGS]
  • Speech Synthesis Markup Language 1.0 [SSML]
  • Semantic Interpretation for Speech Recognition 1.0 [SISR]
  • Call Control Markup Language 1.0 [CCXML]

This article is the first in a three-part series that introduces VoiceXML, along with SRGS, SSML, and SISR, for building conversational web applications. This first installment focuses on building VoiceXML dialogs using the menu and form elements. The second part will outline how VoiceXML takes advantage of the distributed web-based application model, as well as advanced features including local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs. Finally, the third article will discuss natural versus directed dialogs and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.

The Menu Element

Most VoiceXML dialogs are built from one of two elements. The first of these is the <menu> element. A VoiceXML menu behaves much like a collection of HTML links.

A VoiceXML menu has a <prompt>, which contains SSML content, and one or more choices, each identified by a <choice> tag. Each choice consists of a phrase indicating what the user can say, as well as a link to the next VoiceXML document to be executed.

Consider this <menu> example:

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

        <menu>
            <prompt>Say one of: <enumerate/></prompt>
            <choice next="http://www.example.com/sports.vxml">
                Sports scores
            </choice>
            <choice next="http://www.example.com/weather.vxml">
                Weather information
            </choice>
            <choice next="#login">
                Log in
            </choice>
        </menu>

    </vxml>

One possible path through this dialog would be:

   Browser: Say one of: Sports scores; Weather information; Log in.
   User: Sports scores

When the VoiceXML Browser recognizes that the user has spoken “sports scores,” it fetches the VoiceXML document identified by the corresponding choice (http://www.example.com/sports.vxml) and begins executing it, presumably providing the user with sports information.
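
The fetched document is itself an ordinary VoiceXML page. As a sketch of what sports.vxml might contain (the wording and content are illustrative; a real service would generate the scores dynamically):

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
        <form>
            <block>
                <!-- Hypothetical content; a real service would insert live scores here -->
                <prompt>Here are today's sports scores.</prompt>
            </block>
        </form>
    </vxml>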

The Form Element

The second dialog element in VoiceXML is the <form> element. A VoiceXML form is very similar to an HTML form in that it typically contains one or more input fields that a user must complete. Each input field in a form has a prompt and a specification of what a user can say to fill in the field.

A sample “login” form might look like this:

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

        <form id="login">

            <field name="phone_number" type="phone">
                <prompt>Please say your complete phone number</prompt>
            </field>

            <field name="pin_code" type="digits">
                <prompt>Please say your PIN code</prompt>
            </field>

            <block>
                <submit next="http://www.example.com/servlet/login"
                        namelist="phone_number pin_code"/>
            </block>

        </form>

    </vxml>

When this form is executed, the dialog flow would be:

   Browser: Please say your complete phone number
   User: 800-555-1212
   Browser: Please say your PIN code
   User: 1 2 3 4

As each <field> is executed, its <prompt> is played. Following the prompt, the user responds by speaking the requested information. When both fields in the form have been filled, the final <block> is executed. In this example, the block executes a <submit> tag, which sends the variables phone_number and pin_code to the “login” servlet, in much the same way as a “submit” button works on an HTML form. The servlet would then return a new document for the VoiceXML Browser to execute.
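
The servlet’s response is itself a VoiceXML document. As an illustrative sketch (the exact markup a real login servlet would return is application-specific), a successful login might produce:

    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
        <form>
            <block>
                <prompt>Welcome back. You are now logged in.</prompt>
                <!-- Continue to a hypothetical main menu of the application -->
                <goto next="http://www.example.com/main.vxml"/>
            </block>
        </form>
    </vxml>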

As mentioned earlier, each field specifies the set of acceptable user responses. Limiting the acceptable responses serves two purposes. First, it allows responses to be validated, and help for invalid responses to be provided, locally, without the delay of a round trip over the network to the application server. Second, constraining user input to particular sets and patterns of words is essential to achieving good speech-recognition accuracy, particularly over a relatively low-quality audio channel such as a telephone.
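
VoiceXML supports this local handling through catch elements such as <nomatch>, <noinput>, and <help>, which can be attached to a field. A sketch of how the PIN field from the login form might reprompt locally (the wording is illustrative):

    <field name="pin_code" type="digits">
        <prompt>Please say your PIN code</prompt>
        <!-- Played when the input does not match the field's grammar -->
        <nomatch>Sorry, I didn't understand. Please say the digits of your PIN.</nomatch>
        <!-- Played when the caller says nothing within the timeout -->
        <noinput>I didn't hear anything. Please say your PIN code.</noinput>
        <!-- Played when the caller asks for help -->
        <help>Your PIN is the four digit code you chose when you registered.</help>
    </field>

All of this handling happens in the browser; no request is made to the application server until the field is successfully filled.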

In the earlier example, the set of acceptable user inputs is specified implicitly using the “type” attribute of the <field> element (“phone” and “digits” in the example). The VoiceXML 2.0 specification defines a number of built-in types that a VoiceXML browser may optionally provide:

  • boolean
  • date
  • digits
  • currency
  • number
  • phone
  • time
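
Using a built-in type is simply a matter of naming it in the field’s “type” attribute. For example, a hypothetical appointment field using the built-in date type:

    <field name="appointment" type="date">
        <prompt>What day would you like to come in?</prompt>
    </field>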

In addition to these built-in types, a VoiceXML application can specify its own input types using grammars. A grammar is essentially a compact enumeration of a set of allowable phrases. The following VoiceXML fragment illustrates the use of grammars in a voice-enabled online restaurant application:


        <field name="drink">
            <prompt>What would you like to drink?</prompt>
            <grammar mode="voice" xml:lang="en-US" version="1.0" root="drink">
                <rule id="drink">
                    <one-of>
                        <item> coffee </item>
                        <item> tea </item>
                        <item> orange juice </item>
                        <item> milk </item>
                        <item> nothing </item>
                    </one-of>
                </rule>
            </grammar>
        </field>

        <field name="sandwich">
            <prompt>What sandwich would you like?</prompt>
            <grammar src="sandwiches.grxml"/>
        </field>

        <block>
            <submit next="http://www.example.com/servlet/getOrder"/>
        </block>


The grammars in this example are specified using the W3C Speech Recognition Grammar Specification [SRGS] format. The first grammar is in-line, and it simply identifies a list of words and phrases (“coffee”, “tea”, and so on) that the user may say in response to the prompt for that field. Surrounding the list of items with a <one-of> element tells the VoiceXML browser that the user can speak only one of these items at a time.

The second grammar is contained in the file “sandwiches.grxml” and is referenced via a URI:

    <grammar mode="voice" xml:lang="en-US" version="1.0"
             root="sandwich"
             xmlns="http://www.w3.org/2001/06/grammar">

        <rule id="bread">
            <one-of>
                <item> rye </item>
                <item> white </item>
                <item> whole wheat </item>
            </one-of>
        </rule>

        <rule id="ingredient">
            <one-of>
                <item> ham </item>
                <item> roast beef </item>
                <item> tomato </item>
                <item> lettuce </item>
                <item> swiss <item repeat="0-1"> cheese </item> </item>
            </one-of>
        </rule>

        <rule id="sandwich">
            <ruleref uri="#ingredient"/>
            <item repeat="0-">
                <item repeat="0-1"> and </item>
                <ruleref uri="#ingredient"/>
            </item>
            <item> on </item>
            <ruleref uri="#bread"/>
        </rule>

    </grammar>


This grammar consists of three rules. The first rule, named “bread”, is just a list of bread types, similar to the “drink” grammar that was placed in-line in the form. It allows the user to say “rye”, “white”, or “whole wheat.”

The second rule in this grammar, named “ingredient”, is also a fairly simple list of items, but one of its items includes an optional part: a repeat attribute on the “cheese” item makes “cheese” optional. To match this item, the user can say either “swiss” or “swiss cheese”.

The third rule, named “sandwich”, specifies that a complete description of a sandwich consists of a series of rule references, to the other rules defined within this grammar. It states that a “sandwich” is made up of at least one ingredient, followed by zero or more additional ingredients optionally separated by the word “and”, and ending finally with the word “on” followed by the name of a bread. This rule would accept phrases such as “ham and swiss on rye” and “lettuce and tomato on whole wheat.”

A grammar’s header is typically used to identify a single rule within the grammar as its “root rule” via the “root” attribute. This root rule is automatically activated when the grammar is referenced by a VoiceXML application, unless the application indicates that a different rule should be applied.
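
To activate a rule other than the root, the application appends the rule name as a fragment to the grammar URI. For example, a hypothetical field that only needs the bread choices from the sandwich grammar could reference that rule directly:

    <field name="bread">
        <prompt>What kind of bread?</prompt>
        <!-- Activates the "bread" rule rather than the grammar's root rule -->
        <grammar src="sandwiches.grxml#bread"/>
    </field>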

The grammars in the previous examples were written in the SRGS XML format. They could also have been written using the SRGS Augmented BNF (ABNF) syntax, which is more compact and more human-readable, and which many grammar developers find more familiar.

An ABNF version of the sandwich grammar might look like this:

  #ABNF 1.0 ISO-8859-1;

  language en-US;
  root $sandwich;

  $ingredient = ham | roast beef | tomato | lettuce |
                swiss [ cheese ];

  $bread = rye | white | whole wheat;

  $sandwich = $ingredient ( [ and ] $ingredient ) <0-> on $bread;

A typical dialog enabled by the above form and either of the grammars
might be:

   Browser: What would you like to drink?
   User: Orange juice
   Browser: What sandwich would you like?
   User: Roast beef lettuce and swiss on rye

As with the previous form example, once the browser has collected the input for both fields, the final block will be executed and cause the variables drink and sandwich to be sent to the getOrder application for processing.

In each of the preceding VoiceXML examples, <prompt> tags were used to indicate text that would be synthesized by the browser and spoken to the user. The content of each prompt is Speech Synthesis Markup Language, or SSML. SSML not only allows a voice application developer to specify text to be synthesized, but also provides a means to specify prerecorded audio to be played.
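
Prerecorded audio is specified with the <audio> element; any text inside the element serves as a fallback that is synthesized if the audio file cannot be fetched. A sketch (the file name is illustrative):

    <prompt>
        <audio src="welcome.wav">Welcome to the sandwich shop.</audio>
    </prompt>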

In addition, SSML provides numerous parameters to control the output itself: the output volume, the rate at which synthesized text is spoken, which portions are emphasized, and so on.
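
These parameters are controlled with SSML elements such as <prosody> and <emphasis>. For example, a prompt that slows down and stresses the important part of an order confirmation might look like this:

    <prompt>
        Your total is
        <prosody rate="slow" volume="loud">
            <emphasis>twelve dollars</emphasis>
        </prosody>.
        Thank you for your order.
    </prompt>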

A complete discussion of SRGS and SSML is beyond the scope of this article. For further reading, their respective specifications are a good starting point. Stay tuned for the second installment of Building Conversational Applications Using VoiceXML, which will address how VoiceXML takes advantage of the distributed web-based application model, as well as advanced features including local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs.

About the Authors

Jeff Kusnitz has been with IBM since 1987 and focuses on telephony and speech recognition platforms. He is currently IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group on voice application specifications and platform and developer certifications.

Bruce Lucas has been with IBM since 1986. He was the lead designer and developer for IBM’s Speech Mark-up Language and VoiceXML browsers, and has been IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group, and co-author of and major contributor to the VoiceXML 1.0 and 2.0 and related W3C specifications.
