This article is the third in a three-part series providing an introduction to VoiceXML, as well as SRGS, SSML, and SISR, for building conversational web applications. The first installment discussed building VoiceXML dialogs through both menu and form elements. The second outlined how VoiceXML takes advantage of the distributed web-based application model, as well as advanced features including local validation and processing, audio playback and recording, support for context-specific and tapered help, and support for reusable subdialogs. This final piece discusses natural versus directed dialogs and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.
To review from the first two articles, the web has primarily delivered information and services using visual interfaces, and as a result has largely bypassed customers who rely on the telephone, for which voice input and audio output are the primary means of interaction.
Building on the market established in 1999 by the VoiceXML Forum’s VoiceXML 1.0 specification [VXML1], VoiceXML 2.0 and several complementary standards are changing the way we interact with voice services and applications by simplifying the way these services and applications are built.
Natural Dialog
VoiceXML as presented in the first two articles provides the capability for simple “directed” dialogs, meaning that the computer directs the conversation at each step by prompting the user for the next piece of information. Dialogs between humans of course don’t operate on this simple model. In a natural dialog each participant may at various stages take the initiative in leading the conversation. A computer-human dialog modeled on this idea is referred to as a “mixed-initiative” dialog because either the computer or the human may take the initiative in leading the conversation.
The field of spoken interfaces is not nearly as mature as the field of visual interfaces, so standardizing an approach to natural dialog is more difficult than designing a standard language for describing visual interfaces like HTML. Nevertheless VoiceXML takes some modest steps toward allowing applications to be built that give the user some degree of control over the conversation.
In each of the dialogs shown below (taken from part #1 in this series), the user is asked to supply input by speaking a value for each field of a form in sequence.
Example 1:
Browser: Please say your complete phone number.
User: 800-555-1212
Browser: Please say your PIN code.
User: 1 2 3 4
Example 2:
Browser: What would you like to drink?
User: Orange juice.
Browser: What sandwich would you like?
User: Roast beef, lettuce, and swiss on rye.
The set of phrases that the user could speak in response to each field prompt was specified by a separate grammar for each field. This approach requires the user to supply the field values one at a time, in sequence.
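For reference, a field-level grammar for the drink question might look something like the following sketch (the exact grammar used in the earlier articles may have differed, and the drink choices shown here are purely illustrative):

<grammar mode="voice" xml:lang="en-US" version="1.0" root="drink">
  <!-- The user may say exactly one of the listed drinks. -->
  <rule id="drink">
    <one-of>
      <item> coffee </item>
      <item> tea </item>
      <item> orange juice </item>
      <item> milk </item>
    </one-of>
  </rule>
</grammar>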
Consider a dialog for making airline travel reservations in which the user must supply a date, a city to fly from, and a city to fly to. A directed dialog conversation for completing such a form might proceed as follows:
Browser: Where are you traveling from?
User: New York.
Browser: Where are you traveling to?
User: Chicago.
Browser: When would you like to travel?
User: …
By contrast, a somewhat more natural dialog might proceed as follows:
Browser: How can I help you?
User: I’d like to fly from New York to Chicago.
Browser: When would you like to travel?
User: …
VoiceXML enables such dialogs by allowing input grammars to be specified at the form level, not just at the field level, and by further annotating the rules in the grammar to identify which portion of the user’s input is intended for which field of the form. This annotation is accomplished through semantic interpretation.
Consider the following VoiceXML document:
<vxml version="2.0" > <form id="traveling"> <grammar src="travel.grxml" type="application/srgs+xml"/> <initial name="get_info"> <prompt> How can I help you? </prompt> <catch event="nomatch"> <prompt> Let's try getting each field separately. </prompt> <reprompt/> <assign name="get_info" expr="true"/> </catch> </initial> <field name="from_city"> <grammar src="travel.grxml#fromWhere"/> <prompt>What city are you traveling from?</prompt> </field> <field name="to_city"> <grammar src="travel.grxml#toWhere"/> <prompt>What city are you traveling to?</prompt> </field> <block> <submit next="http://www.example.com/servlet/flyaway"/> </block> </form> </vxml>
And the grammar that it references:
<grammar mode="voice" xml_lang="en-US" version="1.0" root="travel"> <rule id="travel"> <item repeat="0-1"> I'd like to fly </item> <one-of>> <item> <ruleref uri="#fromWhere"/> <ruleref uri="#toWhere"/> <tag> $.from_city = $fromWhere; $.to_city = $toWhere; </tag> </item> <item> <ruleref uri="#toWhere"/> <ruleref uri="#fromWhere"/> <tag> $.to_city = $toWhere; $.from_city = $fromWhere; </tag> </item> </one-of> </rule> <rule id="fromWhere" scope="public"> <item> <item> from </item> <ruleref uri="#cities"/> <tag> $ = $cities </tag> </item> </rule> <rule id="toWhere" scope="public"> <item> <item> to </item> <ruleref uri="#cities"/> <tag> $ = $cities </tag> </item> </rule> <rule id="cities"> <one-of> <item> New York </item> <item> Chicago </item> <item> San Francisco </item> </one-of> </rule> </grammar>
Although the grammar looks fairly complex, it really isn’t. It differs from the earlier grammars only in that it now includes semantic interpretation information, indicated by the <tag> elements throughout.
These <tag> elements determine what information is returned to the VoiceXML application when a rule is matched. If the “fromWhere” rule is matched, for example, and the user speaks the phrase “from San Francisco,” the tag contents say that the “fromWhere” rule should ignore the word “from” and return only whatever the “cities” rule returned ($ = $cities). The “cities” rule, of course, would have returned the phrase “San Francisco”, so the “fromWhere” rule would then also return “San Francisco”.
Similarly, the more complex “travel” rule says that the user can say the optional phrase “I’d like to fly” followed by a phrase matching the “fromWhere” rule and then a phrase matching the “toWhere” rule, or the same two phrases in the reverse order.
As in the simpler case, when the user says something that matches a rule of the grammar, the semantic interpretation information is processed. The “travel” rule’s <tag> element says that when a phrase is matched, the results of the “fromWhere” and “toWhere” rules are to be stored in the $.from_city and $.to_city variables respectively. This has a special meaning when the grammar is used as a form-level grammar, as it is here: the VoiceXML Browser fills the “from_city” field with the value of $.from_city, and the “to_city” field with the value of $.to_city.
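As a small, hypothetical illustration (this element is not part of the document above), a form-level <filled> element could be added to confirm the collected values once both fields have been filled, whether they were filled by the form-level grammar or by the individual field grammars:

<!-- Hypothetical addition: speak back both values once both fields are filled. -->
<filled mode="all" namelist="from_city to_city">
  <prompt>
    Flying from <value expr="from_city"/> to <value expr="to_city"/>.
  </prompt>
</filled>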
When the VoiceXML application above is executed, the VoiceXML Browser will activate the form-level grammar and then execute the <initial> element, which will cause the “How can I help you?” prompt to be played. At this point, the user can either say a phrase that matches the grammar, such as “I’d like to fly from San Francisco to New York” or he/she can say something that does not match the grammar. In the former case, the semantic interpretation processing will result in the form’s two input fields being filled in, followed by the <submit> element within the <block> being executed.
In the event that the user’s input does not match the active grammar, the nomatch catch handler assigns a value to the <initial> element’s form item variable, ending the mixed-initiative phase; the browser then prompts for and collects each field’s input separately.
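For example, assuming the user’s first response falls outside the grammar, the conversation might proceed along these lines (note that the field-level grammars reference the “fromWhere” and “toWhere” rules, so the words “from” and “to” are still expected):

Browser: How can I help you?
User: I need to visit my cousin. (no match)
Browser: Let’s try getting each field separately. What city are you traveling from?
User: From New York.
Browser: What city are you traveling to?
User: To Chicago.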
By further extending the grammar, any number of possible user inputs could be used to fill in the field information. For example, the “fromWhere” and “toWhere” rules could be made optional in the “travel” rule, allowing the user to speak “I’d like to fly from New York” without having to include “to Chicago.”
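One possible sketch of such a change, using the same rule names and the same tag notation as above (certainly not the only way to write it), makes the second leg of each ordering optional:

<!-- Sketch: "I'd like to fly from New York" now matches even though
     no destination is given; the reverse ordering is handled the same way. -->
<rule id="travel">
  <item repeat="0-1"> I'd like to fly </item>
  <one-of>
    <item>
      <ruleref uri="#fromWhere"/>
      <tag> $.from_city = $fromWhere; </tag>
      <item repeat="0-1">
        <ruleref uri="#toWhere"/>
        <tag> $.to_city = $toWhere; </tag>
      </item>
    </item>
    <item>
      <ruleref uri="#toWhere"/>
      <tag> $.to_city = $toWhere; </tag>
      <item repeat="0-1">
        <ruleref uri="#fromWhere"/>
        <tag> $.from_city = $fromWhere; </tag>
      </item>
    </item>
  </one-of>
</rule>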
The ability to accept such free-form utterances is only a first step toward natural dialog. Over time, VoiceXML will continue to evolve to incorporate more advanced features in support of natural dialog.
To review, until recently the web revolution had largely bypassed the huge market for information and services represented by the worldwide installed base of telephones. Thanks to the work of the W3C and the VoiceXML Forum, several complementary standards are changing the way we interact with voice services and applications by simplifying the way these services and applications are built. VoiceXML is an XML-based [XML] language designed to be used on the Web. As such, it inherits several key features common to all XML languages: first, it leverages existing Web protocols such as HTTP to access remote resources; second, any tool that is able to read or write XML documents can read and write a VoiceXML document; and third, VoiceXML documents and fragments can be embedded in other XML documents, and VoiceXML documents can in turn embed other XML documents and fragments.
About the Authors
Jeff Kusnitz has been with IBM since 1987 and focuses on telephony and speech recognition platforms. He is currently IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group on voice application specifications and platform and developer certifications.
Bruce Lucas has been with IBM since 1986. He was the lead designer and developer for IBM’s Speech Mark-up Language and VoiceXML browsers, and has been IBM’s representative to the VoiceXML Forum and the W3C Voice Browser working group, and co-author of and major contributor to the VoiceXML 1.0 and 2.0 and related W3C specifications.