VoiceXML Developer Series: A Tour Through VoiceXML, Part XII

  • By Jonathan Eisenzopf

In this edition of the series, we complete the first version of Frank's Pizza Palace application by developing the remaining VoiceXML dialogs.


Last time, we developed the first three dialogs in our application. Now it's time to complete the rest of the dialogs and begin testing our application.

The first three dialogs were main.vxml, telephone_number.vxml, and validate_phone_number.vxml. These dialogs played a greeting for the user, prompted them for their phone number, and looked up their address in the Access database, respectively.

Assuming that the user did indeed have a record in the database and confirmed that it was correct, the dialog transitions to take_order.vxml.


Now it's time to take the customer's order (view source). This VoiceXML dialog is similar to the pizza ordering application we developed in an earlier edition of this series, but a few things have changed. The customer's phone number has been stored in the application root as application.phone_number by a previous dialog.

I've also added a <property> element on line 4. VoiceXML properties control various aspects of how a dialog behaves. This particular property sets the minimum confidence level that the ASR must achieve before it accepts an utterance as recognized. Values range from 0.0 to 1.0, that is, from 0% to 100% confidence; a value of 1 tells the ASR that it must be 100% confident that it has recognized an utterance. Most ASRs are set to 0.5 (or 50%) by default. I have lowered the threshold to 0.3 (30%) so that the ASR rejects fewer correct matches (false negatives). I chose this value after testing the grammar for a while in Nuance V-Builder. I found that while the confidence level fell below 50% when there was background noise, the ASR still produced accurate results the majority of the time.
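As a minimal sketch of the setting described above, the fragment below uses the standard VoiceXML property name, confidencelevel, with the 0.3 threshold. Placing it at document level is my assumption; a property can also be set inside a form or a field, and the form body here is only a placeholder, not the contents of take_order.vxml.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Accept recognition results scored 0.3 or higher.
       The VoiceXML default threshold is 0.5. -->
  <property name="confidencelevel" value="0.3"/>

  <form id="take_order">
    <!-- Placeholder: the grammar, initial element, and fields go here. -->
  </form>
</vxml>
```

When a result scores below the threshold, the interpreter treats it as unrecognized and throws a nomatch event, so lowering the value trades fewer rejections for a higher risk of accepting a wrong match.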

This is a mixed-initiative dialog, meaning that a user can fill in multiple fields with a single utterance. The <initial> element on lines 7 through 11 provides this functionality. This section of the application executes first and tries to match the grammar referenced on line 6. I've made some significant changes to the PIZZA subgrammar in the PIZZA.grammar file since the earlier edition (view source). The reason for the change has to do with the many variations that a customer might use to order a pizza. After coding and testing about 20 additional variations, I realized that the grammar was ripe for consolidation using the positive closure (+) and Kleene closure (*) operators. A positive closure matches one or more occurrences of the phrase located to the right of the operator. A Kleene closure matches zero or more occurrences of the phrase to its right.
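As a side illustration of the two operators, here is a toy inline grammar that is not taken from PIZZA.grammar. It assumes a platform that accepts inline GSL inside a <grammar> element; the form name, field name, and word lists are made up for the example, and slot-filling return values are omitted to keep it short.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="closure_demo">
    <field name="toppings">
      <!-- Hypothetical inline GSL grammar; some platforms also
           require a type attribute on the grammar element. -->
      <grammar>
        <![CDATA[
          ; [a b] matches ONE of the alternatives; (a b) matches a, then b.
          ; +X matches X one or more times; *X matches X zero or more times.
          (
            +[pepperoni mushrooms olives]  ; at least one topping word
            *[please thanks]               ; any number of closing words
          )
        ]]>
      </grammar>
      <prompt>Which toppings would you like?</prompt>
    </field>
  </form>
</vxml>
```

Under these assumptions, "pepperoni", "pepperoni olives", and "mushrooms please" would all match, while "please" on its own would not, because the positive closure requires at least one topping word first.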

Line 6 is listed below:

+([SIZE TYPE TOPPINGS] *[pizza with])

The + (or positive closure) operator enables this subgrammar to match numerous variations of a pizza order. A customer can start with pizza size, type, or toppings, optionally followed by the words "pizza" and/or "with". This grammar will match any of the following utterances:

  • small hand tossed pepperoni pizza
  • deep dish large mushroom and pepperoni pizza
  • small pepperoni
  • pizza with olives and mushrooms

The number of possible utterances that this grammar will match is too high to count (for me at least). One of the side effects of this more open grammar is that ASR confidence for matches went down from 60%-85% to as low as 40%. The rate of incorrect matches also rose in some cases where I was not speaking directly into the microphone or did not speak clearly. After lowering the confidence property and tuning the grammar a bit, I decided that the greater breadth of possibilities was worth the tradeoff.

Of course, if the utterance fills only some of the form fields, the application can prompt for the unfilled values separately. As grammars become more flexible, it may be necessary to process the matched text to check whether the ASR actually returned a false match. This requires some fancy text processing and/or natural language processing techniques, which we'll save for another time.

Yet another difference is the fact that we are using pre-recorded prompts instead of synthesized speech from the TTS engine. This really enhances the quality and usability of the application.

Once we've filled all the fields in the form, the input is sent to the save_order.asp script.
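To tie these pieces together, here is a rough sketch of how such a form might be laid out. It is not the actual take_order.vxml. The field names, audio file paths, the #NAME grammar references, and the use of POST are all assumptions for illustration, and the sketch presumes that the form-level grammar returns slots named after the fields.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="take_order">
    <!-- Form-level grammar: one utterance may fill several fields. -->
    <grammar src="PIZZA.grammar#PIZZA"/>

    <!-- Visited first; invites a free-form order. -->
    <initial name="start">
      <prompt>
        <!-- The text inside audio is spoken only if the
             recording cannot be played. -->
        <audio src="audio/order.wav">What would you like to order?</audio>
      </prompt>
      <nomatch count="2">
        <!-- Stop trying mixed initiative and fall through to
             the individual fields below. -->
        <assign name="start" expr="true"/>
      </nomatch>
    </initial>

    <!-- Each field is prompted only if it is still unfilled. -->
    <field name="size">
      <grammar src="PIZZA.grammar#SIZE"/>
      <prompt><audio src="audio/size.wav">What size?</audio></prompt>
    </field>
    <field name="type">
      <grammar src="PIZZA.grammar#TYPE"/>
      <prompt><audio src="audio/type.wav">What type of crust?</audio></prompt>
    </field>
    <field name="toppings">
      <grammar src="PIZZA.grammar#TOPPINGS"/>
      <prompt><audio src="audio/toppings.wav">Which toppings?</audio></prompt>
    </field>

    <!-- Runs once every listed field has a value. -->
    <filled mode="all" namelist="size type toppings">
      <submit next="save_order.asp" namelist="size type toppings" method="post"/>
    </filled>
  </form>
</vxml>
```

The design point is that the interpreter skips any field already filled by the initial utterance, so a caller who says only "small pepperoni" would be asked just for the crust type before the order is submitted.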

This article was originally published on October 12, 2002
