VoiceXML Developer Series: A Tour Through VoiceXML, Part V
In the last edition of the VoiceXML Developer, we created a full VoiceXML application using form fields, a subdialog, and internal grammars. In this edition, we will learn more about one of the most important but rarely covered components of a VoiceXML application: grammars.
Now that we've built a few applications, it's time to talk about grammars. Grammars tell the speech recognition software the combinations of words and DTMF tones that it should be listening for. Grammars intentionally limit what the ASR engine will recognize. The method of recognizing speech without the burden of grammars is called "continuous speech recognition" or CSR. IBM's ViaVoice is an example of a product that uses CSR technology to let a user dictate an email or a document. While CSR technologies have improved, they're not accurate enough to use without the user first training the system to recognize their voice. Also, the recognition success rate drops sharply in noisy environments, such as over a cell phone or in a crowded shopping mall. Pre-defining the scope of words and phrases that the ASR engine should be listening for can increase the recognition rate to well over 90%, even in noisy environments.
The VoiceXML 1.0 standard uses grammars to recognize spoken and DTMF input. It doesn't, however, define the grammar format. This is changing with the release of VoiceXML 2, which defines a standard XML-based grammar format along with an alternate BNF-style notation. Still, the fact that VoiceXML relies heavily on grammars means that we must create or reuse grammars each time we want to gather input from the user.
In fact, the time required to create, maintain, and tune VoiceXML grammars will likely be many times greater than the time you will take to develop the VoiceXML interfaces. Without complete, high-quality grammars, users will spend too much of their time repeating themselves. A system that cannot recognize input the first time, every time, will alienate users and cause them to abandon the system altogether. Therefore, we are going to spend a bit of time talking about grammars for VoiceXML 1.0 (and now VoiceXML 2) in the coming articles so that you will be armed with the knowledge you need to create successful VoiceXML applications. The first grammar format we are going to learn is GSL, which is used by the Nuance line of products.
The ASR engine activates grammars based upon the scope in which the grammar was declared and the current scope of the VoiceXML interpreter. Declaring a grammar in the root document means that the grammar will be active throughout the execution of the VoiceXML application. A good use for this technique is a root grammar that defines global voice commands, such as "operator" for connecting to an operator or "goodbye" to exit the call.
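As a rough sketch of the root-grammar technique, the two documents below show one way global commands might be declared. The file names root.vxml, operator.vxml, goodbye.vxml, and order.vxml are hypothetical, and the GSL simply mirrors the bracket shape of the inline grammar used elsewhere in this article; check your platform's documentation for the exact grammar syntax it accepts.

```xml
<?xml version="1.0"?>
<!-- root.vxml (hypothetical name): the application root document -->
<vxml version="1.0">
  <!-- Link grammars declared in the root stay active in every document
       that names this file as its application root. -->
  <link next="operator.vxml">
    <grammar type="text/gsl"> <![CDATA[ ([ operator ]) ]]> </grammar>
  </link>
  <link next="goodbye.vxml">
    <grammar type="text/gsl"> <![CDATA[ ([ goodbye ]) ]]> </grammar>
  </link>
</vxml>
```

```xml
<?xml version="1.0"?>
<!-- order.vxml (hypothetical name): a leaf document of the same application -->
<vxml version="1.0" application="root.vxml">
  <form id="order">
    <block>
      <!-- "operator" and "goodbye" can be recognized here as well. -->
      <prompt>Welcome.</prompt>
    </block>
  </form>
</vxml>
```

The application attribute on the leaf document is what ties it to the root; without it, the root's link grammars would not be in scope.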
We can also have grammars that are active only within a particular document, form, field, or menu. Field grammars are the most common; we use them wherever we need to collect a specific type of information, such as a phone number, address, or social security number. What you don't want is to have all grammars active at the same time unless it is a mixed initiative dialog. The more grammars that are active, the higher the chance that the ASR will misinterpret what the user is saying. For example, when we ask the user for their phone number, only a global menu and the phone number grammars should be active. If the social security grammar were active at the same time, the system may accidentally recognize a social security number rather than a phone number.
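A minimal sketch of field-level scoping might look like the following. PHONE.gsc is the external grammar file name used in this article; SSN.gsc, the form id, and the prompt wording are assumptions made up for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="customer_info">
    <field name="phone">
      <prompt>Please say your phone number.</prompt>
      <!-- Active only while the interpreter is collecting this field. -->
      <grammar src="PHONE.gsc" />
    </field>
    <field name="ssn">
      <prompt>Please say your social security number.</prompt>
      <!-- SSN.gsc is a hypothetical file; it is not active during the
           phone field, so the two digit grammars never compete. -->
      <grammar src="SSN.gsc" />
    </field>
  </form>
</vxml>
```

Because each grammar is declared inside its own field, the interpreter listens for phone numbers and social security numbers at different times rather than simultaneously.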
When developing a mixed initiative dialog, this problem becomes especially tricky, because similar grammars may be active at the same time. In this case it's especially important to differentiate the grammars in a way that minimizes the possibility of input being matched by the wrong grammar.
Inline grammars versus external grammars
VoiceXML allows developers to include grammars directly in a VoiceXML document using the <grammar> element.
<grammar type="text/gsl"> <![CDATA[ ([ small medium large ]) ]]> </grammar>
The inline grammar above would match on the words small, medium, or large. The value that was matched by the grammar would be returned and stored as the form field value.
An external grammar exists in a separate file, which is referenced by the src attribute of the <grammar> element.
<grammar src="PHONE.gsc" />
The <grammar> above would load the grammar named PHONE.gsc.
Inline grammars are good for small VoiceXML applications that have simple grammars, but should be avoided for larger applications that have multiple grammars. First of all, you will likely be able to reuse grammars many times, so it's best to keep them in an external file where you can easily access them from within other applications. Secondly, you may find yourself tuning the grammars on a more or less frequent basis than the VoiceXML content, so it's a good idea to componentize your VoiceXML applications to minimize errors that could result from a change to a grammar in a VoiceXML file. Other than their location, inline grammars work just like external grammars.
We will be referring to this example in the rest of this article. To test this application, call VoiceXML Planet at 510-315-6666; press 1 to listen to the demos, then press 4 to hear this example. The example is an application for Joe's Pizza Palace. Joe's store gets overloaded with pizza pie orders during the lunch hour. Joe doesn't want to hire more staff to take phone orders just for lunch, but he does want to give customers who call in the opportunity to place their orders automatically. This is especially desirable for repeat customers who order pizzas for their office lunches and meetings on a regular basis. This first version of the application collects the information for one pizza order and submits it to a back-end ASP script for processing. The information that the store needs to place an order is the customer's phone number, the size and type of the pizza, and the toppings.
An example dialog for the application might be as follows:
Computer: Joe's pizza palace. May I have your phone number please.
Customer: huh?
Computer: Sorry, I didn't get that. Please say your phone number.
Customer: 7 0 3 5 5 5 1 2 1 2.
Computer: I heard 7 0 3 5 5 5 1 2 1 2. Would you like a hand tossed, deep dish, or stuffed crust pizza?
Customer: Deep dish.
Computer: I heard deep dish. Would you like a small, medium, or large?
Customer: Large.
Computer: I heard large. What toppings would you like on your deep dish pizza?
Customer: Pepperoni and mushrooms and anchovies.
Computer: I heard pepperoni and mushrooms and anchovies.
Computer: I have a large deep dish pizza with pepperoni and mushrooms and anchovies. Your order will be delivered within thirty minutes or the pizza is free. Thanks for calling Joe's pizza palace.
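To make the first exchange of this dialog concrete, here is a hedged sketch of how a single field could produce it. It is not the article's actual listing: the field name phone and the form id are assumptions, and only the prompt wording and the PHONE.gsc file name are taken from the text above.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <field name="phone">
      <prompt>Joe's pizza palace. May I have your phone number please.</prompt>
      <grammar src="PHONE.gsc" />
      <!-- Runs when the caller says something the grammar cannot match,
           such as "huh?"; the field is then visited again. -->
      <nomatch>
        <prompt>Sorry, I didn't get that. Please say your phone number.</prompt>
      </nomatch>
      <!-- Runs once the grammar has matched and the field holds a value. -->
      <filled>
        <prompt>I heard <value expr="phone"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The crust, size, and toppings questions would follow the same pattern, each in its own field with its own grammar.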
Once the order has been confirmed, the form field values are submitted via an HTTP POST method call to placeOrder.asp via the <submit> element.
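A submission of this kind could be sketched as the fragment below. The field names in namelist (phone, crust, size, toppings) are hypothetical stand-ins for whatever names the form's fields actually use; placeOrder.asp and the POST method come from the description above.

```xml
<!-- Placed inside the order form, after its fields have been filled. -->
<block>
  <!-- Sends the listed field variables to the back-end script as
       HTTP POST parameters. The names are assumed for illustration. -->
  <submit next="placeOrder.asp" method="post"
          namelist="phone crust size toppings"/>
</block>
```

Every name in namelist must refer to a variable that is declared in scope, which is why the block belongs in the same form as the fields it submits.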
The example contains two inline grammars and two external grammars, which are used to recognize spoken input. The two inline grammars occur on lines 23-29 and 41-47. The two external grammars occur on lines 10 and 59.