VoiceXML Developer Series: A Tour Through VoiceXML, Part V

In the last edition of the VoiceXML Developer, we created
a full VoiceXML application using form fields, a subdialog, and internal
grammars. In this edition, we will learn more about one of the most
important but rarely covered components of a VoiceXML application:
grammars.

Overview

Now that we’ve built a few applications, it’s time to talk about grammars.
Grammars tell the speech recognition software the combinations of words and
DTMF tones that it should be listening for. Grammars intentionally limit what
the ASR engine will recognize. Recognizing speech without
the constraint of grammars is called “continuous speech recognition,” or CSR. IBM’s
ViaVoice is an example of a product that uses CSR
technology to let a user dictate an email or a document.
While CSR technologies have improved, they’re not accurate enough to
use without the user training the system to recognize their voice. Also,
the success rate of recognition in noisy environments, such as over a cell phone
or in a crowded shopping mall, is reduced greatly. Pre-defining the scope
of words and phrases that the ASR engine should be listening for can increase
the recognition rate to well over 90%, even in noisy environments. The VoiceXML
1.0 standard uses grammars to recognize spoken and DTMF input. It doesn’t,
however, define the grammar format. This is changing with the
release of VoiceXML 2.0, which defines a standard XML-based grammar format and an
alternate BNF notation. Still, the fact that VoiceXML relies heavily on
grammars means that we must create or reuse grammars each time we want
to gather input from the user.

In fact, the time required to create, maintain, and tune VoiceXML grammars
will likely far exceed the time you spend developing
the VoiceXML interfaces. Not having
high-quality and complete grammars means that the user will spend too much
of their time repeating themselves. A system that cannot recognize input
the first time, every time, will alienate users and cause them to abandon
the system altogether. Therefore, we are going to spend a bit of time
talking about grammars for VoiceXML 1.0 (and now VoiceXML 2) in the
coming articles so that you will be armed with the knowledge you need
to create successful VoiceXML applications. The first grammar format we
are going to learn is GSL, which is used by the Nuance line of products.

Grammar Scopes

The ASR engine activates grammars based upon
the scope in which the grammar was declared and the current scope of
the VoiceXML interpreter. Declaring a grammar in the root document
means that the grammar will be active throughout the execution of the
VoiceXML application. A good use of this technique is a root
grammar that defines global voice commands, such as “operator” for connecting
to an operator or “goodbye” to exit the call.
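
As a rough sketch of the root-grammar idea, an application root document might declare its global commands inside <link> elements, so that their grammars stay active in every document that belongs to the application. The file names root.vxml, operator.vxml, and goodbye.vxml are hypothetical, and the text/gsl type simply follows the inline grammar examples in this article; your platform may expect something different.

```xml
<?xml version="1.0"?>
<!-- root.vxml (hypothetical name): the application root document -->
<vxml version="1.0">

  <!-- Active everywhere: send the caller to an operator dialog -->
  <link next="operator.vxml">
    <grammar type="text/gsl">
      <![CDATA[ [operator] ]]>
    </grammar>
  </link>

  <!-- Active everywhere: let the caller end the call -->
  <link next="goodbye.vxml">
    <grammar type="text/gsl">
      <![CDATA[ [goodbye] ]]>
    </grammar>
  </link>

</vxml>
```

Each of the other documents in the application would then point at this file through the application attribute of its <vxml> element, for example application="root.vxml".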

We can also have grammars that are active within a particular document,
form, field, or menu. Field grammars are used most often, wherever we need
to collect a specific type of information, such as a phone number, address,
or social security number. What you don’t want is to have all grammars
active at the same time, unless you are building a mixed-initiative dialog.
The more grammars that are active, the higher the chance that the ASR
will misinterpret what the
user is saying. For example, when we ask the user for their phone number,
only a global menu and the phone number grammars should be active. If the
social security grammar were active at the same time, the system may
accidentally recognize a social security number rather than a phone number.

When developing a mixed initiative dialog, this problem can become
especially tricky where we may have similar grammars active at the
same time. It’s especially important in this case to differentiate the
grammars in a way that minimizes the possibility of input being matched
by the wrong grammar.

Inline grammars versus external grammars

VoiceXML allows developers to include grammars directly in
VoiceXML documents using the <grammar> element.

<grammar type="text/gsl">
  <![CDATA[
    ([
      small
      medium
      large
     ])
  ]]>
</grammar>

The inline grammar above would match the words small,
medium, or large. The value that was matched by
the grammar would be returned and stored as the form field value.

An external grammar exists in a separate file, which is referenced
by the src attribute of the <grammar>
element.

<grammar src="PHONE.gsc" />

The <grammar> above would load the grammar named
PHONE.gsc.

Inline grammars are good for small VoiceXML applications that have simple grammars, but
should be avoided for larger applications that have multiple grammars.
First of all, you will likely be able to reuse grammars many times,
so it’s best to keep them in an external file where you can easily
access them from within other applications. Secondly, you may find yourself tuning
the grammars more or less frequently than the VoiceXML content, so
it’s a good idea to componentize your VoiceXML applications to minimize
errors that could result from editing a grammar embedded in a VoiceXML file.
Other than their location, inline grammars work just like external grammars.

Example 4

We will be referring to this example
in the rest of this article. To
test this application, call VoiceXML Planet at 510-315-6666;
press 1 to listen to the demos, then press 4 to hear this example. The example
is an application for Joe’s Pizza Palace. Joe’s store gets overloaded
with pizza pie orders during the lunch hour. Joe doesn’t want to hire more staff
to take phone orders just for lunch, but he does want to give his customers who
call in their orders the opportunity to place their order automatically. This is
especially desirable for repeat customers who order pizzas for their office lunches
and meetings on a regular basis. This first version of the application collects the
information for one pizza order and submits it to a back end ASP script for processing.
The information that the store needs to place an order is the customer’s phone number,
the size and type of the pizza, and the toppings.

An example dialog for the application might be as follows:

Computer: Joe's pizza palace. May I have your phone number please.
Customer: huh?
Computer: Sorry, I didn't get that. Please say your phone number.
Customer: 7 0 3 5 5 5 1 2 1 2.
Computer: I heard 7 0 3 5 5 5 1 2 1 2. Would you like a hand tossed, deep dish, or stuffed crust pizza?
Customer: Deep dish.
Computer: I heard deep dish. Would you like a small, medium, or large?
Customer: Large.
Computer: I heard large. What toppings would you like on your deep dish pizza?
Customer: Pepperoni and mushrooms and anchovies.
Computer: I heard pepperoni and mushrooms and anchovies.
Computer: I have a large deep dish pizza with pepperoni and mushrooms and anchovies. Your order will be delivered within thirty minutes or the pizza is free. Thanks for calling Joe's pizza palace.

Once the order has been confirmed, the <submit> element sends the form field values
to placeOrder.asp using an HTTP POST.
The example contains two inline grammars
and two external grammars, which are used
to recognize spoken input. The two inline grammars occur on lines 23-29 and 41-47.
The two external grammars occur on lines 10 and 59.
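
The full listing for Example 4 is not reproduced in this section, so the following is only a minimal sketch of how such a form might be laid out, not the author's actual code. The field names phone, pizza_type, size, and toppings, the PHONE.gsc file, and placeOrder.asp come from this article; the form id and the TOPPINGS.gsc file name are assumptions. Error handling and the confirmation prompts are left out to keep it short.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">

    <!-- External grammar: PHONE.gsc is the file named in this article -->
    <field name="phone">
      <grammar src="PHONE.gsc"/>
      <prompt>Joe's pizza palace. May I have your phone number please.</prompt>
    </field>

    <!-- Inline phrase grammar -->
    <field name="pizza_type">
      <grammar type="text/gsl">
        <![CDATA[([
          ( hand tossed )
          ( deep dish )
          ( stuffed crust )
        ])]]>
      </grammar>
      <prompt>Would you like a hand tossed, deep dish, or stuffed crust pizza?</prompt>
    </field>

    <!-- Inline keyword grammar -->
    <field name="size">
      <grammar type="text/gsl">
        <![CDATA[([ small medium large ])]]>
      </grammar>
      <prompt>Would you like a small, medium, or large?</prompt>
    </field>

    <!-- External grammar: TOPPINGS.gsc is an assumed file name -->
    <field name="toppings">
      <grammar src="TOPPINGS.gsc"/>
      <prompt>What toppings would you like on your <value expr="pizza_type"/> pizza?</prompt>
    </field>

    <!-- Runs once every field above has a value -->
    <block>
      <submit next="placeOrder.asp" method="post"
              namelist="phone pizza_type size toppings"/>
    </block>

  </form>
</vxml>
```

The namelist attribute lists the field variables that are sent along with the POST, which is why every name in it must match a field declared in the form.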

Keyword grammars


Let’s take a look at the inline
grammar on lines 41-47 first. This is probably the simplest form of a grammar. It
contains three words, each representing a different selection. The ASR will attempt
to recognize one of these three words after the prompt is played on line 48. If none
of the words is recognized or if the user doesn’t say anything, the <catch>
element on lines 49-51 will tell the user that there was a problem and play the
prompt again until the user says one of the options: small, medium,
or large. Once the user provides valid input, the <filled> element
for the size field is executed on lines 52-56. Notice that this
grammar contains only single words rather than phrases.
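
To make that flow concrete, here is a hedged sketch of what a field like size could look like. It is not the original listing, so its line numbers will not match the ones cited above. The form id is made up, and the prompt and error wording are borrowed from the sample dialog.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="size_only">
    <field name="size">
      <!-- Keyword grammar: exactly one of these three words -->
      <grammar type="text/gsl">
        <![CDATA[
          ([
            small
            medium
            large
          ])
        ]]>
      </grammar>

      <prompt>Would you like a small, medium, or large?</prompt>

      <!-- No match or silence: apologize, then play the prompt again -->
      <catch event="nomatch noinput">
        <prompt>Sorry, I didn't get that.</prompt>
        <reprompt/>
      </catch>

      <!-- Valid input: read the recognized value back to the caller -->
      <filled>
        <prompt>I heard <value expr="size"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```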

Phrase grammars

The second inline grammar on lines 23-29 works within the scope of the
pizza_type form field and will recognize one of three
phrases only:

  • hand tossed
  • deep dish
  • stuffed crust

The three phrases are surrounded by parentheses. This indicates that
all words inside the parentheses must be spoken for a match to occur.
We can mark a word in the phrase as optional by prefixing it with
a ? character. For example, to make hand, deep, and
crust optional, we would change the grammar so it looked like the following:

<![CDATA[([
       ( ?hand tossed )
       ( ?deep dish )
       ( stuffed ?crust )
])]]>

So if the user just said “tossed”, we would match hand tossed.
We can add alternatives for each selection as well. For example, someone might
say “Chicago” instead of deep dish. We might also want to allow
someone to specify hand thrown or hand stretched as alternatives
to hand tossed. We can do this by specifying the options inside a
set of square brackets.

<![CDATA[([
       ( ?hand [tossed stretched thrown] ) 
       ( ?deep [dish chicago] )
       ( stuffed ?crust ) 
])]]>

Subgrammars

Now we’re going to take a look at the external grammar that we reference on
line 10, which is used to recognize the user’s phone number. This particular
grammar is made up of several subgrammars that recognize the area code,
exchange (the first 3 digits of the local phone number), and the last four
digits of the phone number. These subgrammars, or phone number parts, are
referenced in the PHONE grammar on lines 1-6.
This grammar is listed below. The PHONE grammar matches a number
when the AREA_CODE, EXCHANGE, and
NUMBER grammars are matched in that order, since they’re
inside a set of parentheses, which requires that all elements of the grammar
match. Line 6 concatenates the three phone number components together
as a single number and returns the number to the field, which uses the
number as the value for the phone. Notice that each subgrammar
called on lines 2-4 includes a colon and a second string, which names a local
variable to store the result of the subgrammar. For example,
on line 2, we call the AREA_CODE subgrammar and store
the resulting number that was matched in the $area variable.
These variables are referenced later on line 6, which returns the phone
number. Line 6 utilizes the strcat() function to piece
the numbers into one number. The strcat() function
takes two parameters, the second of which will be concatenated to the first.
To concatenate all three number segments, we join $exchange and $number
in an inner strcat() function call with an outer
call, which joins the results of the inner call with $area.

The AREA_CODE grammar on lines 8-13 is made up of exactly
three DIGITs. The DIGIT grammar on lines
30-41 consists of a single number, zero through nine. Zero can either be
pronounced zero or oh. Similarly, the EXCHANGE
grammar is made up of three DIGITs, while the NUMBER
grammar is made up of four DIGITs.

1  PHONE [
2     ( AREA_CODE:area
3       EXCHANGE:exchange
4       NUMBER:number
5     )
6  ] { return(strcat($area strcat($exchange $number))) }
7  
8  AREA_CODE [
9    ( DIGIT:a
10      DIGIT:b
11      DIGIT:c
12    ) { return(strcat($a strcat($b $c))) }
13  ]
14  
15  EXCHANGE [
16    ( DIGIT:a
17      DIGIT:b
18      DIGIT:c
19    ) { return(strcat($a strcat($b $c))) }
20  ]
21  
22  NUMBER [
23    ( DIGIT:a
24      DIGIT:b
25      DIGIT:c
26      DIGIT:d
27    ) { return(strcat(strcat($a $b) strcat($c $d))) }
28  ]
29  
30  DIGIT [
31    [zero oh] {return(0)}
32    one   {return(1)}
33    two   {return(2)}
34    three {return(3)}
35    four  {return(4)}
36    five  {return(5)}
37    six   {return(6)}
38    seven {return(7)}
39    eight {return(8)}
40    nine  {return(9)}
41  ]

As you can see from the example above, more complex grammars are made up
of subgrammars, which may subsequently call on other subgrammars, so that
we can match any form of speech by breaking the possibilities down into
their most elementary components. You might also be surprised at how large
our grammar turned out to be for a simple phone number. In fact, dealing
with numbers can be a lot more difficult than dealing with words.

Lists in grammars

In the grammar referenced on line 59, we must be able to match one or
more toppings without knowing exactly how many toppings the user will select.
What we do know is what the available toppings are. Fortunately, GSL includes
a number of builtin list operators to make this possible.

1  TOPPINGS [
2    +( TOPPING:topping {insert-end(list $topping)} )
3  ] {return($list)}
4  
5  TOPPING [
6    (?and pepperoni)
7    (?and olives)
8    (?and green peppers)
9    (?and mushrooms)
10   (?and pineapple)
11   (?and anchovies)
12 ] {return($string)}

The TOPPINGS grammar above begins with a
+ sign outside of a set of parentheses. This means: match
one or more occurrences of the TOPPING grammar.
The second part of line 2 calls the builtin insert-end
function, which adds the new topping that was matched in the
TOPPING grammar to the list of toppings that will
be returned to the toppings form field in the
VoiceXML document.

The TOPPING grammar on lines 5-12 contains our
toppings selections: pepperoni, olives, green peppers, mushrooms,
pineapple, and anchovies. We’re also expecting that the user might
separate their selections with the word and, which has
been flagged as an optional word by prefixing it with a ?
character. That concludes our exploration of GSL grammars for now.

Conclusion

I want to reflect on some of the things that I’ve learned as I’ve been
developing new VoiceXML applications over the past year as it relates to
grammars. First, grammars can be difficult to develop and time consuming
to tune. And things don’t stop there. You will probably need to tune the
dictionary that the system is using to include alternate word pronunciations
as you begin to collect data on where the ASR application is failing.
It’s very important that the application
be able to recognize what the user is saying most of the time. Because
DTMF input is almost 100% accurate, it should be preferred over speech for
things like phone and credit card numbers. However, some voice interface
designers recommend that you don’t mix touch-tone input with speech
input. I’d say mixing is better than the alternative if you are having problems
recognizing number sequences. Remember, speech recognition has
gotten much better, but it still takes a great deal of work and care to
reach the success rates in the high 90s that vendors often mention.
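
If you do fall back to touch-tone for the phone number, one low-effort option is the builtin digits field type from the VoiceXML 1.0 specification, which is defined to accept keyed as well as spoken digits. This is only a sketch: the form id and the prompt wording are made up, platform support for builtin types varies, and the type does not by itself switch speech recognition off.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="phone_by_keypad">
    <!-- type="digits" relies on the platform's builtin digit-string grammar,
         so no custom grammar file is needed for this field -->
    <field name="phone" type="digits">
      <prompt>Please enter your ten digit phone number on your keypad.</prompt>
      <filled>
        <prompt>I heard <value expr="phone"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```
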
Thanks again for joining us for another edition of the VoiceXML Developer.
In the next edition of the VoiceXML Developer, we will continue our
exploration into grammars as part of our tour of the VoiceXML 1.0 specification.
And don’t forget to send me feedback on this series. I’d like to know
how I’m doing and how I can improve this column. You can send feedback
directly to [email protected].
Until next time.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, a firm based in Reston, Virginia
that specializes in Voice Web consulting and training. He has also written
articles for other online and print publications including WebReference.com
and WDVL.com. Feel free to send an email to [email protected] regarding
questions or comments about the VoiceXML Developer series, or for more
information about training and consulting services.
