It’s Only Natural: Evaluating Natural Language Dialogs

Whether you choose a natural dialog or a directed dialog for an IVR
application will directly affect the cost, development effort and
maintenance burden of the system. This article gives you a process
that you can use to make the right decision.

Natural Dialogs versus Directed Dialogs

A natural dialog is one in which the prompts, grammars and
dialog flow are modeled and designed to more closely simulate a real
conversation between two people. Natural dialogs allow the human to
participate in controlling the dialog flow. Directed dialogs, by
contrast, follow a pre-defined set of steps that usually occur in a
sequential, linear fashion.

Directed dialogs are modeled in a dialog-flow fashion, similar to
a call-flow for touch tone IVRs. Natural dialogs, on the other hand,
typically utilize a finite state model where dialogs are executed
based on the state of one or more variables.
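
As a rough illustration of the state-driven model, the sketch below picks the next dialog by inspecting which form variables are still empty, rather than following a fixed sequence. The slot names and the `ask_` naming scheme are assumptions made up for this example; they do not come from any particular platform.

```python
from typing import Optional

# Hypothetical slot names for a car-listing form; None means "not yet collected".
SLOT_ORDER = ["year", "make", "model"]


def next_dialog(state: dict) -> Optional[str]:
    """Return the name of the next dialog to run, or None when the form is complete."""
    for slot in SLOT_ORDER:
        if state.get(slot) is None:
            # The first empty slot decides which dialog runs next.
            return f"ask_{slot}"
    return None
```

Because the choice depends only on the current state, a caller who volunteers the make and model up front is never asked for them again; only the year dialog would run.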

There isn’t a clear dividing line between directed dialogs and
natural dialogs, however, nor is there an agreed-upon approach to
developing them. A directed dialog may certainly use a mixed
initiative dialog as a shortcut mechanism for power users, and a
natural dialog will certainly include directed dialogs.

Developers usually differentiate between directed and natural
dialogs in terms of whether or not the dialog is mixed initiative. A
mixed initiative dialog allows callers to fill multiple form field
slots in a single utterance. For example, the utterance
"nineteen ninety ford escort" might fill the year, make and model
fields at once, instead of requiring three separate prompts. This
illustrates one way to describe a natural dialog versus a directed
dialog: the dialog is natural not necessarily because it’s mixed
initiative, but because speaking all three pieces of information
together is probably what you would do in a phone conversation if
you were describing the type of car you were selling.
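
To make the slot-filling idea concrete, here is a minimal sketch that fills the year, make and model fields from one utterance. The tiny vocabularies and the `fill_slots` function are assumptions for illustration only; a real recognizer would work from a grammar rather than simple word lookups.

```python
# Hypothetical, deliberately tiny vocabularies; a real grammar would be far larger.
YEARS = {"nineteen ninety": 1990, "nineteen ninety one": 1991}
MAKES = {"ford", "honda"}
MODELS = {"escort", "civic"}


def fill_slots(utterance: str) -> dict:
    """Fill as many of the three slots as the utterance allows; unfilled slots stay None."""
    text = utterance.lower()
    words = set(text.split())
    slots = {"year": None, "make": None, "model": None}

    # Try the longest year phrase first so "nineteen ninety one" beats "nineteen ninety".
    for phrase in sorted(YEARS, key=len, reverse=True):
        if phrase in text:
            slots["year"] = YEARS[phrase]
            break

    slots["make"] = next(iter(words & MAKES), None)
    slots["model"] = next(iter(words & MODELS), None)
    return slots
```

Given "nineteen ninety ford escort", all three slots are filled in one pass. Given only "ford", the year and model stay empty, and the application would fall back to prompting for them one at a time, which is where the directed style comes back in.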

What is important to understand is that while researchers have
tested various natural dialog approaches, there is no "right
way" or set of guidelines that will help you create natural
dialogs. I think this may have something to do with the fact that
scientists don’t understand how humans produce and process natural
speech in enough detail to effectively model it on a computer. So,
remember, the definition of a natural dialog is a bit ambiguous and
certainly open to interpretation. In some cases, a directed dialog
may in fact be more natural than a mixed initiative dialog.

Evaluate the Existing Environment

I see three common environments that affect design decisions:

The company has an existing touch-tone IVR that they want
to upgrade to speech to reduce call times or reduce the number of
callers that bail out of the IVR to a live rep.

In this situation, introducing a new approach to callers who use
the system regularly may cause more harm than good. While the number
of new callers pressing zero may fall, experienced callers may
become frustrated with the new interface and end up bailing out. If
repeat callers make up a high percentage of call volume, the end
result could be more callers bailing out than before. Great care
must be taken in transitioning to a new system. A recommended
approach is to introduce speech into the application gradually, in a
way that lets callers become acclimated to the new interface. Of
course, the speech interface must be as good as or better than the
existing touch-tone system. For example, if the speech system is
nothing more than a voice-activated menu, callers may become
frustrated with the new interface when they experience recognition
errors (which rarely occur with a touch-tone system). When upgrading
a touch-tone system, a natural language system may be too radical
a change for users and could result in more bail-outs.
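
The arithmetic behind that risk is a weighted average of the bail-out rates for repeat and new callers. The function below is a sketch under that assumption, and every number used with it here is invented for illustration, not measured data.

```python
def overall_bailout_rate(repeat_share: float, repeat_rate: float, new_rate: float) -> float:
    """Blend the bail-out rates of repeat and new callers by their share of call volume.

    repeat_share: fraction of all callers who are repeat callers (0.0 to 1.0)
    repeat_rate:  fraction of repeat callers who bail out to a live rep
    new_rate:     fraction of new callers who bail out to a live rep
    """
    return repeat_share * repeat_rate + (1.0 - repeat_share) * new_rate
```

Suppose 80% of callers are repeat callers. With a touch-tone system where 10% of repeat callers and 40% of new callers bail out, `overall_bailout_rate(0.8, 0.10, 0.40)` comes to about 16%. If the speech system halves the new-caller rate to 20% but frustrates repeat callers enough that 20% of them bail out too, `overall_bailout_rate(0.8, 0.20, 0.20)` comes to about 20%, which is worse overall even though new callers are better served.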

The speech IVR system is going to automate what a real person
does on the phone now.

In this case it will be important to analyze the existing call flow.
Unless the calls are entirely scripted, most human dialogs tend to
be composed of a series of open-ended dialogs that each have a goal
or milestone, which may also depend on the results of a previous
dialog. When analyzing the calls, you want to be listening for these
dialog milestones. You can usually spot a dialog milestone by
listening for a dialog transition. A dialog transition occurs when
one participant changes the topic or focus of the conversation. For
example, "Ok, now I need to get your credit card
information". When a transition occurs, it usually signals the
end of a previous dialog. Dialogs will normally have data points in
which one participant communicates a piece of information that
translates into a form field value. This sounds easy enough because
you can translate this into VoiceXML dialogs. The difficulty is that
human speech is usually much more complicated. A caller may change
their mind about an earlier data point in the middle of a later
dialog, after the transition has already occurred. For example,
"Actually, I need two widgets instead of one, and I want to pay
with a Visa card, but only if you can ship it overnight, otherwise
I’d like to pay COD with something like UPS". This instruction
is easy enough for a human to process, but incredibly complex for a
computer. Another caller might flip back and forth between logical
dialogs in a conversation.

There are two things to consider when evaluating how to automate
an existing human dialog interface. First, will it be feasible to
break down the human dialog interaction into discrete directed
dialog components so that callers will be able to communicate the
same information to the computer instead of a person in about the
same amount of time? When analyzing the conversation, it may seem at
first that this would be impossible, but as you listen to many
dialogs, you will be able to identify common dialogs that can be
broken down into discrete dialog components. 

Second, will a directed dialog be usable from the caller’s
perspective? Will the directed dialog be so different from the human
dialog that callers will become confused and simply not use the
system? Maybe not. If the human agent asks the same series of
questions on every call, then a directed dialog would actually be
more natural than trying to consolidate the conversation into a
mixed initiative dialog.

On the other hand, if the entire conversation is dynamic and
unscripted, it may actually be impossible to create a directed
dialog. In these cases, creating a series of open-ended mixed
initiative dialogs may make the most sense.

The IVR is a new application that will stand on its own or
extend a Web application.

This is potentially the most difficult environment to work in,
because you have to make a lot of assumptions about how the speech
dialog "should" occur, whereas in the previous two
scenarios there are existing calls that can be analyzed.
Additionally, when integrating an IVR system with an existing Web
application, some people tend to think of the IVR as a telephone
mirror of the Web application, when in fact the two interfaces may
need to be very different. The one
advantage of integrating a speech IVR with an existing Web
application is that somebody has already gone through the pain of
breaking down the business logic into a programmatic structure.
There is also a benefit to not having an existing call flow to
analyze: there isn’t a legacy of caller expectations that has
to be considered in deciding whether or not to use a natural mixed
initiative dialog. If it turns out that an open-ended dialog will be
the most natural and efficient way to build the application, we
are free to do that (within resource and time constraints, of
course).

What is Feasible?

As we’ve already discussed, some applications are naturally
suited to become directed dialogs. Others that are open or
conversational may require a mixed initiative dialog. In some cases,
it may simply be impractical to build the application in one of the
two styles. This should be identified early on in the process. If
that happens, then the choice of style is clear and it becomes a
matter of whether the end result will justify the cost and effort.

Examine the Difference in Effort

In terms of measuring the difference in effort between natural
dialogs and directed dialogs, we actually need to think about
several different factors.

The average mixed initiative natural dialog will take several
orders of magnitude longer to develop and maintain than a directed
dialog. The skill requirements for developing a natural mixed
initiative dialog are also steep and include some knowledge of
linguistics and speech recognition.

The four areas of development that we can compare are:

  • grammars
  • prompts
  • error handling
  • maintenance

A directed grammar will contain a rather limited number of
possible utterances. For example, when you ask a caller for their
credit card type and give them a list of their options, there are
only a handful of possible responses. However, if you ask the caller
a more open ended question like, "How would you like to pay for
this?", the number of possibilities goes up quite dramatically.
Writing a grammar for an open ended question requires us to
represent all of the possible answers that we might get from the
caller. Even for such a simple question, this is no small task.
In my humble opinion, however, programming natural dialogs is more
about how you handle recognition errors than actually focusing on
catching every possible utterance. In either case, natural language
grammars will always be larger and thus, will take more time to
develop.
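
One way to act on that point about error handling is to open with the natural question and fall back to a directed prompt when recognition fails. The sketch below assumes a stand-in `recognize` function and invented prompt wording; it is not tied to any real recognizer or platform API.

```python
OPEN_PROMPT = "How would you like to pay for this?"
DIRECTED_PROMPT = "Please say Visa, MasterCard, or C O D."

# Hypothetical set of payment types the application accepts.
PAYMENT_TYPES = {"visa", "mastercard", "cod"}


def recognize(utterance: str):
    """Stand-in for a speech recognizer: return a payment type, or None on no-match."""
    for word in utterance.lower().replace(".", "").split():
        if word in PAYMENT_TYPES:
            return word
    return None


def collect_payment(utterances, max_attempts=3):
    """Play the open prompt first, then the directed prompt after each no-match.

    Returns (payment_type, prompts_played); payment_type is None if every
    attempt fails, at which point a real system might transfer to a live rep.
    """
    prompts_played = []
    for attempt, utterance in enumerate(utterances[:max_attempts]):
        prompts_played.append(OPEN_PROMPT if attempt == 0 else DIRECTED_PROMPT)
        result = recognize(utterance)
        if result is not None:
            return result, prompts_played
    return None, prompts_played
```

A caller who answers "I'll pay with Visa" is done after the open prompt. A caller who answers "um whatever is cheapest" gets the directed prompt next, so the open grammar does not have to anticipate every possible phrasing.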

Prompts in directed dialogs should be clear enough to eliminate
most ambiguities. Doing so limits the number of error handlers you
have to write. However, in a natural dialog, you will have to write
error handlers and prompts to go along with each possible utterance.
This will of course require more programming time and more time in
the recording studio to record the prompts.

A natural mixed initiative dialog will also require more
maintenance, because the grammars and prompts must be regularly
tuned to account for utterances that haven’t already been accounted
for. Directed dialogs need the same kind of maintenance, but less
frequently and with less work each time.

What Value in Natural Dialogs?

Sure, natural dialogs are cool compared to touch-tone or directed
speech menus, but coolness is not a final measure of whether a
natural dialog should be employed. Yes, people who are not familiar
with speech recognition may have an initial wow factor, but that
inevitably wears off and then the question is, why and when is a
natural dialog better?

The measure I use is this: if a natural dialog isn’t better from
a usability standpoint (which translates into fewer bail-outs) or
faster (callers get the job done more quickly, which reduces call
time) when compared to a directed dialog, then go with the directed
dialog, which is quicker, easier and cheaper to build.
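
Expressed as code, that measure might look like the sketch below. The function name, its parameters and the idea of comparing measured bail-out rates and average call times from a pilot are assumptions for illustration.

```python
def choose_dialog_style(natural_bailout: float, directed_bailout: float,
                        natural_seconds: float, directed_seconds: float) -> str:
    """Default to a directed dialog unless the natural dialog wins on at least one measure.

    Bail-out values are fractions of callers who leave for a live rep;
    the seconds values are average call durations.
    """
    fewer_bailouts = natural_bailout < directed_bailout
    faster_calls = natural_seconds < directed_seconds
    if fewer_bailouts or faster_calls:
        # Still weigh the extra development and maintenance cost before committing.
        return "consider natural"
    return "directed"
```

A tie on both measures returns "directed", since the directed dialog is the cheaper one to build.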

Conclusion

The difference in effort between directed dialogs and natural mixed
initiative dialogs can easily be an order of magnitude of 3 or more,
so care should be taken in making your decision, especially where
cost and time are concerned. This article should give you some ideas
on where
to look in evaluating which approach makes sense in your case. If
you’re still not sure which approach to take after reading this, or
if you still have more questions, send me an email, [email protected]

About Jonathan Eisenzopf


Jonathan is a member of the Ferrum Group, LLC, which specializes in Voice Web consulting and training. Feel free to send an
email to [email protected]
regarding questions or comments about this or any article.
