It's Only Natural: Evaluating Natural Language Dialogs
Your decision on whether to use a natural dialog approach instead of a directed dialog in an IVR application will directly affect the cost, effort and maintenance of the system. This article will give you a process that you can use to make the right decision.
Natural Dialogs versus Directed Dialogs
A natural dialog is one in which the prompts, grammars and dialog flow are designed to more closely simulate a real conversation between two people. Natural dialogs allow the human to participate in controlling the dialog flow. Directed dialogs, on the other hand, use a pre-defined set of steps and usually proceed in a sequential, linear fashion.
Directed dialogs are modeled in a dialog-flow fashion, similar to a call-flow for touch tone IVRs. Natural dialogs, on the other hand, typically utilize a finite state model where dialogs are executed based on the state of one or more variables.
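As a rough illustration of the difference, here is a minimal Python sketch. The field names and prompts are hypothetical and not taken from any particular platform. The directed version always plays the same prompts in the same order, while the state-driven version picks the next prompt by looking at which variables are still empty.

```python
# Hypothetical fields and prompts, for illustration only.
PROMPTS = {
    "year": "What year is the car?",
    "make": "What make is it?",
    "model": "What model is it?",
}


def directed_prompts():
    """Directed dialog: every caller hears the same prompts in the same order."""
    return [PROMPTS[field] for field in ("year", "make", "model")]


def next_prompt(state):
    """State-driven dialog: choose the next prompt from what is still unknown."""
    for field, prompt in PROMPTS.items():
        if state.get(field) is None:
            return prompt
    return None  # every field is filled, so this dialog is complete
```

With a state of `{"year": 1990}`, `next_prompt` skips the year question and asks for the make. A real application would track many more variables, but the selection idea is the same.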
There isn't a clear dividing line between directed dialogs and natural dialogs, however, nor is there an agreed-upon approach to developing them. Directed dialogs may certainly use a mixed initiative dialog as a shortcut mechanism for power users, and natural dialogs will certainly use directed dialogs.
Developers usually differentiate between directed and natural dialogs in terms of whether or not the dialog is mixed initiative. A mixed initiative dialog allows callers to fill multiple form field slots in a single utterance. For example, the utterance "nineteen ninety ford escort" might fill the year, make and model fields at once, instead of requiring three separate prompts. This illustrates one way to describe a natural dialog versus a directed dialog: not necessarily because it's mixed initiative, but because speaking all three pieces of information together is probably more natural in a phone conversation, for example when describing the car you are selling.
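The slot-filling idea can be sketched in a few lines of Python. This is a toy under stated assumptions: the vocabularies below are invented, and the input is assumed to be text already produced by a speech recognizer. It is not how any particular recognizer or grammar format works.

```python
# Toy vocabularies, assumed for illustration; a real grammar would be far larger.
YEARS = {"nineteen ninety": 1990, "nineteen ninety one": 1991}
MAKES = {"ford", "honda"}
MODELS = {"escort", "civic"}


def fill_slots(utterance):
    """Fill as many of the year, make and model slots as one utterance allows."""
    words = utterance.lower().split()
    padded = " " + " ".join(words) + " "
    slots = {"year": None, "make": None, "model": None}
    # Try longer year phrases first so "nineteen ninety one" is not read as 1990.
    for phrase in sorted(YEARS, key=len, reverse=True):
        if " " + phrase + " " in padded:
            slots["year"] = YEARS[phrase]
            break
    for word in words:
        if word in MAKES:
            slots["make"] = word
        elif word in MODELS:
            slots["model"] = word
    return slots


def missing_slots(slots):
    """Slots the application still has to prompt for."""
    return [name for name, value in slots.items() if value is None]
```

Calling `fill_slots("nineteen ninety ford escort")` fills all three slots, so no follow-up prompt is needed. A caller who says only "a ford" leaves the year and model empty, and the application can fall back to directed prompts for those.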
What is important to understand is that while researchers have tested various natural dialog approaches, there is no "right way" or set of guidelines that will help you create natural dialogs. I think this may have something to do with the fact that scientists don't understand the natural human speech machine in enough detail to effectively model it on a computer. So remember, the definition of a natural dialog is a bit ambiguous and certainly open to interpretation. In some cases, a directed dialog may in fact be more natural than a mixed initiative dialog.
Evaluate the Existing Environment
I commonly see three environments that affect design decisions:
The company has an existing touch-tone IVR that they want to upgrade to speech to reduce call times or reduce the number of callers that bail out of the IVR to a live rep.
In this situation, introducing a new approach to callers who use the system regularly may actually cause more harm than good. While the number of new callers pressing zero may be reduced, more experienced callers may become frustrated with the new interface and end up bailing out. If repeat callers make up a high percentage of calls, the end result could actually be more callers bailing out than before. Great care must be taken in transitioning to a new system. A recommended approach is to introduce speech into the application gradually, so that callers can become acclimated to the new interface. Of course, the speech interface must be as good as or better than the existing touch-tone system. For example, if the speech system is nothing more than a voice-activated menu, callers may become frustrated when they experience recognition errors, which rarely occur with a touch-tone system. When upgrading a touch-tone system, a natural language system may be too radical a change for users and could result in more bail-outs.
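To make the repeat-caller arithmetic concrete, here is a minimal Python sketch. Every number in it is invented for illustration and does not come from any real system. It assumes 1,000 calls, of which 800 come from repeat callers, and compares total bail-outs before and after an upgrade that helps new callers but frustrates experienced ones.

```python
def total_bailouts(new_calls, repeat_calls, new_bail_pct, repeat_bail_pct):
    """Expected number of calls that bail out to a live rep.

    Bail-out rates are whole-number percentages for each caller group.
    """
    return (new_calls * new_bail_pct + repeat_calls * repeat_bail_pct) / 100


# Hypothetical traffic mix: 200 new callers and 800 repeat callers.
before = total_bailouts(200, 800, new_bail_pct=40, repeat_bail_pct=10)
after = total_bailouts(200, 800, new_bail_pct=20, repeat_bail_pct=20)
```

Under these assumed rates, the bail-out rate for new callers is cut in half, yet total bail-outs rise from 160 to 200 because the larger group of repeat callers now bails out twice as often. Plugging in your own call statistics is a quick way to see how much risk the caller mix carries.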
The speech IVR system is going to automate what a real person does on the phone now.
In this case it will be important to analyze the existing call flow. Unless the calls are entirely scripted, most human conversations tend to be composed of a series of open-ended dialogs that each have a goal or milestone, which may also depend on the results of a previous dialog. When analyzing the calls, you want to listen for these dialog milestones. You can usually spot a dialog milestone by listening for a dialog transition. A dialog transition occurs when one participant changes the topic or focus of the conversation, for example, "Ok, now I need to get your credit card information". When a transition occurs, it usually signals the end of the previous dialog. Dialogs will normally have data points, in which one participant communicates a piece of information that translates into a form field value.
This sounds easy enough, because dialogs and data points like these can be translated into VoiceXML dialogs. The difficulty is that human speech is usually much more complicated. The caller may change their mind about a previous data point during a subsequent dialog, after the transition has already occurred. For example, "Actually, I need two widgets instead of one, and I want to pay with a Visa card, but only if you can ship it overnight, otherwise I'd like to pay COD with something like UPS". This instruction is easy enough for a human to process, but incredibly complex for a computer. Another caller might flip back and forth between logical dialogs in a conversation.
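The simplest part of this problem, a caller revising an earlier data point after a transition, can be sketched in Python. The dialog names and fields below are hypothetical, and the sketch makes no attempt at the conditional instruction in the widget example, which is the genuinely hard part.

```python
# Hypothetical dialogs and the form fields (data points) each one collects.
DIALOGS = {
    "order": ["item", "quantity"],
    "payment": ["card_type"],
}


def owning_dialog(field):
    """Return the name of the dialog that normally collects this field."""
    for name, fields in DIALOGS.items():
        if field in fields:
            return name
    raise KeyError(field)


def apply_data_point(form, current_dialog, field, value):
    """Store a data point and report whether it revised an earlier dialog."""
    revised = owning_dialog(field) != current_dialog and field in form
    form[field] = value
    return revised
```

If the caller gives a quantity of one during the order dialog and later says "actually, I need two" during the payment dialog, `apply_data_point(form, "payment", "quantity", 2)` returns `True`. The application could use that signal to re-confirm the order before continuing with payment.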
There are two things to consider when evaluating how to automate an existing human dialog interface. First, will it be feasible to break down the human dialog interaction into discrete directed dialog components so that callers will be able to communicate the same information to the computer instead of a person in about the same amount of time? When analyzing the conversation, it may seem at first that this would be impossible, but as you listen to many dialogs, you will be able to identify common dialogs that can be broken down into discrete dialog components.
Second, will a directed dialog be usable from the caller's perspective? Will the directed dialog be so different from the human dialog that callers will become very confused and simply not use the system? Maybe not. If the human agent asks the same series of questions on every call, then a directed dialog would actually be more natural than trying to consolidate the conversation into a mixed initiative dialog.
On the other hand, if the entire conversation is dynamic and unscripted, it may actually be impossible to create a directed dialog. In these cases, creating a series of open-ended mixed initiative dialogs may make the most sense.
The IVR is a new application that will stand on its own or extend a Web application.
This is potentially the most difficult environment to work in, because you have to make a lot of assumptions about how the speech dialog "should" occur, whereas in the previous two scenarios there are existing calls that can be analyzed. Additionally, when integrating an IVR system with an existing Web application, some people have a tendency to think of the IVR as a telephone mirror of the Web application, when in fact the interfaces may necessarily be very different. The one advantage of integrating a speech IVR with an existing Web application is that somebody has already gone through the pain of breaking down the business logic into a programmatic structure. There is also a benefit to not having an existing call flow to analyze: there is no legacy of caller expectations to consider when deciding whether or not to use a natural mixed initiative dialog. If it turns out that an open-ended dialog will be the most natural and efficient way to program the application, we are free to do that (within resource and time constraints, of course).