Top 10 Best Practices for Voice User Interface Design

  • By Jonathan Eisenzopf

6. If natural dialogs fail, fall back to directed prompts

Mixed initiative dialogs are great because callers can fill multiple form fields with a single utterance. However, natural dialogs have a much higher speech recognition failure rate because of the gap between what callers think they can say and what they actually say. When a failure occurs in an <initial> element (the mixed initiative block), it's a good idea to drop out of mixed initiative mode into a more directed dialog, where callers are prompted for each piece of information, rather than trying to figure out what was said and adjusting the natural language dialogs.

The dialog listed below opens with a prompt that should be clear to most callers. However, a natural language prompt may elicit off-topic utterances that aren't expected. In this case, we provide a bit more information in the fall-back prompt and list the words that can be uttered. This creates the expectation that the caller can only say one of the listed teams, something the first prompt conveys only implicitly.

Computer: What's your favorite baseball team?
Caller:   I hate baseball.
Computer: Sorry, I didn't recognize that team.
          Here's a list of choices. Just say the
          name of the team when you hear it.
          Astros, Cubs, Dodgers.
Caller:   Dodgers!
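The fall-back flow above can be sketched in a few lines of Python. This is only a minimal illustration, not VoiceXML and not a real recognizer: the `listen` callback, `match_team`, and `ask_favorite_team` are hypothetical names, and the keyword matching stands in for a proper grammar.

```python
from typing import Callable, Optional, Sequence

# The teams listed in the fall-back prompt (taken from the example dialog).
TEAMS = ("Astros", "Cubs", "Dodgers")


def match_team(utterance: str, teams: Sequence[str] = TEAMS) -> Optional[str]:
    """Hypothetical stand-in for a grammar: return the team named, or None."""
    words = utterance.lower().replace("!", " ").replace(".", " ").split()
    for team in teams:
        if team.lower() in words:
            return team
    return None


def ask_favorite_team(listen: Callable[[str], str]) -> Optional[str]:
    """Try the open prompt once, then fall back to a directed prompt.

    `listen` is an assumed callback that plays a prompt and returns the
    caller's utterance as text.
    """
    open_prompt = "What's your favorite baseball team?"
    directed_prompt = (
        "Sorry, I didn't recognize that team. Here's a list of choices. "
        "Just say the name of the team when you hear it. "
        + ", ".join(TEAMS) + "."
    )
    for prompt in (open_prompt, directed_prompt):
        team = match_team(listen(prompt))
        if team is not None:
            return team
    return None  # still no match; the caller needs further error handling
```

The point of the sketch is the ordering: the natural prompt gets one chance, and any failure switches to the prompt that spells out the allowed words.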

7. Always confirm what was recognized

It's quite common for a speech recognition engine to return the wrong match for an utterance. For example, "sharp" and "shark" sound so similar that even if we get a match, we can't be sure it was what the caller actually said. Even for grammars that have little ambiguity, it's important to confirm what was recognized.

There are a number of techniques for doing this. The easiest way is to confirm immediately after recognition:

Computer: What is your first name?
Caller:   Jon.
Computer: So your name is Jon. Is that
          correct?
Caller:   Yes.

For dialogs that include multiple prompts, this technique is time-consuming and unnatural from a conversational perspective. Another technique, closer to the way humans interact, is to embed an implicit confirmation in the next prompt.

Computer: What is your first name?
Caller:   Martin.
Computer: Ok Mary, how old are you?
Caller:   No, my name is Martin.
Computer: Oh, sorry Martin. How old are you?
Caller:   Thirty two.

As you can see, we're repeating the caller's name, which provides an implicit confirmation. What we want to do in this case is to listen for a negative response such as "No" or "that's not right", which tells us that the name we recognized was in fact incorrect. This negative response may also be followed by the correct utterance, which should be recognized and re-confirmed.
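To make the idea of listening for a negative response concrete, here is a rough Python sketch. It assumes the reply is already available as text, and the function name and the two regular expressions are illustrative only; a real application would use a grammar that covers far more phrasings.

```python
import re
from typing import Optional, Tuple

# Assumed, deliberately small set of negative phrases.
NEGATIVE_PATTERN = re.compile(
    r"^\s*(no|nope|that's not right|that is not right)\b[\s,.!]*",
    re.IGNORECASE,
)
# Assumed carrier phrases that may precede the corrected value.
CORRECTION_PATTERN = re.compile(r"^(?:my name is|it's|it is|i said)\s+", re.IGNORECASE)


def interpret_reply(reply: str) -> Tuple[bool, Optional[str]]:
    """Return (rejected, correction) for a reply to an implicit confirmation.

    rejected   -- True if the reply starts with a negative phrase
    correction -- the corrected value that followed, if any
    """
    match = NEGATIVE_PATTERN.match(reply)
    if match is None:
        return False, None  # not a rejection; treat as an answer to the prompt
    rest = reply[match.end():].strip()
    rest = CORRECTION_PATTERN.sub("", rest).strip(" .!")
    return True, (rest or None)
```

A rejection without a correction means the name has to be asked for again; a rejection with a correction gives a new value that should itself be confirmed.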

NOTE: When using implicit confirmations, you should provide a summarized explicit confirmation of all values later in the dialog. For example:

Computer: I would like to confirm your
          information now. Your name is Martin.
          You are thirty two years old...
          Is all of this information correct?

The reason for this summarized explicit confirmation is that some speakers consider it impolite to disagree or to interrupt the dialog, and others do not know that they can in fact do so.

8. Generate prompts based on recognition confidence scores

You will find that speech recognition errors are par for the course rather than a rare occurrence. The goal then becomes how best to handle these errors, whether by changing prompts, falling back to directed dialogs, or transferring to a real person. One such technique is to pre-emptively change prompts or explicitly confirm values when the recognition confidence score falls below roughly 70% to 75%. Borrowing from our earlier example, instead of implicitly confirming a name in the next prompt, we want to interrupt the dialog flow to confirm or correct the value when the confidence score falls too low:

Computer: What is your first name?
Caller:   Martin. (recognized Mary, confidence = 25%)
Computer: Your name is Mary?
Caller:   No, Martin.
Computer: Oh, sorry Martin. How old are you?

This technique will reduce caller frustration, and it is the same error correction technique commonly used in natural spoken dialogs between two people. After all, human recognition is fallible too. No one expects you to hear everything they say all the time, but they do expect you to stop the conversation and correct your understanding of what is being said before moving on.
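A minimal sketch of this decision in Python, assuming the recognizer reports confidence as a number between 0 and 1. The 0.70 cutoff is simply the low end of the range mentioned above, and `next_prompt` is a hypothetical helper, so treat both as starting points to tune rather than fixed values.

```python
# Assumed cutoff, taken from the low end of the 70%-75% range discussed above.
CONFIRM_THRESHOLD = 0.70


def next_prompt(recognized_name: str, confidence: float,
                threshold: float = CONFIRM_THRESHOLD) -> str:
    """Pick the next prompt based on the recognition confidence score."""
    if confidence < threshold:
        # Low confidence: interrupt the flow and confirm explicitly.
        return f"Your name is {recognized_name}?"
    # High enough confidence: confirm implicitly inside the next question.
    return f"Ok {recognized_name}, how old are you?"
```

With the values from the example dialog, `next_prompt("Mary", 0.25)` produces the explicit confirmation, while a confident recognition moves straight on to the age question.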

9. Bail out if too many errors occur

There are cases where nothing you do will let you recognize what is being said. This can happen when there is too much background noise, when the spoken language is not the caller's native language, or when the caller simply doesn't understand what is being asked of them. There are of course other cases too, such as callers whose speech is slurred or unintelligible, or who are not talking into the microphone.

If the caller is important to you, kindly transfer the caller to a real person if the same error occurs more than twice. For example:

Computer: Would you like to check your
          account balance or transfer funds?
Caller:   What's my account balance?
Computer: I did not understand.
          Please repeat your request.
Caller:   What's my account balance?
Computer: I'm sorry, did you say account
          balance or transfer funds?
Caller:   What's my account balance?
Computer: I'm sorry, but I still didn't
          understand your request.
          Please hold while I transfer you to
          a customer representative.
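The bail-out rule can be sketched as a simple error counter. In this Python illustration, `recognize` is an assumed callback that plays a prompt and returns the matched choice, or None when recognition fails; `collect_choice`, `TRANSFER`, and the prompt list are hypothetical names built from the example dialog.

```python
from typing import Callable, Optional

MAX_ERRORS = 2  # transfer once the same error has occurred more than twice
TRANSFER = "transfer"  # sentinel meaning "hand the caller to a real person"

# Prompts escalate from open to more directed, as in the example dialog.
PROMPTS = (
    "Would you like to check your account balance or transfer funds?",
    "I did not understand. Please repeat your request.",
    "I'm sorry, did you say account balance or transfer funds?",
)


def collect_choice(recognize: Callable[[str], Optional[str]],
                   max_errors: int = MAX_ERRORS) -> str:
    """Return the recognized choice, or TRANSFER after too many failures."""
    errors = 0
    while errors <= max_errors:
        # Reuse the last prompt if there are more retries than prompts.
        prompt = PROMPTS[min(errors, len(PROMPTS) - 1)]
        result = recognize(prompt)
        if result is not None:
            return result
        errors += 1
    return TRANSFER
```

With the default limit the caller gets three attempts, which matches the dialog above; on the third failure the application would play the apology and transfer the call.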

10. Keep TTS output to a minimum

Unless you are using limited domain speech synthesis or have extremely dynamic content that simply can't be pre-recorded, do not use text-to-speech for prompts. Instead, use it as a fall-back for data or prompts that cannot be recorded.

In short, pre-record all prompts.
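One way to apply this rule is to look up a recording first and use text-to-speech only when none exists. The Python sketch below assumes prompts are keyed by their text and that recordings live in a simple dictionary; `render_prompt`, the file name, and the ("audio"/"tts", value) return shape are all hypothetical.

```python
from typing import Dict, Tuple


def render_prompt(text: str, recordings: Dict[str, str]) -> Tuple[str, str]:
    """Prefer a pre-recorded file; fall back to TTS only if none exists.

    `recordings` is an assumed mapping from prompt text to an audio file path.
    """
    audio_file = recordings.get(text)
    if audio_file is not None:
        return ("audio", audio_file)
    return ("tts", text)  # dynamic or unrecorded content


# Usage: a fixed prompt plays from a recording, dynamic data falls back to TTS.
recordings = {"What is your first name?": "prompts/first_name.wav"}
fixed = render_prompt("What is your first name?", recordings)
dynamic = render_prompt("Your balance is 42 dollars.", recordings)
```

Logging every prompt that falls through to the TTS branch gives a ready-made list of what still needs to be recorded.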


These 10 best practices represent the most common problems and solutions that I've run into over the past couple of years. I hope that you will find them valuable as you develop your own speech applications. If you have questions regarding any of these 10 best practices or disagree with one of them, send me an email at eisen@ferrumgroup.com because I'd like to hear your feedback.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, which specializes in Voice Web consulting and training. Feel free to send an email to eisen@ferrumgroup.com regarding questions or comments about this or any article.


This article was originally published on November 1, 2002
