Top 10 Best Practices for Voice User Interface Design

Developing a VoiceXML IVR is straightforward. Designing a quality speech interface, on the other hand, is a long road paved with assumptions, mistakes and failures. Read Jonathan’s 10 best practices for VUI design and learn from his mistakes before you fall into the same traps.


It’s common (and understandable) for developers who have worked with VoiceXML for a few months to realize that VoiceXML is fairly lightweight and easy to learn. Developing a speech recognition application that is usable by the general public, however, requires specific non-technical skills. These skills are rooted in an understanding of linguistics and of human factors for the telephone interface. This sort of experience comes over time as you learn from customers, make incorrect assumptions and fail. In this article, I will share my top 10 recommended “best practices” for voice user interface design in the hope that some of you will be able to avoid these common mistakes.

1. Use DTMF for long numbers

Do not ask callers to speak a long number (like a bank account number). Speech recognition fails frequently for words that contain few phonemes (the sounds that make up language), such as numbers and letters. Recognizing long spoken numbers might work in a lab environment, but imagine a real-world test: a caller on a cell phone, driving in the middle of rush hour with the window rolled down. Instead of speech, have callers enter the number on their telephone keypad. DTMF (touch-tone) input almost never fails.

The rule of thumb that I use is:

  • Limit spoken numbers to four digits or fewer

Examples of recognizable spoken digits are:

  • 4 digit PIN code
  • Year
  • DTMF menu

Examples of spoken digits that will fail:

  • credit card number
  • social security number
  • bank account number

Telephone numbers are in a gray area. I would recommend against recognizing spoken phone numbers unless you have a good reason for doing so.
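
The rule of thumb above can be sketched as a small routing helper. This is a minimal illustration in Python rather than VoiceXML; the function name `choose_input_mode` and the constant `MAX_SPOKEN_DIGITS` are hypothetical, and the threshold of four digits simply mirrors the rule stated above.

```python
# Hypothetical helper: decide whether to collect a number by voice or keypad.
MAX_SPOKEN_DIGITS = 4  # rule of thumb from this article, not a platform limit


def choose_input_mode(expected_digits: int) -> str:
    """Return "voice" for short numbers and "dtmf" for long ones."""
    if expected_digits < 1:
        raise ValueError("expected_digits must be at least 1")
    return "voice" if expected_digits <= MAX_SPOKEN_DIGITS else "dtmf"


print(choose_input_mode(4))   # 4 digit PIN code
print(choose_input_mode(16))  # credit card number
```

Under this sketch a ten-digit telephone number would also be routed to the keypad, which matches the recommendation above unless you have a good reason to do otherwise.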

2. Don’t use open-ended prompts

"Hello, thanks for calling Iron Chef Theaters. May I help you?"

This is an example of an open-ended prompt. While VoiceXML does allow for mixed initiative dialogs where callers have some control over the conversation, this prompt is ambiguous and will receive a very large range of responses. That greatly increases the complexity and time required to develop an application that will be pleasing to callers. At the same time, it dramatically reduces the likelihood that you will ever be able to do so, because callers will think they can say anything and expect the application to understand. Instead, use prompts that explicitly or implicitly provide the caller with a list of options.

"Thanks for calling Iron Chef Theaters. Would you like movie information, hours of operation, or would you like to speak to an operator?"

3. Use Anthropomorphism (but only in natural dialogs)

This item is hotly debated almost every time I present it, so I want to make sure you understand what I mean. Anthropomorphism is the concept of giving an inanimate object human traits. In a speech recognition application this means that the application likely refers to itself in the first person as though it were alive and aware of itself. For example:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What’s my account balance?
Computer: I did not understand. Please repeat your request.

The goal here is not to fool the caller into thinking that the application is actually a human being. In fact, that would cause many more problems than it solves. The purpose is to use language that is more naturally conducive to a verbal conversation.

An AT&T study on the topic showed that callers were more satisfied with applications that used first person in conversations even though callers knew that it was a computer rather than a human on the other end of the phone. 

This idea actually contradicts common practice for touch-tone applications, and that’s ok, because I’m only recommending the use of anthropomorphism for natural language (or mixed initiative) dialogs. Using anthropomorphism in directed dialogs may or may not be beneficial and I’m going to leave it up to you to decide if/when it’s appropriate in those cases.

4. Don’t repeat prompts

Callers tend to repeat the same utterance when a speech recognition failure occurs and the speech application simply repeats the same prompt. A good example of this problem is below:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What’s my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   I said what’s my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   Listen, I want my account balance!!!

To remedy this problem, you must change the prompt or intervene when an error occurs. A better dialog example that uses this technique is listed below:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What’s my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   What’s my account balance?
Computer: I’m sorry, did you say account balance or transfer funds?
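
One way to avoid repeating a prompt is to select it by the number of consecutive errors. The sketch below is a minimal Python illustration, not VoiceXML; the names `ERROR_PROMPTS` and `error_prompt` are hypothetical, and the wording is taken from the dialog above. VoiceXML offers a comparable mechanism through the `count` attribute on event handlers such as `<nomatch>`.

```python
# Hypothetical prompt ladder for one dialog state; each error gets new wording.
ERROR_PROMPTS = [
    "I did not understand. Please repeat your request.",
    "I'm sorry, did you say account balance or transfer funds?",
]


def error_prompt(error_count: int) -> str:
    """Return the prompt for the Nth consecutive recognition error (1-based)."""
    if error_count < 1:
        raise ValueError("error_count must be at least 1")
    # Stay on the most specific prompt once the ladder is exhausted.
    index = min(error_count, len(ERROR_PROMPTS)) - 1
    return ERROR_PROMPTS[index]


for attempt in (1, 2):
    print(error_prompt(attempt))
```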

5. Focus on grammar accuracy

Callers prefer natural dialog to touch-tone menus, but they will not overlook errors that may result. What this means is that you, as a developer, will spend a great deal of time tuning and refining your speech recognition grammars. You need to recognize that callers will expect (and take for granted) the same level of accuracy they have become used to with touch-tone systems. Touch-tone systems rarely fail to recognize a DTMF tone.

To reduce the likelihood of recognition failures:

  • Create prompts that make it clear what the user can and should say.
  • Test grammars with many different utterances from several people.
  • Record incoming calls once the system is in production and use this information to continually tune the grammars.
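
The second and third points above amount to measuring how many real utterances a grammar accepts. The sketch below is a deliberately simplified Python illustration: it treats a grammar as a flat list of accepted phrases, which real speech grammars are not, and the function name `grammar_coverage` and the sample utterances are hypothetical.

```python
def grammar_coverage(grammar_phrases, test_utterances):
    """Return (hit_rate, misses) for utterances checked against a phrase list."""
    if not test_utterances:
        raise ValueError("test_utterances must not be empty")
    accepted = {phrase.lower() for phrase in grammar_phrases}
    # Anything not in the accepted set is a candidate for grammar tuning.
    misses = [u for u in test_utterances if u.strip().lower() not in accepted]
    hit_rate = 1 - len(misses) / len(test_utterances)
    return hit_rate, misses


rate, misses = grammar_coverage(
    ["account balance", "transfer funds"],
    ["account balance", "my balance", "Transfer funds", "balance please"],
)
print(rate, misses)
```

The list of misses is the useful output: it shows which phrasings callers actually use that the grammar does not yet cover.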

Speech recognition requires more effort in design, development and maintenance. Take these for granted and you will end up with users demanding the old (but reliable) touch-tone IVR.

6. If natural dialogs fail, fall back to directed prompts

Mixed initiative dialogs are great in that callers can fill multiple form fields with a single utterance. However, natural dialogs have a much higher speech recognition failure rate because of ambiguities in what callers think they can say and actually do say. When a failure occurs in an <initial> element (the mixed initiative block), it’s a good idea to drop out of mixed initiative mode into a more directed dialog where callers are prompted for each piece of information rather than trying to figure out what was said and adjusting the natural language dialogs.

The dialog listed below starts with a prompt that should be clear to most callers. However, a natural language prompt may elicit unexpected, off-topic utterances. In this case, we provide a bit more information in the fall-back prompt and list the words that can be uttered. This creates the expectation in the caller that they can only say one of the listed teams, something the first prompt implies but does not state explicitly.

Computer: What’s your favorite baseball team?
Caller:   I hate baseball
Computer: Sorry, I didn’t recognize that team. Here’s a list of choices. Just say the name of the team when you hear it. Astros, Cubs, Dodgers
Caller:   Dodgers!
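
The fall-back shown above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `match_team` is a toy stand-in for a speech recognizer (it only looks for a team name in the text), and the function names and the success response are hypothetical.

```python
from typing import Optional

TEAMS = ["Astros", "Cubs", "Dodgers"]  # the choices listed in the dialog above


def match_team(utterance: str) -> Optional[str]:
    """Toy recognizer: return the first listed team named in the utterance."""
    text = utterance.lower()
    for team in TEAMS:
        if team.lower() in text:
            return team
    return None


def respond(utterance: str) -> str:
    """Answer a match, or fall back to a directed prompt listing the choices."""
    team = match_team(utterance)
    if team is not None:
        return f"Thanks, I heard {team}."  # hypothetical success response
    return (
        "Sorry, I didn't recognize that team. Here's a list of choices. "
        "Just say the name of the team when you hear it. " + ", ".join(TEAMS)
    )


print(respond("I hate baseball"))
print(respond("Dodgers!"))
```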

7. Always confirm what was recognized

It’s quite common for a speech recognition engine to return the wrong match for an utterance. For example, "sharp" and "shark" sound so similar, that even if we get a match, we can’t be sure whether it was what was actually said. Even for grammars that have little ambiguity, it’s important to confirm what was recognized.

There are a number of techniques for doing this. The easiest way is to confirm immediately after recognition:

Computer: What is your first name?
Caller:   Jon.
Computer: So your name is Jon. Is that correct?
Caller:   Yes.

For dialogs that include multiple prompts, this technique is time-consuming and unnatural from a conversational perspective. Another technique, closer to the way humans interact, is to embed an implicit confirmation in the next prompt.

Computer: What is your first name?
Caller:   Martin
Computer: Ok Mary, how old are you?
Caller:   No, my name is Martin
Computer: Oh, sorry Martin. How old are you?
Caller:   Thirty Two.

As you can see, we’re repeating their name, which provides an implicit confirmation. What we want to do in this case is to listen for a negative response such as "No" or "that’s not right" which tells us that the name we recognized was in fact incorrect. This negative response may also be followed by the correct utterance, which should be recognized and re-confirmed.
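
Listening for a negative response and an optional correction might look like the sketch below. It is a minimal Python illustration that works on recognized text, not audio; the function name `interpret_reply` and the list of known names are hypothetical, and a real application would rely on its recognition grammar instead of string matching.

```python
import re


def interpret_reply(reply: str, known_names):
    """Return (rejected, correction) for the reply to an implicit confirmation."""
    text = reply.lower().strip()
    words = re.findall(r"[a-z']+", text)
    # Treat a leading "no" or "that's not right" as a rejection.
    rejected = (bool(words) and words[0] == "no") or text.startswith("that's not right")
    correction = None
    if rejected:
        for name in known_names:
            if name.lower() in words:
                correction = name  # should be re-confirmed with the caller
                break
    return rejected, correction


print(interpret_reply("No, my name is Martin", ["Mary", "Martin"]))
print(interpret_reply("Thirty two", ["Mary", "Martin"]))
```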

NOTE: When using implicit confirmations, you should provide a summarized explicit confirmation of all values later in the dialog. For example:

Computer: I would like to confirm your information now. Your name is Martin. You are thirty two years old... Is all of this information correct?

The reason for this summarized explicit confirmation is that some speakers consider it impolite to disagree or interrupt the dialog or did not know that they could in fact do so.

8. Generate prompts based on recognition confidence scores

You will find that speech recognition errors are par for the course rather than a rare occurrence. The goal then becomes how to best handle these errors, whether by changing prompts, falling back to directed dialogs, or transferring to a real person. One such technique is to pre-emptively change prompts or explicitly confirm values when the recognition scores fall below 70%-75%. Borrowing from our earlier example, instead of implicitly confirming a name in the next prompt, we want to interrupt the dialog flow to confirm and/or correct the value when the recognition confidence score falls too low:

Computer: What is your first name?
Caller:   Martin (recognized Mary. confidence = 25%)
Computer: Your name is Mary?
Caller:   No, Martin
Computer: Oh, sorry Martin. How old are you?

This technique will reduce caller frustration and is actually the same error correction technique that is commonly used in natural spoken dialogs between two people. After all, human recognition is fallible too. No one expects you to hear everything they say all the time, but they do expect you to stop the conversation and correct your understanding of what is being said before moving on.
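
Choosing the confirmation style from the confidence score can be sketched as below. This is a minimal Python illustration; the names `CONFIRM_THRESHOLD` and `prompt_after_name` are hypothetical, the score is assumed to be normalized to the range 0 to 1, and the threshold uses the lower end of the 70%-75% range mentioned above. The right value for a given recognizer needs tuning.

```python
CONFIRM_THRESHOLD = 0.70  # assumption: lower end of the 70%-75% range above


def prompt_after_name(name: str, confidence: float) -> str:
    """Confirm explicitly on low confidence, implicitly otherwise."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence < CONFIRM_THRESHOLD:
        return f"Your name is {name}?"  # interrupt the flow and confirm
    return f"Ok {name}, how old are you?"  # implicit confirmation in next prompt


print(prompt_after_name("Mary", 0.25))
print(prompt_after_name("Martin", 0.90))
```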

9. Bail out if too many errors occur

There are cases where there is nothing that you will be able to do to recognize what is being said. This can happen when there is too much background noise, when the spoken language is not the caller’s native language, or when the caller simply doesn’t understand what is being asked of them. There are of course other cases too, such as callers who slur their speech or who are not talking into the microphone.

If the caller is important to you, kindly transfer the caller to a real person if the same error occurs more than twice. For example:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What’s my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   What’s my account balance?
Computer: I’m sorry, did you say account balance or transfer funds?
Caller:   What’s my account balance?
Computer: I’m sorry, but I still didn’t understand your request. Please hold while I transfer you to a customer representative.
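
The bail-out rule can be sketched as a small error counter. This is a minimal Python illustration; the class name `ErrorTracker` and the action labels are hypothetical, and the limit of two errors follows the "more than twice" rule stated above.

```python
MAX_ERRORS = 2  # transfer once the same error occurs more than twice


class ErrorTracker:
    """Count consecutive recognition errors for one prompt."""

    def __init__(self, max_errors: int = MAX_ERRORS) -> None:
        self.max_errors = max_errors
        self.errors = 0

    def record_error(self) -> str:
        """Register a failure and return the next action."""
        self.errors += 1
        return "transfer" if self.errors > self.max_errors else "reprompt"

    def record_success(self) -> None:
        """Reset the counter after a successful recognition."""
        self.errors = 0


tracker = ErrorTracker()
print([tracker.record_error() for _ in range(3)])
```

In the dialog above, the first two failures lead to reprompts and the third leads to the transfer, which is the sequence this sketch produces.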

10. Keep TTS output to a minimum

Unless you are using limited domain speech synthesis or have extremely dynamic content that simply can’t be pre-recorded, do not use text-to-speech for prompts. Instead, use it as a fall-back for data or prompts that cannot be recorded.

In short, pre-record all prompts.
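
A prompt player that prefers recordings and uses TTS only as a fall-back can be sketched as below. This is a minimal Python illustration; the prompt IDs, the file paths, and the function name `render_prompt` are hypothetical. VoiceXML's `<audio>` element supports a similar fall-back by rendering its content when the audio file cannot be played.

```python
# Hypothetical catalog of pre-recorded prompts, keyed by prompt ID.
RECORDED_PROMPTS = {
    "welcome": "audio/welcome.wav",
    "main_menu": "audio/main_menu.wav",
}


def render_prompt(prompt_id: str, text: str):
    """Return ("audio", path) when a recording exists, else ("tts", text)."""
    path = RECORDED_PROMPTS.get(prompt_id)
    if path is not None:
        return ("audio", path)
    return ("tts", text)  # fall-back for content that cannot be recorded


print(render_prompt("welcome", "Thanks for calling."))
print(render_prompt("balance_amount", "Your balance is 12 dollars."))
```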


These 10 best practices represent the most common problems and solutions that I’ve run into over the past couple of years. I hope that you will find them valuable as you develop your own speech applications. If you have questions regarding any of these 10 best practices or disagree with one of them, send me an email at eisen@ferrumgroup.com because I’d like to hear your feedback.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC  which specializes in Voice Web consulting and training. Feel free to send an email to eisen@ferrumgroup.com regarding questions or comments about this or any article.
