
Top 10 Best Practices for Voice User Interface Design

  • By Jonathan Eisenzopf

Developing a VoiceXML IVR is straightforward. Designing a quality speech interface, on the other hand, is a long road paved with assumptions, mistakes and failures. Read Jonathan's 10 best practices for VUI design and learn from his mistakes before you fall into the same traps.


It's common (and understandable) for developers who have been working with VoiceXML for a few months to come to the realization that VoiceXML is fairly lightweight and easy to learn. The ability to develop a speech recognition application that is usable by the general public, however, requires specific non-technical skills. These skills are rooted in an understanding of linguistics and of human factors for the telephone interface. This sort of experience comes over time as you learn from customers, make incorrect assumptions and fail. In this article, I will share my top 10 recommended "best practices" for voice user interface design with the hope that some of you will be able to avoid these common mistakes.

1. Use DTMF for long numbers

Do not ask callers to speak a long number (like a bank account number). Speech recognition fails frequently for words that contain few phonemes (the sounds that make up language), such as numbers and letters. Recognizing long spoken numbers might work in a lab environment, but try a real-world test: call from your cell phone while driving in rush-hour traffic with the window rolled down, and you can imagine what happens. Instead of speech, have callers enter the number on their telephone keypad. DTMF tones almost never fail.

The rule of thumb that I use is:

  • Limit spoken digit strings to 4 digits or fewer

Examples of recognizable spoken digits are:

  • 4 digit PIN code
  • Year
  • Single-digit menu choice (as on a DTMF menu)

Examples of spoken digits that will fail:

  • credit card number
  • social security number
  • bank account number

Telephone numbers are in a gray area. I would recommend against recognizing spoken phone numbers unless you have a good reason for doing so.
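
One way to apply this rule of thumb in VoiceXML is sketched below. Treat it as a minimal sketch rather than a definitive implementation: the form and field names (identify_caller, account_number, pin) and the 8 to 12 digit account length are assumptions made up for illustration. It also relies on the built-in "digits" grammar type, which the VoiceXML 2.0 specification leaves optional for platforms, so check what your platform supports. The inputmodes property restricts the account number field to keypad input, while the short PIN may be spoken or keyed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="identify_caller">
    <!-- Long number: accept keypad (DTMF) input only.
         The 8-12 digit length is a hypothetical account format. -->
    <field name="account_number" type="digits?minlength=8;maxlength=12">
      <property name="inputmodes" value="dtmf"/>
      <prompt>
        Please enter your account number on your telephone keypad,
        followed by the pound key.
      </prompt>
    </field>

    <!-- Short number: four digits, spoken or keyed
         (inputmodes defaults to "dtmf voice"). -->
    <field name="pin" type="digits?length=4">
      <prompt>Please say or enter your four digit PIN.</prompt>
    </field>

    <!-- Runs once both fields have been filled. -->
    <filled>
      <prompt>Thank you.</prompt>
    </filled>
  </form>
</vxml>
```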

2. Don't use open ended prompts

"Hello, thanks for calling iron chef theaters. May I help you?"

This is an example of an open-ended prompt. While VoiceXML does allow for mixed initiative dialogs where callers have some control over the conversation, this is an ambiguous prompt that will receive a very large range of responses. That greatly increases the complexity and time required to develop an application that will be pleasing to callers. At the same time, it dramatically reduces the likelihood that you will ever be able to do so, because callers will think they can say anything and expect the application to understand. Instead, use prompts that explicitly or implicitly provide the caller with a list of options.

"Thanks for calling iron chef theaters. Would you like movie information, hours of operation, or would you like to speak to an operator?"
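
A directed prompt like this maps naturally onto a VoiceXML menu. The following is a rough sketch under stated assumptions: the dialog ids (main_menu, movie_info, hours, operator) and the placeholder forms are hypothetical, and a real application would replace the placeholders with actual content and an operator transfer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main_menu">
    <!-- The prompt names every option the caller can choose. -->
    <prompt>
      Thanks for calling iron chef theaters. Would you like movie
      information, hours of operation, or would you like to speak
      to an operator?
    </prompt>
    <!-- Each choice's text is what the platform listens for. -->
    <choice next="#movie_info">movie information</choice>
    <choice next="#hours">hours of operation</choice>
    <choice next="#operator">operator</choice>
  </menu>

  <!-- Hypothetical placeholder dialogs for each option. -->
  <form id="movie_info">
    <block><prompt>Movie information goes here.</prompt></block>
  </form>
  <form id="hours">
    <block><prompt>Hours of operation go here.</prompt></block>
  </form>
  <form id="operator">
    <block><prompt>Operator transfer goes here.</prompt></block>
  </form>
</vxml>
```

Because the choices are spelled out in the prompt, the set of phrases to recognize stays small and matches what callers have just heard.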

3. Use Anthropomorphism (but only in natural dialogs)

This item is hotly debated almost every time I present it, so I want to make sure you understand what I mean. Anthropomorphism is the concept of giving an inanimate object human traits. In a speech recognition application this means that the application likely refers to itself in the first person as though it were alive and aware of itself. For example:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What's my account balance?
Computer: I did not understand. Please repeat your request.

The goal here is not to fool the caller into thinking that the application is actually a human being. In fact, that would cause many more problems than it solves. The purpose is to use language that is more naturally conducive to a verbal conversation.

An AT&T study on the topic showed that callers were more satisfied with applications that used first person in conversations even though callers knew that it was a computer rather than a human on the other end of the phone. 

This idea actually contradicts common practice for touch-tone applications, and that's ok, because I'm only recommending the use of anthropomorphism for natural language (or mixed initiative) dialogs. Using anthropomorphism in directed dialogs may or may not be beneficial and I'm going to leave it up to you to decide if/when it's appropriate in those cases.

4. Don't repeat prompts

Callers will tend to repeat the same utterance when a speech recognition failure occurs and the speech application simply repeats the same prompt. A good example of this problem is below:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What's my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   I said what's my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   Listen, I want my account balance!!!

To remedy this problem, you must change the prompt or intervene when an error occurs. A better dialog example that uses this technique is listed below:

Computer: Would you like to check your account balance or transfer funds?
Caller:   What's my account balance?
Computer: I did not understand. Please repeat your request.
Caller:   What's my account balance?
Computer: I'm sorry, did you say account balance or transfer funds?
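
In VoiceXML, one way to vary the error prompt is the count attribute on the nomatch handler, which selects a handler based on how many times the event has occurred in that field. The sketch below is an illustration only: the form and field names (banking, task) are hypothetical, and the option elements stand in for a real grammar. A natural utterance such as "What's my account balance?" would likely need a richer grammar than these two exact phrases.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="banking">
    <field name="task">
      <prompt>
        Would you like to check your account balance or transfer funds?
      </prompt>

      <!-- Simple phrase list standing in for a tuned grammar. -->
      <option>account balance</option>
      <option>transfer funds</option>

      <!-- First recognition failure: ask the caller to try again. -->
      <nomatch count="1">
        I did not understand. Please repeat your request.
      </nomatch>

      <!-- Second and later failures: change the prompt and
           restate the choices instead of repeating the same message. -->
      <nomatch count="2">
        I'm sorry, did you say account balance or transfer funds?
      </nomatch>

      <filled>
        <prompt>You chose <value expr="task"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```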

5. Focus on grammar accuracy

Callers prefer natural dialog to touch-tone menus, but they will not overlook errors that may result. What this means is that you, as a developer, will spend a great deal of time tuning and refining your speech recognition grammars. You need to recognize that callers will expect (and take for granted) the same level of accuracy they have become used to with touch-tone systems. Touch-tone systems rarely fail to recognize a DTMF tone.

To reduce the likelihood of recognition failures:

  • Create prompts that make it clear what the user can and should say.
  • Test grammars with many different utterances from several people.
  • Record incoming calls once the system is in production and use this information to continually tune the grammars.

Speech recognition requires more effort in design, development and maintenance. Take these for granted and you will end up with users demanding the old (but reliable) touch-tone IVR.
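
Grammar tuning of the kind described above usually ends up as small edits to a grammar file. The sketch below uses the W3C SRGS XML grammar format to show what that might look like for the banking example. It rests on assumptions: the rule names (task, balance, transfer) and every phrasing listed are hypothetical stand-ins for what you would actually collect from recorded calls, and semantic tags are left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="task">

  <!-- Top-level rule: the caller asks for one of two tasks. -->
  <rule id="task" scope="public">
    <one-of>
      <item><ruleref uri="#balance"/></item>
      <item><ruleref uri="#transfer"/></item>
    </one-of>
  </rule>

  <!-- Optional lead-in phrases, followed by the key phrase.
       New lead-ins are added as they show up in caller recordings. -->
  <rule id="balance">
    <item repeat="0-1">
      <one-of>
        <item>what's my</item>
        <item>i want my</item>
        <item>check my</item>
      </one-of>
    </item>
    <one-of>
      <item>account balance</item>
      <item>balance</item>
    </one-of>
  </rule>

  <rule id="transfer">
    <item repeat="0-1">
      <one-of>
        <item>i want to</item>
        <item>i'd like to</item>
      </one-of>
    </item>
    transfer
    <one-of>
      <item>funds</item>
      <item>money</item>
    </one-of>
  </rule>
</grammar>
```

Each phrasing added this way widens what callers can say, but it also gives the recognizer more to confuse, so re-test with several speakers after every change.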

Page 1 of 2

This article was originally published on November 1, 2002
