Demystifying 10 Common Misconceptions About VoiceXML

  • By Jonathan Eisenzopf

Hold it. Stop right there. If you're in the process of deciding whether to use VoiceXML technologies for a critical business application, you've probably made many assumptions. Most of them are probably wrong. Before you take the plunge, or even if you have already done so, you might want to read how you can avoid the 10 most common pitfalls with VoiceXML implementations.


Drawing on my experience over the past couple of years, I've noticed a common set of misconceptions held by customers, start-ups and vendors. While I provide a set of best practices for VoiceXML practitioners in my VoiceXML Bootcamp training course, I haven't yet written about the common mistakes made by customers, developers and vendors alike. These mistakes usually stem from flawed expectations, which in turn rest on incorrect assumptions. I imagine that many of you have already experienced the fallout of some of these misconceptions. For those who are new to VoiceXML, listen up. This list might save you some grief.

Speech Recognition is 98% accurate

This is a common figure touted by speech recognition vendors, but the number can be misleading. Speech recognition can indeed be as much as 98% accurate, as long as the speech grammars are limited and optimal. Limited means that the total number of possible grammatical combinations is relatively small. A grammar that has to match five hundred first names from a database will have an accuracy rate below 98%. A list of twenty names would be limited and could potentially reach 98%.

What I mean by optimal is that the possible phrases that users can speak are dissimilar from each other. An optimal grammar cannot allow speakers to provide single letters or numbers, which have a higher failure rate than longer words or phrases because they contain fewer phonemes (the basic sounds that make up a language). Additionally, 98% accuracy is rare in a noisy environment, such as a caller on a cell phone in a car in traffic with the window rolled down and the radio playing Puff Daddy.
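
To make "limited and optimal" a little more concrete, here is a rough sketch of a pre-deployment check on a grammar's phrase list. It is only an illustration: the function names, the 0.75 threshold and the sample names are my own assumptions, and spelling similarity is a crude stand-in for how alike two phrases actually sound. A real check would compare phoneme sequences from the recognizer's own lexicon.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(phrases, threshold=0.75):
    """Return pairs of grammar phrases whose spelling is suspiciously similar.

    Assumption: spelling similarity approximates acoustic similarity.
    That is only roughly true, so treat the result as a list of candidates.
    """
    flagged = []
    for a, b in combinations(phrases, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged


def too_short(phrases, min_chars=3):
    """Flag single letters, digits and other very short entries."""
    return [p for p in phrases if len(p.replace(" ", "")) < min_chars]


# Hypothetical grammar entries, for illustration only.
names = ["Jon", "John", "Joan", "Margaret", "Alexander", "B", "7"]
print(confusable_pairs(names))
print(too_short(names))
```

Entries that show up in either list are candidates for removal, rewording, or a confirmation prompt.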

The solution is to fall back to simpler grammars and step callers through a set of directed prompts rather than allowing them to speak more naturally; or to transfer them to a live representative. Your application must be prepared to offer alternatives when speech recognition fails--because it will fail at some point.
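
As a minimal sketch of that fallback logic, assume the application tracks consecutive recognition failures and steps down one level after a fixed number of them. The strategy names, the class and the limit of two failures are hypothetical, not part of VoiceXML or any gateway's API.

```python
from dataclasses import dataclass

# Hypothetical dialog strategies, ordered from most natural to most constrained.
STRATEGIES = ["open_prompt", "directed_prompt", "live_agent"]


@dataclass
class FallbackPolicy:
    max_failures_per_strategy: int = 2  # assumed limit; tune per application
    level: int = 0
    failures: int = 0

    @property
    def strategy(self):
        return STRATEGIES[self.level]

    def record(self, recognized):
        """Update state after one recognition attempt; return the strategy to use next."""
        if recognized:
            self.failures = 0
            return self.strategy
        self.failures += 1
        at_last_level = self.level == len(STRATEGIES) - 1
        if self.failures >= self.max_failures_per_strategy and not at_last_level:
            # Too many misses at this level: step down to a more constrained dialog.
            self.level += 1
            self.failures = 0
        return self.strategy


policy = FallbackPolicy()
for outcome in [False, False, False, False]:
    print(policy.record(outcome))
```

With four failures in a row, the caller moves from the open prompt to directed prompts and finally to a live representative.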

I don't think callers will like speech recognition

Opinions vary, and a few studies provide data on this issue. It is true that callers usually prefer to speak with a real person instead of a speech recognition system. However, when given the option between a touch-tone IVR and a speech recognition IVR, most callers will prefer a speech system.

Interestingly enough, one study by AT&T showed that older callers preferred speech while younger callers preferred touch-tone. However, in applications that contain more than three levels of menus or a complex series of prompts, most callers will prefer speech over touch-tone where speech can get the caller to their destination faster and more easily.

For example, let's consider an IVR system that allows a car dealer to check their inventory. In a touch-tone IVR system, the caller would either have to know the code for the given car make and model, or they would have to wait for the system to provide them with the corresponding number:

"For Ford, press 1. Acura, press 2. Honda, press 3."

A touch-tone system would also require 3 prompts and inputs: make, model, and year.

With a speech recognition system, the task could be accomplished faster and more conveniently:

"How many 2002 Ford Explorers do we have in stock?"

There are many more practical examples where speech provides a more convenient alternative to otherwise overly complex touch-tone interfaces.
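
To illustrate why the spoken inventory request above needs only one turn, here is a small sketch that pulls year, make and model out of a single recognized utterance and reports which of them the application would still have to prompt for. It works on text that a recognizer has already produced, not on audio, and the vocabulary and function names are hypothetical. A real grammar would be generated from the dealer's database.

```python
import re

# Hypothetical inventory vocabulary, for illustration only.
MAKES = {"ford", "acura", "honda"}
MODELS = {"explorer", "integra", "accord"}


def _match(word, vocabulary):
    """Return the vocabulary entry for a word, accepting a simple plural 's'."""
    if word in vocabulary:
        return word
    if word.endswith("s") and word[:-1] in vocabulary:
        return word[:-1]
    return None


def parse_inventory_request(utterance):
    """Pull year, make and model out of one recognized utterance."""
    slots = {"year": None, "make": None, "model": None}
    for word in re.findall(r"[a-z0-9]+", utterance.lower()):
        if re.fullmatch(r"(19|20)\d{2}", word):
            slots["year"] = int(word)
        elif _match(word, MAKES):
            slots["make"] = _match(word, MAKES)
        elif _match(word, MODELS):
            slots["model"] = _match(word, MODELS)
    return slots


def missing_slots(slots):
    """Slots the application still has to prompt for."""
    return [name for name, value in slots.items() if value is None]


request = parse_inventory_request("How many 2002 Ford Explorers do we have in stock?")
print(request, missing_slots(request))
```

When the caller leaves something out, the list of missing slots tells the application which directed prompt to play next, so the touch-tone style of one question per input becomes the exception rather than the rule.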

VoiceXML gateways are all the same

For the purpose of evaluating VoiceXML gateway vendors, it's easy to think, "Hey, they all support VoiceXML so they'll all function the same." It's been my experience that even though VoiceXML is a common standard, there are still areas of the specification that are left to interpretation, and certain limitations that vendors must address through proprietary mechanisms. For example, Nuance's TTS interprets the VoiceXML TTS tags differently than IBM's TTS. If you've timed and tuned the prosody for one, it'll sound completely different in the other.

A second area in which gateways differ is how they integrate with enterprise applications and databases. Some may provide tighter integration through application integration components, while others will leave the task to you.

A third area in which gateways differ is how they integrate with existing telephony infrastructures. Some gateways were really designed to stand alone and do not integrate well with an existing PBX, IVR, ACD or telephony switch. Others will provide tightly integrated support for very specific equipment vendors.

Make sure you understand the telephony equipment that the gateway will need to integrate with. Make sure you understand how the gateway will integrate with your applications and databases. Finally, assume that switching to a different vendor's gateway will require modifications to code. 


This article was originally published on November 6, 2002
