The goal of the VoiceXML Developer series is to provide a complete series
of tutorials that gives developers the insight they need to
develop professional quality VoiceXML applications. It’s recommended that
you read through each edition sequentially to attain a thorough understanding
of all of the concepts as they increase in difficulty.
A thorough understanding of XML basics and server-side scripting is assumed.
VoiceXML is an XML format that utilizes existing telephony technology
to interact with users over the telephone through speech recognition, speech synthesis,
and standard Web technologies. The first edition of the VoiceXML Developer series will provide you
with a synopsis of VoiceXML and a glimpse into the technology used to develop VoiceXML
applications. Subsequent editions will go into the specific details of creating them.
The VoiceXML 1.0 specification was released in March 2000 by the
VoiceXML Forum, which was founded by technologists from Lucent,
AT&T, IBM, and Motorola. The group was formed out of the need to
create a unified standard for voice dialogs rather than requiring customers to
learn several XML specifications that had been developed internally
within each of the members’ respective research labs (starting as early as 1995).
Other non-founders had also experimented with voice dialog XML formats
including HP’s TalkML and Sun’s Java Speech Markup Language (JSML).
All of this led up to October 2000, when the VoiceXML Forum released VoiceXML 1.0
to the Voice Browser Group (founded in 1998) of the World Wide Web Consortium (W3C),
the recognized standards body for the Web. This independent body has been working on
the second version of the specification and has announced that it will release a
revised specification toward the end of 2001.
The nascent industry has grown rapidly since its millennium debut into a
market that is expected to reach $200 million in 2001 and
$24 billion by 2005. The industry has been driven in part by an existing marketplace
that has utilized Interactive Voice Response (IVR) systems for call center automation;
think "Press 1 for your account balance. Press 2 to transfer funds."
You’ve probably used such a system to check your bank or credit card balances.
So VoiceXML fills an existing need for automation by improving upon the current
technology and making it simpler to implement and integrate into the rest of
the enterprise. VoiceXML also provides a new opportunity for companies that
have not been able to afford the cost or complexity of an IVR system,
because it uses standard telephony components and lets them leverage their
existing Web infrastructure, applications, and developer skills.
A VoiceXML system is made up of a VoiceXML gateway that accesses static or dynamic
VoiceXML content on the Web. The gateway contains a VoiceXML browser (interpreter),
Text-To-Speech (TTS), Automatic Speech Recognition (ASR), and the telephony hardware
that connects to the Public Switched Telephone Network (PSTN) via a T1, POTS, or
ISDN telephone connection. A Plain Old Telephone Service (POTS) line is the type
that’s installed in your home and can only handle a single connection whereas
a T1 contains 24 individual phone lines.
A voice dialog typically consists of the following steps:
- The caller dials the system from a fixed or mobile telephone; the call is
answered by the telephony hardware, which passes it to the VoiceXML browser.
- The VoiceXML gateway retrieves a VoiceXML document from the
specified Web server and plays a pre-recorded or synthesized prompt.
- The user speaks into the telephone
or presses a key on the phone keypad (generating DTMF tones).
- The telephony equipment passes the recorded sound to the ASR engine (if it’s speech),
which uses a predefined grammar contained in the VoiceXML document.
- The VoiceXML browser
executes the commands in the document based upon the ASR results (a match
against the grammar or not) and plays another pre-recorded or synthesized prompt
and waits for the user’s response.
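The loop above can be sketched as a single VoiceXML document. This is a minimal, hypothetical example (the field name, inline grammar, and script URL are invented for illustration, and inline grammar syntax varies by platform):

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="main">
    <field name="city">
      <prompt>Welcome. Which city would you like the weather for?</prompt>
      <!-- Hypothetical inline grammar; real syntax is platform-specific -->
      <grammar>boston | denver | seattle</grammar>
      <filled>
        <!-- On a match, submit the recognized value to a back-end script -->
        <submit next="http://example.com/weather.cgi" namelist="city"/>
      </filled>
      <nomatch>
        <!-- No grammar match: re-prompt and wait for another response -->
        Sorry, I didn't catch that. <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```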
Speech Recognition (ASR)
There are three leading products in the speech recognition world today: ViaVoice from
IBM, Nuance 7 from Nuance, and OpenSpeech Recognizer from SpeechWorks. The leading non-commercial
ASR is Sphinx, a project maintained by the speech group at Carnegie Mellon University.
ASR works by taking recorded audio from a telephony card and using advanced algorithms
to match it against dictionaries and grammars. A grammar defines the sets of words and
phrases that the application expects users to speak.
Let’s use a stock trading example. We might want to define a grammar
that recognizes the action the user wants to take (buy or sell),
the number of shares to trade, and the name or stock symbol of the company
to trade. So we would need to break the grammar down into the following parts:
- Recognize whether the user wants to buy or sell stock.
- Recognize the company name and associate it with a stock symbol.
- Recognize the number of shares to trade.
We would create a grammar rule for each item above and associate each
with a VoiceXML form field so that when the user says something like:
I want to sell 2000 shares of Microsoft stock.
the system will recognize that:
- the user wants to sell rather than buy
- the user wants to trade 2000 shares
- the user wants to trade Microsoft stock
This information is then returned to the VoiceXML interpreter,
which propagates the results into the VoiceXML field values which are in turn
submitted to a back-end script for processing.
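Sketched in VoiceXML, the three grammar rules above might map onto three form fields. This is a hypothetical fragment (the field names, inline grammars, and script URL are illustrative, and inline grammar syntax is platform-specific):

```xml
<form id="trade">
  <field name="action">
    <prompt>Do you want to buy or sell?</prompt>
    <grammar>buy | sell</grammar>
  </field>
  <field name="shares" type="number">
    <prompt>How many shares?</prompt>
  </field>
  <field name="company">
    <prompt>Which company?</prompt>
    <grammar>microsoft | ibm | lucent</grammar>
  </field>
  <filled>
    <!-- All three fields are filled (possibly by a single utterance in a
         mixed initiative dialog); pass the values to the back end -->
    <submit next="trade.cgi" namelist="action shares company"/>
  </filled>
</form>
```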
The leaders in the speech synthesis (TTS) world are less clear, but the current
front-runners are Speechify from SpeechWorks, Vocalizer from Nuance, and Fonix. Even the
best TTS engine is still sub-par for most listeners, so limit TTS use to dynamic
content that can’t be pre-recorded by a professional voice talent. It’s possible
that we’ll see high-quality speech synthesis using limited domain synthesis
techniques from companies like Cepstral, but when this technology will
reach the mainstream remains unclear.
TTS engines work using a number of algorithms that take pre-recorded speech
to form the sounds for words. As a starting point, the basic phonemes of the
language to be spoken (English) are recorded and filed away. These phonemes
are then combined to form words using a lexicon that tells the TTS what phonemes
make up a particular word. The words are combined to form sentences and so on
until the TTS has built the entire phrase, which is usually returned as a WAV file.
VoiceXML contains elements that control speech properties such as volume, rate, and pitch.
Unfortunately, vendors implement these features differently so tuning to your
specific platform is required.
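In VoiceXML 1.0, these controls take the form of speech markup inside prompts. A sketch (the attribute values here are illustrative, and as noted, each gateway may interpret or ignore them differently):

```xml
<prompt>
  <pros vol="0.8" rate="180">This sentence should be a bit louder
  and faster than normal,</pros>
  <break msecs="300"/>
  while this sentence uses the platform defaults.
</prompt>
```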
VoiceXML gateways contain one or more telephony cards that handle things such as
digital signal processing, call control, and call bridging. The leading card manufacturers
are Dialogic (owned by Intel), Natural MicroSystems, Brooktrout, and Aculab. For the
most part, VoiceXML abstracts away the existence of this hardware. The developer is able
to focus completely on developing the VoiceXML content generated by the Web server
rather than programming telephony cards. Most of the vendors support a wide range of
connection options including T1, E1, ISDN, and POTS.
While not as popular as interactive dialogs, VoiceXML can be used to synthesize
texts like books, articles, or even Web pages.
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="form1">
    <block name="block1">
      Hello, this is an example of a VoiceXML document using synthesized
      text. As you can hear, it's a bit choppy. But I might be able to
      pass as a Cylon from Battlestar Galactica.
    </block>
  </form>
</vxml>
The VoiceXML document above is a good example of a simple VoiceXML document.
vxml is the root element for VoiceXML documents in
the same way that html is the root element for HTML
documents. Most documents also contain a form element
that contains a combination of recorded or synthesized prompts as well as
form fields that users fill in with DTMF tones from keypad selections or from
spoken input. This example contains no fields, only a paragraph of text. Text
blocks are usually encapsulated inside a block element.
The steps above sum up the activities that make up a single dialog interaction.
In fact, most VoiceXML applications allow the user to hold a continuous dialog
until they hang up. There are actually two types of voice dialogs that VoiceXML
handles: directed and mixed initiative.
A directed dialog is one in which the system controls when and how the user
can interact with the system. A good example is the numerous IVR systems that allow
us to check our account balances. The system plays a pre-recorded prompt, giving
us a menu of selections and prompting us to push a number for a given item.
Once the selection has been made, the system either gives us the information
we’ve requested or plays another prompt for a sub-menu. For example:
Computer: For account balance, press one. For recent transactions posted to your account, press two. To transfer funds, press three.
User: 3 (DTMF)
Computer: To transfer from savings, press one. To transfer from checking, press two.
User: 1 (DTMF)
Computer: Please enter the amount to transfer using your keypad...
These systems are
effective but not friendly. They don’t allow the user to control the call flow
other than to select a pre-defined choice. VoiceXML provides the
<menu> tag, which gives us the same essential functionality as
a standard IVR system.
The value would be high enough if it gave us equivalent functionality,
but VoiceXML allows us to leverage recent advancements in speech recognition
quality to allow users to interact with systems in a more natural way;
through conversation. A mixed initiative dialog lets the user
make requests in the same way you might ask a co-worker for a piece
of information. It’s up to the VoiceXML developer to guide the user
towards the right verbal commands and then to recognize them. For example:
User: Transfer two hundred dollars from savings to checking.
Computer: Please verify that you want to transfer two hundred dollars from savings to checking by saying yes, or say no to start over.
User: Yes.
While choosing whether to use a directed dialog with menu selections or mixed
initiative dialogs depends on the need, let’s talk a little more about the specifics
of what VoiceXML can provide for menu-driven dialogs versus more open-ended dialogs.
First, like HTML forms, VoiceXML forms may contain multiple fields that can be filled in.
In fact, VoiceXML allows mixed initiative dialogs via the <form>
and <field> elements.
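A form becomes mixed initiative when it carries a form-level grammar, so a single utterance can fill several fields at once. A hypothetical sketch based on the transfer example (the grammar file and field names are invented):

```xml
<form id="transfer">
  <!-- Form-level grammar (hypothetical file) covering phrases like
       "transfer two hundred dollars from savings to checking" -->
  <grammar src="transfer.gram"/>
  <initial name="start">
    <prompt>What transfer would you like to make?</prompt>
  </initial>
  <!-- Each field is only prompted for if the user hasn't filled it yet -->
  <field name="amount">
    <prompt>How much would you like to transfer?</prompt>
  </field>
  <field name="source">
    <prompt>From which account?</prompt>
  </field>
  <field name="destination">
    <prompt>To which account?</prompt>
  </field>
</form>
```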
Despite the flexibility of a VoiceXML form, menus can also utilize voice recognition
technology in addition to recognizing phone key presses (or DTMF tones).
<menu dtmf="true">
  <prompt>What is your favorite color? For red, say red or press 1.
  For blue, say blue or press 2. For yellow, say yellow or press 3.</prompt>
  <choice next="red.vxml">red</choice>
  <choice next="blue.vxml">blue</choice>
  <choice next="yellow.vxml#yel">yellow</choice>
</menu>
The code segment above gives the user the choice of either using the phone
keypad to make a selection or by simply saying the color they prefer. The text inside
the choice element specifies the string that the ASR should try to
match. You can prompt for DTMF tones (“press 1”), spoken
text (“say red”), or both, as this example does.
VoiceXML Deployment Costs
I’m often asked how much a VoiceXML system costs to deploy. Fortunately,
the range is wide and it depends on whether you need a dedicated system or
are willing to outsource to a Voice Service Provider (VSP). A dedicated VoiceXML
gateway usually starts around $100,000 for the hardware, software, and installation
depending on how many concurrent callers you need to handle.
On the low end, VSPs usually charge you per minute so you only have to pay
for actual use. Prices are a few cents per minute more than you’re probably paying for
long distance service, and the top providers (TellMe, BeVocal, and Voxeo) are
all quite good in terms of national coverage and pricing.
There really isn’t a firm middle ground yet (below $100,000), but we should
expect to see offerings in the $30,000 to $50,000 range as competition heats up
and competitors move to serve demand in the mid-sized enterprise space. We will
look at specific products and product reviews in future articles so that
you have a better sense of what the options are.
Developing VoiceXML Applications
As mentioned previously, VoiceXML gateways retrieve VoiceXML files
over HTTP from any standard Web server. This also means that
dynamic applications can be built with the same languages and technologies
that you’re using to build Web applications today. This is truly one of
the great advantages of the technology. Furthermore, if you’ve gone to the
trouble of separating your business logic from the presentation logic,
you can leverage that same stored business logic to develop VoiceXML applications
by swapping the HTML presentation layer for VoiceXML content. JavaBeans,
CORBA, and .NET are all technology architectures that encourage this type
of logic/presentation separation.
If all of your code is still embedded in a JSP, ASP, or Cold Fusion page,
don’t fret. You can leverage the existing code into new templates or take
this opportunity to separate the code logic into libraries or components.
We will address this process in a future article.
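As a sketch of this separation, the same business-logic function can feed either an HTML or a VoiceXML template. This is a hypothetical example (the function names and account data are invented for illustration, not taken from any real system):

```python
# Hypothetical sketch: one piece of business logic, two presentations.

def get_balance(account_id):
    # Stand-in for the shared business logic (a real application would
    # query a database or component layer here).
    accounts = {"1001": 2543.75}
    return accounts[account_id]

def render_html(account_id):
    # Presentation for the visual Web.
    return "<p>Your balance is $%.2f.</p>" % get_balance(account_id)

def render_vxml(account_id):
    # Presentation for the voice Web: same logic, VoiceXML output.
    return (
        '<?xml version="1.0"?>\n'
        '<vxml version="1.0">\n'
        '  <form>\n'
        '    <block>Your balance is %.2f dollars.</block>\n'
        '  </form>\n'
        '</vxml>'
    ) % get_balance(account_id)

print(render_vxml("1001"))
```

Only the rendering functions differ; the balance lookup is written once and reused by both channels.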
Vendors and Tools Support
Support for VoiceXML is nonexistent in most of the Web development
tools that you might be using now, like Dreamweaver and BBEdit. However, you
can use an XML tool like XMLSpy to develop and
validate VoiceXML documents. There are also several VoiceXML editors available
from independent providers like Voice Studio from Cambridge VoiceTech
and V-Builder from Nuance that are shaping up fast.
Support from big vendors is on the horizon, however. IBM is one of the
few vendors that has integrated VoiceXML into its code editor for WebSphere.
This isn’t surprising, though, since IBM is one of the leading VoiceXML vendors.
The future of VoiceXML
The W3C hasn’t made it totally clear what the next steps are beyond
VoiceXML 2 other than the specification drafts that have been published
in the past year. It seems likely that VoiceXML will be broken up into
several specifications that control various aspects of a voice dialog,
like speech synthesis or grammars. This will provide clarity and drive
industry adoption. It will also create complexity. We’ll have to wait
and see the balance that’s chosen in moving the VoiceXML standard forward.
What is clear, however, is that VoiceXML (or whatever it becomes) is here to stay.
One large technology vendor that has remained silent for some reason is Microsoft.
I expect that we’ll see something like Voice.Net in the future. It’s worth
noting that Microsoft licensed technology from Lernout & Hauspie, which
was the leading voice technology vendor until it filed for bankruptcy
after creatively inventing some revenues in Asia.
Well, I hope you’ve enjoyed reading this introduction to VoiceXML as much
as I have enjoyed writing it. I hope that you’ll come back for the next edition of
the VoiceXML Developer series as we learn more about VoiceXML.
- Nuance V-Builder – http://extranet.nuance.com
- IBM WebSphere
- Cambridge VoiceTech Voice Studio – http://www.cambridgevoicetech.com
- Voxeo – http://www.voxeo.com
- BeVocal – http://www.bevocal.com
- VoiceGenie – http://www.voicegenie.com
- Cambridge VoiceTech – http://www.cambridgevoicetech.com
- CTLabs VoiceXML Portal Report
- Tellme More – http://www.voicexmlplanet.com
- VoiceXML Adventure – http://www.voicexmlplanet.com
- VoiceXML Planet – http://www.voicexmlplanet.com
- VoiceXML Forum – http://www.voicexml.org
- The Ferrum Group, LLC – http://www.ferrumgroup.com
About Jonathan Eisenzopf
Jonathan is a member of the Ferrum Group, LLC based in Reston, Virginia
that specializes in Voice Web consulting and training. He has also written
articles for other online and print publications including WebReference.com
and WDVL.com. Feel free to send an email to [email protected] regarding
questions or comments about the VoiceXML Developer series, or for more
information about training and consulting services.