In this issue of the VoiceXML Developer, we’ll begin a complete walkthrough
of all elements included in the VoiceXML 1.0 specification. This issue introduces the
basic elements used to mark up content for the voice Web. We will focus primarily on the
functionality that allows VoiceXML to control Text-To-Speech (TTS) output.
The root element of a VoiceXML document is the <vxml> element, which is
similar to the <html> tag in HTML. The root element is preceded by an XML
declaration and an optional document type declaration.
<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC '-//Nuance/DTD VoiceXML 1.0//EN'
  'http://voicexml.nuance.com/dtd/nuancevoicexml-1-2.dtd'>
<vxml version="1.0">
  <form>
    <block>Hello, can anybody hear me?</block>
  </form>
</vxml>
The DTD above points to the Nuance version of the VoiceXML 1.0 specification, which
the Nuance platform needs in order to run the document properly. On another platform,
change this DTD to the one your vendor supports, or remove the declaration altogether,
since VoiceXML itself does not require it.
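As a minimal sketch of the second option, here is the same document with the document type declaration removed. Whether a given platform accepts a document without a DOCTYPE is an assumption you should verify with your vendor.

```xml
<?xml version="1.0"?>
<!-- No DOCTYPE here: assumes the target platform does not need a vendor DTD -->
<vxml version="1.0">
  <form>
    <block>Hello, can anybody hear me?</block>
  </form>
</vxml>
```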
The <form> element is similar to an HTML form in that it can contain multiple
fields, which are filled out and submitted by a user. VoiceXML operates in a similar
manner, albeit with a different user interface. The <block> element, which is loosely the
VoiceXML equivalent of the HTML <p> tag, sends the enclosed text to a
TTS (Text-To-Speech) engine to be synthesized.
A VoiceXML document
The following is a first look at a complete VoiceXML document that utilizes the
elements that we’ll be learning about today. If you are using a VoiceXML editor
such as V-Builder, you should be able to cut and paste the example into your editor
and play it. To demo this VoiceXML example, call VoiceXML Planet at 510-315-6666.
At the first menu, press 1. At the demo menu, press 1 again to hear the example below.
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="form1">
    <block name="block1">Hello, this is an example of a Voice XML document
      using synthesized text. As you can hear, it's a bit choppy. But I might
      be able to pass as a silon from battle star galactica.
    </block>
    <block name="block2">
      <prompt>Voice XML provides some features for controlling how I pronounce
        words and phrases. For example, you can create a pause.
        <break size="large" msecs="5000" />
        I can also emphasize a phrase. John Bigbootae, I
        <emp level="strong">must</emp> have that overthruster!
      </prompt>
    </block>
    <block name="block3">
      <pros vol="1" rate="-50%"><audio src="../prompts/prompt1.wav" />
        synthesized prompts.</pros>
    </block>
    <block name="block4">
      <prompt>Sometimes, you may need to tell me how to pronounce a phrase such
        as a date, currency or abbreviation. Please mail
        <sayas class="currency">$10,000.55</sayas> into
        <sayas sub="world wide web consortium">W3C</sayas> account number
        <sayas class="digits">55432</sayas> by,
        <sayas class="date">October 11, 2001</sayas> or call,
        <sayas class="phone">800-555-1212</sayas>
      </prompt>
    </block>
    <block name="block5">
      <prompt>
        You can also control the
        <pros pitch="+50%"> prosity of
          <pros vol="1" rate="-50%"> my speech including volume, pitch, and
            speaking rate.</pros></pros>
      </prompt>
    </block>
  </form>
</vxml>
The example above contains five <block> elements. The first block
contains nothing but text, which is synthesized by the TTS engine. The second block
creates a pause with the <break> element and adds emphasis to a synthesized
phrase with the <emp> element. The third block plays a pre-recorded prompt
with the <audio> element, followed by synthesized text, and wraps both in
<pros> to set the volume to its maximum and decrease the speaking rate.
The fourth block uses <sayas> to pronounce common character classes (in this case
currency, digits, a date, and a phone number) and to substitute spoken text for an
abbreviation. The fifth block nests <pros> elements to change the pitch, volume,
and speaking rate.
Playing pre-recorded prompts with <audio>
<audio src="hi.wav">Hello there</audio>
The <audio> element plays a pre-recorded prompt. The src attribute specifies
the URL of the audio file (usually a WAV file). The <audio> element may also
contain text, which the TTS engine synthesizes if the server cannot retrieve the sound file.
We will cover the process of recording prompts in more detail in a future article.
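To put the fragment in context, the sketch below wraps it in a complete document. The file name welcome.wav is hypothetical; substitute the URL of a recording that your server actually hosts.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- welcome.wav is a hypothetical file. The enclosed text is
             synthesized only if the recording cannot be retrieved. -->
        <audio src="welcome.wav">Welcome to the demo line.</audio>
        <!-- Plain text after the recording is always synthesized. -->
        Please listen carefully to the following options.
      </prompt>
    </block>
  </form>
</vxml>
```

Writing fallback text that matches the recording word for word keeps the caller's experience consistent when the audio file is unavailable.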
Controlling pitch, volume, and speed of TTS
You can emphasize synthesized words and phrases with the <emp>
element. For example:
<emp level="strong">Officer</emp>, you must have mistaken my Dodge Dart for another lime green automobile.
The level attribute can be set to strong, moderate, none, or reduced, depending on
the emphasis you desire. The default is moderate.
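A rough way to compare levels by ear is to speak the same sentence at each one. The sentences below are made up for illustration, and since TTS engines differ, how distinct the levels sound (or whether all of them are honored) depends on your platform.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- Strongest emphasis -->
        I <emp level="strong">never</emp> said that.
        <!-- No level attribute: the default (moderate) applies -->
        I <emp>never</emp> said that.
        <!-- De-emphasized -->
        I <emp level="reduced">never</emp> said that.
      </prompt>
    </block>
  </form>
</vxml>
```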
The <pros> (short for prosody) element, on the other hand,
controls pitch, volume, and speed. For example:
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form>
    <block name="block5">
      <prompt>
        <pros pitch="+90%" rate="+40%">Hey turtle, you wanna race. Come on.</pros>
        <pros pitch="-40%" rate="-30%">Now rabbit, how many times do I have to
          win before you give up?</pros>
      </prompt>
    </block>
  </form>
</vxml>
In the example above, we increase the pitch and speaking rate when the rabbit
speaks and reduce the pitch and rate when the turtle speaks. The attributes
of the prosody element can be raised or lowered by a relative percentage, as shown here.
The rate attribute sets the speaking rate, which is measured in words per minute when
given as an absolute number, while the vol attribute controls the volume (1 is the maximum).
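The example above uses relative percentages only. As a sketch of the absolute forms just described, the fragment below sets a words-per-minute rate and a fractional volume. The values 120 and 0.5 are illustrative assumptions, and some engines may accept only relative values, so test this on your own platform.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- Assumed absolute rate: about 120 words per minute -->
        <pros rate="120">This sentence is spoken at a fixed rate.</pros>
        <!-- Assumed fractional volume: half of the maximum of 1 -->
        <pros vol="0.5">This sentence is spoken more quietly.</pros>
      </prompt>
    </block>
  </form>
</vxml>
```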
The controls for defining prosody were borrowed from the Java Speech Markup
Language developed by Sun (see the Resources section at the end of the article).
Use <sayas> to pronounce special character classes
I mentioned a little earlier that VoiceXML is capable of pronouncing certain
classes of text. For example, you wouldn’t want the TTS engine to pronounce
$220.25 as "dollar-two-two-zero-period-two-five".
Rather, you would want it to say, "Two hundred twenty dollars and twenty-five
cents". VoiceXML borrows the <sayas> element from JSML as well. The five
built-in classes defined in the JSML specification are date, digits, literal,
number, and time; the examples in this article also use the currency and phone
classes. Let’s take a look at a few examples:
Your speeding ticket comes to <sayas class="currency">$250.00</sayas> plus tip.
You must pay the fine by <sayas class="date">December 1, 2002</sayas>.
Prisoner <sayas class="digits">5164</sayas>, what are you in for?
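The time and number classes work the same way. The sketch below is illustrative only: the input formats shown (12:45 and 1500) are assumptions, since the text each class accepts can vary between TTS engines.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- Assumed input format for the time class -->
        Visiting hours end at <sayas class="time">12:45</sayas>.
        <!-- Spoken as a quantity, not digit by digit -->
        The parking lot holds <sayas class="number">1500</sayas> cars.
      </prompt>
    </block>
  </form>
</vxml>
```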
The <sayas> element also provides a sub attribute, which
allows us to control how the TTS engine pronounces words, phrases or abbreviations.
<sayas sub="world wide web consortium">W3C</sayas>
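Within a prompt, sub can expand several abbreviations in the same sentence. The substitutions below are made-up examples for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- The element content is what appears in the text;
             the sub attribute is what the caller hears. -->
        Turn left on Main <sayas sub="street">St.</sayas> and ask for
        <sayas sub="doctor">Dr.</sayas> Jones.
      </prompt>
    </block>
  </form>
</vxml>
```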
Control pauses with <break>
The <break> element inserts a pause in the spoken output. It can be
used inside <audio>, <prompt>, and <pros> elements. The
length of the pause is specified in milliseconds by the msecs attribute. For example:
<block>
  <prompt>The current temperature in San Francisco is fifty eight degrees.
    <break msecs="5000"/>
    The traffic on the golden gate bridge is . . .
  </prompt>
</block>
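The first complete example in this article also set a size attribute on <break>. As a sketch, a relative size can stand in for an exact millisecond count. The value small is an assumption here (only large appears earlier in the article), and how long each size lasts is up to the TTS engine.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        Your order has shipped.
        <!-- Relative pause: the engine chooses the exact duration -->
        <break size="small"/>
        It should arrive on Tuesday.
        <!-- Exact pause of one second -->
        <break msecs="1000"/>
        Thank you for calling.
      </prompt>
    </block>
  </form>
</vxml>
```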
We will continue our tour of VoiceXML in the next issue. For now, some closing
thoughts on the elements that have been introduced so far. First, be forewarned that each
TTS engine is different. For example, it seems that one TTS engine can count the
milliseconds given to the <break> element differently than another. In addition, support
for the TTS components of the VoiceXML specification remains spotty and inconsistent.
Some implementations may not recognize certain elements at all. Finally, when using
elements like <pros> and <sayas>, make sure that the platform you’re testing on
is the same platform you’re deploying on, or you will be in for big surprises.
Well, that’s it for now. I’ll see you next time as VoiceXML Developer
continues to dig deep into the voice Web.
About Jonathan Eisenzopf
Jonathan is a member of the Ferrum Group, LLC, a firm based in Reston, Virginia,
that specializes in Voice Web consulting and training. He has also written
articles for other online and print publications, including WebReference.com
and WDVL.com. Feel free to send an email to [email protected] regarding
questions or comments about the VoiceXML Developer series, or for more
information about training and consulting services.