
How To Develop IVR Applications with Microsoft Speech Server


The Microsoft Speech Server (MSS) 2004 was launched in March of this year. MSS 2004 is a flexible, integrated, Web-based platform for both speech-enabled interactive voice response (IVR) and Web applications. It is used in conjunction with the Microsoft Speech Application Software Development Kit (SASDK), which integrates seamlessly and directly with the MS Visual Studio .NET development environment. The Microsoft Speech Server enables enterprises to deploy speech applications cost-effectively and to merge their Web and voice/speech infrastructures into unified applications with both speech and visual access.

This article is the first in a two-part series on building interactive voice response (IVR) systems with both MSS and the SASDK. This first installment focuses on grammar and prompt design when building a speech application.

Normally, the life cycle of building a speech-enabled IVR application comprises four stages: design, development, deployment, and tuning. All of these stages revolve around the three key elements of speech-enabled applications: dialogs, grammars, and prompts.

If you have any IVR development experience, your first question may be whether you need to set up a development environment with telephony hardware. You do not: while developing and testing a speech-enabled IVR (voice-only) application with the MS SASDK, no telephony hardware interface is required, because the SASDK includes a telephony simulator and the application runs on Windows IIS under Windows 2000/XP/2003. Once you have completed coding and unit testing and want to deploy your speech IVR application on MSS 2004, you must install and configure a TIM (Telephony Interface Manager) and telephony boards on the TAS server of MSS.


Grammars are intended for use by a speech application's recognizer. In a speech-enabled application, a grammar is a set of structured rules that identify words or phrases and specify the valid responses to a prompt when collecting spoken input.

The World Wide Web Consortium (W3C) Speech Recognition Grammar Specification Version 1.0 presents the grammar syntax in two forms: an Augmented BNF (ABNF) Form and an XML Form. ABNF is a plain-text (non-XML) representation similar to a traditional BNF grammar; the JSpeech Grammar Format (JSGF), used in some VoiceXML-based speech application development environments, is derived from ABNF. The XML Form represents the same grammar constructs with XML elements. The Microsoft Speech Application SDK Version 1.0 (SASDK) currently supports the XML-based grammar format.

The Microsoft Speech Application SDK Version 1.0 (SASDK) provides the Speech Grammar Editor tool. This tool presents a graphical approach to creating grammars in the Microsoft Visual Studio .NET 2003 development environment. The tool also provides syntax validation to assist the developer with grammar debugging.

The rule is the basic unit of a grammar in the SASDK. A grammar must contain at least one rule that defines a pattern of words and/or phrases. If the caller's input matches that pattern, the rule is matched by the IVR application.

On the MS speech platform, a grammar takes one of two forms: a grammar file or an inline (static) grammar. Grammar files can be either XML files (.grxml extension) or compiled binary files (.cfg extension). Inline grammars exist entirely within the code of a speech-enabled Web application; the QA control supports both a grammar file and an inline grammar at the same time. You can use the Grammar Editor tool to build grammar files graphically.
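For illustration, here is a minimal sketch of what an inline grammar can look like when embedded directly in SALT page markup instead of being referenced from a .grxml file. The element structure follows the SALT listen/grammar pattern; the id and rule names are hypothetical, not taken from a specific application:

```xml
<!-- Illustrative sketch: an inline yes/no grammar embedded in a SALT
     listen element rather than referenced through a src attribute. -->
<salt:listen id="listenConfirm">
   <salt:grammar>
      <grammar xml:lang="en-US" mode="voice">
         <rule id="YesNo" scope="public">
            <one-of>
               <item>yes</item>
               <item>no</item>
            </one-of>
         </rule>
      </grammar>
   </salt:listen>
</salt:listen>
```

The trade-off is maintainability: an inline grammar travels with the page, whereas a grammar file can be shared, compiled to .cfg, and reused across controls.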

In a real-world speech application, too strict a grammar leaves the caller no flexibility in what he or she can say. Conversely, packing the grammar with unnecessary items can lower recognition accuracy. The following grammar example transfers a call from a speech-enabled IVR to either an appropriate phone queue or a call center agent.

<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/"
         xml:lang="en-US" tag-format="semantics-ms/1.0"
         mode="voice">
   <!-- This is a transfer grammar used in a speech-enabled IVR -->
   <rule id="Transfer" scope="public">
      <item>Transfer to agent please</item>
      <tag>$.Transfer = $recognized.text</tag>
   </rule>
</grammar>

Because grammar files are simply XML files, grammars can also be created programmatically. The MS SASDK is SALT based, but even without any SALT language skills you can build a speech-enabled application on an MS Speech box. If you prefer, you can also write SALT directly to implement a speech IVR on MS Speech Server.
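As a minimal sketch of that idea (plain JScript string-building, not a SASDK API; the function and department names are hypothetical), a transfer grammar like the one above could be assembled at run time from a list of departments:

```javascript
// Illustrative sketch: build a transfer grammar as an XML string.
// Not a SASDK API; buildTransferGrammar and the departments are
// hypothetical names for this example.
function buildTransferGrammar(departments) {
   var items = "";
   for (var i = 0; i < departments.length; i++) {
      items += "         <item>transfer to " + departments[i] + "</item>\n";
   }
   return '<grammar xml:lang="en-US" tag-format="semantics-ms/1.0" mode="voice">\n' +
          '   <rule id="Transfer" scope="public">\n' +
          "      <one-of>\n" +
          items +
          "      </one-of>\n" +
          "      <tag>$.Transfer = $recognized.text</tag>\n" +
          "   </rule>\n" +
          "</grammar>";
}

var grammarXml =
   buildTransferGrammar(["sales", "marketing", "technical support"]);
```

The generated string can then be written out as a .grxml file or served to the client as an inline grammar.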


A prompt is a question or piece of information spoken by a speech application. Typically, a prompt is a question, such as "To what extension do you want to transfer?" It can also be a greeting, such as "Hello, this is the ABC Corporation customer service line," or a menu of choices, such as "Sales, press one or say sales; Marketing, press two or say marketing; Technical Support, press three or say technical support."

In an MS Speech Server-based speech-enabled application, prompts are the only interface through which a voice-only application (speech-enabled IVR) interacts with the user; multimodal applications do not have prompts. Prompts serve a number of functions in an application. Prompt functions can contain JScript code that lets an application generate dynamic prompts at run time; use the Prompt Function Editor to create and edit prompt functions.

The Speech Prompt Editor, included in the Microsoft Speech Application SDK Version 1.0 (SASDK), provides an interface for creating prompts. Use it to create, edit, maintain, and manage every aspect of a prompt database; each prompt project contains one or more prompt databases. A prompt database holds all the audio and data that define the application's prompts: the recorded prompts (.wav files) and their transcriptions.

The Wave Editor is another useful tool within the MS SASDK, used to improve prompt quality. It displays a graphical view of the .wav files in which the prompt database stores prompts, and it lets you edit the word boundaries within a .wav file and cut, copy, and paste wave segments both between and within .wav files.

When used in voice-only applications, the QA, Command, and Application Speech Controls can include prompts among their properties. You can add a prompt to a control in one of two ways: as an inline prompt or through a prompt function.

An inline prompt is a piece of static text that the prompt engine plays when a control is activated at run time. A good example is a WelcomeQA control that plays a greeting whenever a caller dials into the IVR.

A prompt function can dynamically generate one or more prompts based on the IVR application's dialog call flow at run time. Use the Prompt Function Editor to add prompt functions to a QA, Command, or Application Speech Control. For instance, a customer service hotline may generate a dynamic prompt depending on the customer's service level, looked up from the ANI (automatic number identification) retrieved from the telephony interface; the ANI then serves as a variable in the prompt function.

Prompt functions are written in JScript and run on the client side of Web-based, speech-enabled IVR applications. The Prompt Function Editor stores prompt functions with a .pf extension. The .pf files it generates are UTF-8 encoded and contain an XML header that allows the Prompt Function Editor to maintain the file and to validate prompts at design time with the Prompt Validation tool. The Prompt Validation tool cannot validate dynamically generated prompts created by prompt functions that reside in a .js file.

Speech quality is important in speech-enabled IVR applications. A pre-recorded audio prompt is still the most desirable output today, but TTS-synthesized speech has advanced enough to be an effective alternative, and the two can be used together. Consider the two levels of speech quality an application delivers: in the development and testing stages, a synthesized TTS voice is sufficient, whereas a production application generally needs pre-recorded prompts. At run time, if the prompt engine fails to locate a pre-recording for a segment, the application falls back to a TTS engine to synthesize that segment's speech.

A prompt function example is as follows:

function PromptFunctionExample()
{
   var sPrompt = "";
   var sFailedMatchPrompt = "Sorry, could not find your information " +
                            "in the database; please try again. ";
   var sMain = "Please say or enter your 6 digit account number now";
   var sBasic = "This is the Basic service line ";
   var sPremium = "This is the Premium service line ";

   // siServiceLevel is a semantic item populated earlier in the dialog
   if (siServiceLevel.value == "PREMIUM")
   {
      sMain = sPremium + sMain;
   }
   else
   {
      sMain = sBasic + sMain;
   }

   .. .. ..
}


About the Author

Xiaole Song is a professional in designing, integrating, and consulting on telecommunications, CTI, IVR, speech, call center, and IP telephony solutions. Feel free to e-mail any comments about this article to xiaole_song@yahoo.com.
