Building a Speech-Enabled IVR Application Using Microsoft Speech Server 2004: Grammar and Prompts
Microsoft Speech Server (MSS) 2004 was launched in March 2004. MSS 2004 is a flexible, integrated, Web-based platform for both speech-enabled interactive voice response (IVR) applications and Web applications. It is used in conjunction with the Microsoft Speech Application Software Development Kit (SASDK), which integrates directly with the MS Visual Studio .NET development environment. Microsoft Speech Server lets enterprises deploy speech applications cost-effectively and merge their Web and voice/speech infrastructure to create unified applications with both speech and visual access.
This article is the first in a two-part series on building interactive voice response (IVR) systems with MSS and the SASDK. This first installment focuses on grammar and prompt design for a speech application.
Normally, the life cycle of a speech-enabled IVR application consists of four stages: design, development, deployment, and tuning. All of the stages revolve around the three key elements of a speech-enabled application: dialog, grammar, and prompts.
If you have IVR development experience, you may wonder whether you need to set up a development environment with telephony hardware first. When you develop and test a speech-enabled IVR application (a voice-only application) with the MS SASDK, you do not need to install a telephony hardware interface right away: the SASDK includes a telephony simulator, and Internet Information Services (IIS) is available on Windows 2000/XP/2003. Once coding and unit testing are complete and you are ready to deploy the application on MSS 2004, you must install and configure a TIM (Telephony Interface Manager) and telephony boards on the Telephony Application Services (TAS) server of MSS.
Grammars are intended for use by speech recognizers. In a speech-enabled application, a grammar is a set of structured rules that identify the words or phrases a caller may say and specify the valid responses to a prompt when collecting spoken input.
The World Wide Web Consortium (W3C) Speech Recognition Grammar Specification (SRGS) Version 1.0 defines two forms of grammar syntax: an Augmented BNF (ABNF) Form and an XML Form. The ABNF Form is a plain-text (non-XML) representation that is similar to traditional BNF grammar; it is derived from the JSpeech Grammar Format (JSGF), which is used in some VoiceXML-based speech application development environments. The XML Form uses XML elements to represent the same grammar constructs. The Microsoft Speech Application SDK Version 1.0 (SASDK) currently supports the XML-based grammar format.
The Microsoft Speech Application SDK Version 1.0 (SASDK) provides the Speech Grammar Editor tool. This tool presents a graphical approach to creating grammars in the Microsoft Visual Studio .NET 2003 development environment. The tool also provides syntax validation to assist the developer with grammar debugging.
The rule is the basic unit of a grammar in the SASDK. A grammar must contain at least one rule, which defines a pattern of words and/or phrases. If the caller's input matches that pattern, the rule is matched by the IVR application.
On the MS speech platform, a grammar takes one of two forms: a grammar file or an inline (static) script. Grammar files are either XML files with the .grxml extension or compiled binary files with the .cfg extension. Inline grammars exist entirely within the code of a speech-enabled Web application. The QA control can use a grammar file and an inline grammar at the same time. You can use the Grammar Editor tool to build grammar files graphically.
In a real-world speech application, a grammar that is too strict leaves callers little flexibility in what they can say. Conversely, a grammar padded with unnecessary items can lower recognition accuracy. The following grammar example transfers a call from a speech-enabled IVR to either an appropriate phone queue or a call center agent.
<grammar xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
         xml:lang="en-US" tag-format="semantics-ms/1.0" version="1.0" mode="voice"
         xmlns="http://www.w3.org/2001/06/grammar">
  <!-- Transfer grammar used in a speech-enabled IVR -->
  <rule id="Transfer" scope="public">
    <one-of>
      <item>Transfer to agent please</item>
      <item>Transfer</item>
    </one-of>
    <tag>$.Transfer = $recognized.text</tag>
  </rule>
</grammar>
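To make the rule-matching idea concrete, here is a minimal sketch in plain JavaScript of how the two alternatives in the Transfer rule accept or reject a caller's utterance. It does not describe how the Microsoft recognizer works internally. The names matchTransfer and normalize, the text normalization, and the shape of the result are all assumptions made for illustration.

```javascript
// Hypothetical illustration: the two <item> alternatives of the Transfer rule.
var TRANSFER_PHRASES = ["transfer to agent please", "transfer"];

// Lower-case the text, drop punctuation, and collapse runs of whitespace so
// that "Transfer." and "transfer" compare equal. (Assumed normalization.)
function normalize(text) {
  return String(text)
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .replace(/^ | $/g, "");
}

// Return { Transfer: <normalized text> } when the utterance equals one of the
// alternatives, loosely mirroring the <tag> in the grammar; otherwise null.
function matchTransfer(utterance) {
  var spoken = normalize(utterance);
  for (var i = 0; i < TRANSFER_PHRASES.length; i++) {
    if (spoken === TRANSFER_PHRASES[i]) {
      return { Transfer: spoken };
    }
  }
  return null; // no alternative matched
}
```

Under these assumptions, matchTransfer("I want an agent") returns null even though the caller's intent is clear, which is the cost of a strict grammar described earlier. Adding that phrase as a third alternative would accept it, at the price of a larger grammar.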
Because grammar files are simply XML files, grammars can also be created programmatically with the MS SASDK. The MS SASDK is based on SALT (Speech Application Language Tags), but you can build a speech-enabled application on the MS Speech platform without any SALT skills. If you prefer, you can also write SALT directly to implement a speech IVR on MS Speech Server.
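As a rough illustration of generating a grammar from code, the sketch below assembles an XML grammar string from a list of phrases in plain JavaScript. The article does not show the mechanism the SASDK itself uses, so this is only one possible approach. The function names buildGrammar and escapeXml are hypothetical, and the Microsoft-specific namespace and the <tag> element from the example above are deliberately left out.

```javascript
// Escape the characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build an XML grammar with one public rule whose alternatives are the given
// phrases. buildGrammar and its parameters are hypothetical names.
function buildGrammar(ruleId, phrases) {
  var lines = [
    '<grammar xml:lang="en-US" version="1.0" mode="voice"',
    '         xmlns="http://www.w3.org/2001/06/grammar">',
    '  <rule id="' + escapeXml(ruleId) + '" scope="public">',
    "    <one-of>"
  ];
  for (var i = 0; i < phrases.length; i++) {
    lines.push("      <item>" + escapeXml(phrases[i]) + "</item>");
  }
  lines.push("    </one-of>");
  lines.push("  </rule>");
  lines.push("</grammar>");
  return lines.join("\n");
}
```

Calling buildGrammar("Transfer", ["Transfer to agent please", "Transfer"]) yields a grammar with the same rule and alternatives as the hand-written example. Escaping matters as soon as phrases come from data, for example a department name that contains an ampersand.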
A prompt is a question or a piece of information spoken by a speech application. Typically, a prompt is a question, such as "To what extension do you want to transfer?" It can also be a greeting, such as "Hello, this is the ABC Corporation customer service line," or a menu of choices, such as "Sales, press one or say sales; Marketing, press two or say marketing; Technical Support, press three or say technical support."
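The menu prompt above pairs each choice with a keypad digit and a spoken word. The following plain JavaScript sketch shows one way such a menu could be kept in a single list, so that the prompt text and the accepted answers stay in step. It is not SASDK code, and the names DEPARTMENTS, DIGIT_WORDS, buildMenuPrompt, and resolveChoice are hypothetical.

```javascript
// Hypothetical menu definition: list order determines the keypad digit.
var DEPARTMENTS = ["Sales", "Marketing", "Technical Support"];
var DIGIT_WORDS = ["one", "two", "three", "four", "five",
                   "six", "seven", "eight", "nine"];

// Build the menu prompt text from the department list (up to nine entries).
function buildMenuPrompt(departments) {
  var parts = [];
  for (var i = 0; i < departments.length; i++) {
    var name = departments[i];
    parts.push(name + ", press " + DIGIT_WORDS[i] +
               " or say " + name.toLowerCase());
  }
  return parts.join("; ") + ".";
}

// Resolve a keypad digit ("1") or a spoken name ("sales") to a department,
// or return null when the input matches nothing on the menu.
function resolveChoice(input, departments) {
  var cleaned = String(input).toLowerCase().replace(/^\s+|\s+$/g, "");
  for (var i = 0; i < departments.length; i++) {
    if (cleaned === String(i + 1) ||
        cleaned === departments[i].toLowerCase()) {
      return departments[i];
    }
  }
  return null;
}
```

With the three departments listed, buildMenuPrompt(DEPARTMENTS) reproduces the menu wording quoted above, and resolveChoice accepts either "2" or "marketing" for the second entry.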
In an MS Speech Server-based speech-enabled application, prompts are the only interface through which a voice-only application (a speech-enabled IVR) interacts with the user. Multimodal applications typically do not use prompts. Prompts serve several purposes in an application, such as asking questions, greeting callers, and presenting menus. A prompt function contains JScript code that lets an application generate dynamic prompts at run time. Use the Prompt Function Editor to create and edit prompt functions.
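The sketch below gives a feel for what a dynamic prompt might look like in script. It is written as ordinary JavaScript and only illustrates the idea of computing prompt text at run time. The function name transferPrompt, its parameters, and the way the platform would call it are assumptions and are not taken from the SASDK.

```javascript
// Hypothetical prompt function: builds the text of a transfer prompt at run
// time. The parameter names and the calling convention are assumptions; only
// the idea of computing prompt text in script comes from the article.
function transferPrompt(callerName, failedAttempts) {
  // After a failed recognition, give the caller more explicit guidance.
  if (failedAttempts > 0) {
    return "Sorry, I did not understand. Please say the extension number " +
           "you want, or say transfer to agent.";
  }
  // Greet the caller by name only when a name is available.
  var greeting = callerName ? "Hello " + callerName + ". " : "";
  return greeting + "To what extension do you want to transfer?";
}
```

The design point is that the wording depends on dialog state (who is calling, how many attempts have failed) rather than being fixed text, which is the kind of case a prompt function is meant to handle.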
The Speech Prompt Editor, included in the Microsoft Speech Application SDK Version 1.0 (SASDK), provides an interface for creating prompts. Use it to create, edit, and manage every aspect of a prompt database. Each prompt project contains one or more prompt databases, and each prompt database contains all the audio and data that define the application's prompts: the recorded prompts (.wav files) and their transcriptions.
The Wave Editor is another useful tool in the MS SASDK; it is used to improve prompt quality. Because the prompt database stores prompts as .wav files, the Wave Editor displays a graphical view of the .wav file data and lets you edit the word boundaries within a .wav file and cut, copy, and paste wave segments both between and within .wav files.
When used in voice-only applications, the QA control, the Command control, and the Application Speech Controls can each include prompts, which are set through the control's properties. You can add a prompt to a control in one of two ways: as an inline prompt or through a prompt function.