SpeechObjects are a set of open, reusable components that encapsulate frequently used functionality in speech applications. Components aren’t new to the world of application development: depending on your development environment, you might use EJB, COM or Microsoft .NET components as part of your application. The objective of SpeechObjects is to bring the same kind of component reuse, along with an object-oriented methodology, to speech developers. SpeechObjects, which have been defined by Nuance Communications, can be used within a VoiceXML application through the <object> and <param> tags.
BeVocal Cafe supports the Nuance SpeechObjects methodology and allows developers to reuse a number of SpeechObjects developed by Nuance. In addition, Cafe includes a small set of speech objects that are specific to the Cafe environment and can be used by VoiceXML developers. The table below lists the SpeechObjects supported by Cafe, described by the information each one collects:
|An alphanumeric string
|Select an item by reading a sequence of items to the caller
|Credit-card related information
|A 10-digit telephone number
|Quantity (e.g. twenty two)
|A sectioned/delimited string
|A fixed-length digit string
|A time expression
|Amount in dollars/cents
|U.S. 5/9 digit postal code
|City and state
|A street in a particular city/state
|A street number in a particular street/city/state
To illustrate the value of SpeechObjects, let’s look at an example. The VoiceXML code snippet below shows a simple stock trading application prototype that recognizes an equity or index name and returns the name of the equity. The benefit of SpeechObjects is clear from how little code is needed to achieve this. If the same functionality were coded in plain VoiceXML, the developer would need to create a fairly complex grammar covering all the equities traded on the stock exchange.
Whether you develop a client-server, web, wireless or speech application, security is always a concern. A key aspect of application security is authentication: the mechanism that allows an application to recognize a valid user. In traditional web applications, authentication is typically handled through a combination of user ID and password. Some more secure web applications also allow the user to present a digital certificate as an authentication token. In speech applications, authentication is typically managed through a combination of a PIN (personal identification number), full names (as cryptic user IDs can be hard to recognize), account numbers and/or telephone numbers. For instance, a typical authentication dialog for a speech application would be something like:
"Please say or enter your account number" followed by
"Please say or enter your PIN."
This would allow the application to authenticate the user. Speech-based applications, however, also allow a different form of authentication: the user’s speech itself. Just as a fingerprint serves as a token of identity for a person, a user’s natural speech can be captured as a Voice Print that is later used to recognize the user. Currently, VoiceXML doesn’t include built-in support for Voice Print technologies; however, several vendors such as Nuance and SpeechWorks have built speaker verification products into their core recognition technologies. Cafe provides Voice Print-based speaker verification to VoiceXML developers through two tags: <register> and <verify>. As the names suggest, the <register> tag registers a user’s Voice Print with an application, whereas the <verify> tag verifies a caller against that same Voice Print. Both tags share a common identifier, the "key expression," which is used to store and retrieve the Voice Print.
The listing below shows how the <register> tag can be used.
Now that the Voice Print has been registered, the <verify> tag can be used to authenticate a user.
A fingerprint, hardware token, digital certificate or any other such mechanism requires changes to the way users communicate with a web application. In a speech application, authentication naturally “fits in” with the user interface (which is speech) and doesn’t require the user to adapt to any external identification mechanism. In fact, in a number of scenarios, interactive speech applications could be the de facto choice because of the additional security that speech verification makes possible.
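The contract between registration and verification, with a shared key tying the two together, can be modeled in a few lines of Python. This is a toy sketch only: it assumes a voice sample has already been reduced to a list of numeric features, and the class name `VoicePrintStore`, the cosine-similarity comparison and the 0.9 threshold are all hypothetical. It does not describe how Cafe or any recognition vendor actually builds or compares Voice Prints.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoicePrintStore:
    """Toy in-memory model of a register/verify pair sharing one key."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # hypothetical similarity cutoff
        self._prints = {}           # key -> stored feature vector

    def register(self, key, samples):
        # Average the enrollment samples component-wise into one stored print.
        count = len(samples)
        self._prints[key] = [sum(column) / count for column in zip(*samples)]

    def verify(self, key, sample):
        # An unknown key can never verify; otherwise compare to the stored print.
        stored = self._prints.get(key)
        if stored is None:
            return False
        return _cosine(stored, sample) >= self.threshold
```

The point of the sketch is the role of the key: whatever value is used at registration time (an account number, for example) must be reproduced exactly at verification time, or the stored print cannot be found.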
Typically, in a VoiceXML-based application, grammars are built from rules defined by the application, and these rules are in turn based on grammar constructs and phrases. Some applications, such as a dynamically maintained address book, require the application to recognize entries (or at least part of an entry) that cannot be known in advance. One of the features recently added to Cafe is known as "voice enrollment," an extension to VoiceXML created for this purpose. Enrollment works as a basic two-step process. First, the user records prompts in his/her own voice for a grammar. Each prompt is assigned a value/expression, which is returned when that prompt is later recognized as part of the grammar. The listing below shows a simple enrollment process.
Once enrolled, the recorded prompts can be recognized as a grammar, as specified by the enroll tag. It is important to understand that voice enrollment works only for individually assigned callers, as identified by the speakerID attribute. In most scenarios, a VoiceXML application caters to multiple telephone users, so it is crucial that the dynamic server-side part of the application assigns a unique speakerID to each speaker who uses voice enrollment.
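One way a server-side application might assign such identifiers is to derive them deterministically from an account identifier it already holds, so the same caller always maps to the same value. The sketch below is an assumption-laden illustration: the function name `speaker_id_for`, the namespace URL and the 32-character hexadecimal format are all hypothetical, and the format Cafe actually accepts for speakerID is not stated here.

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID would do.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/address-book")


def speaker_id_for(account_id: str) -> str:
    """Derive a stable, unique speaker identifier from an account identifier.

    The same account always yields the same identifier, and different
    accounts yield different identifiers (barring hash collisions).
    """
    return uuid.uuid5(APP_NAMESPACE, account_id).hex
```

The server would then write the returned value into the speakerID attribute when it generates the VoiceXML page for that caller. Deriving the value rather than storing a counter avoids keeping extra state, at the cost of tying the identifier to the account identifier chosen.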
Voice Enrollment is currently not part of the VoiceXML specification and is a BeVocal Cafe specific extension.
If you have stayed in a hotel and asked the operator to set up a "wake-up call," the call is typically set up automatically. The hotel system usually has some sort of automated interface that takes the time of the call and the room number as input, and as a result you get a call in your room at the preset time. "Voice Alerts," or the "outbound VoiceXML calling interface," are similar to the wake-up call paradigm. In the VoiceXML-based scenario, a "Voice Alert" application initiates a call to a particular phone number and, instead of playing a static recorded message, connects the user to a dynamic VoiceXML application. Consider a stock trading application in which the user needs to be notified if any stock in his/her portfolio goes up or down by a certain percentage. An outbound calling VoiceXML application could notify the user of the event and also facilitate any related transaction (such as selling or buying 100 shares).
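The triggering condition in the stock example comes down to a percentage-change check on the server side. The following is a minimal sketch; the function name `should_alert` and its parameters are hypothetical, and where the prices come from is left to the application.

```python
def should_alert(reference_price: float, current_price: float,
                 threshold_pct: float) -> bool:
    """Return True when the price has moved up or down by at least threshold_pct."""
    if reference_price <= 0:
        raise ValueError("reference_price must be positive")
    change_pct = (current_price - reference_price) / reference_price * 100.0
    # abs() makes the check symmetric: rises and falls both trigger an alert.
    return abs(change_pct) >= threshold_pct
```

For example, with a reference price of 100 and a threshold of 5 percent, a current price of 106 triggers an alert while a current price of 97 does not.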
BeVocal Cafe provides the outbound VoiceXML interface and allows developers to build event-driven interactive applications. To create a Voice Alert and connect it to a particular phone number and VoiceXML application, Cafe provides a simple HTTP-based interface that initiates the outbound process. The interface takes three main parameters as input: dest, the destination phone number; vxml, the URL of the VoiceXML application; and key, an authentication token that validates the identity of the application invoking the interface. To use the Cafe outbound service, you need to contact firstname.lastname@example.org to get a valid key for your scenario.
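Since the interface is HTTP-based and takes named parameters, a request could be assembled roughly as follows. Only the parameter names dest, vxml and key come from the description above; the endpoint URL is a placeholder, the function name `build_outbound_url` is hypothetical, and passing the parameters in the query string is an assumption about how the service expects them.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real Cafe outbound URL is not given here.
OUTBOUND_ENDPOINT = "https://cafe.example.com/outbound"


def build_outbound_url(dest: str, vxml: str, key: str,
                       endpoint: str = OUTBOUND_ENDPOINT) -> str:
    """Build a request URL carrying the three outbound-call parameters.

    dest: destination phone number
    vxml: URL of the VoiceXML application to connect the callee to
    key:  authentication token issued for the calling application
    """
    # urlencode percent-escapes reserved characters, which matters because
    # the vxml parameter is itself a URL.
    query = urlencode({"dest": dest, "vxml": vxml, "key": key})
    return f"{endpoint}?{query}"
```

An event handler, such as the stock alert described earlier, would build this URL when its condition fires and then issue the HTTP request with whatever client the server-side platform provides.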
In a nutshell, Cafe provides a feature-rich environment for speech application developers: it supports multiple grammar formats, includes a set of SpeechObjects, and offers supporting functionality such as voice enrollment, speech verification and outbound calling. From an execution perspective, Cafe provides debugging and simulation tools such as Vocal Debugger, Vocal Player and Vocal Scripter. As part of an overall VoiceXML development strategy, however, BeVocal Cafe lacks tools for actually constructing the VoiceXML application. My picks for future enhancements to Cafe would include a standalone grammar debugging tool and a development-focused environment (desktop or remote) that would jumpstart VoiceXML application development by providing code generation wizards.
About Hitesh Seth
Hitesh Seth is Chief Technology Evangelist for Silverline Technologies, a global eBusiness and mobile solutions consulting and integration services firm. He is a columnist on VoiceXML technology in XML Journal and regularly writes for other technology publications, including Java Developers Journal and Web Services Journal, on topics such as J2EE, Microsoft .NET, XML, Wireless Computing, Speech Applications, Web Services & Integration. Hitesh received his Bachelor’s degree from the Indian Institute of Technology Kanpur (IITK), India. Feel free to email any comments or suggestions about the articles featured in this column to email@example.com.