Introduction to SALT (Part 3): Applying SALT


So far in our SALT (Speech Application Language Tags) series we have
learned about the syntax of key SALT elements and their usage. We have also seen a
preview of some of the applications that can be developed using SALT. In this article, "Applying SALT", we are going to go back to the
drawing-board and learn about the various elements and architecture of a SALT-based Speech Solution.

Multimodal & Telephony

If we look at IVR (Interactive Voice Response) touch-tone systems and telephony-based speech applications, the majority of them work using speech or touch-tone input and prerecorded or synthesized speech output. What we are really using here is a single modality, "speech" (either as both input and output in the case of interactive speech recognition, or touch-tone input with speech output in the case of touch-tone style applications). Multimodality is where we can use more than one mode of interface with the application, much like our normal human communication with each other.

For instance, consider an application which gives us driving directions. While it is typically easier to speak the start and destination addresses aloud (or, better yet, use shortcuts such as "my home", "my office", or "my doctor's office", based on a previously established profile), the directions themselves are best presented visually, as a map plus a turn-by-turn summary, something similar to what we are used to seeing at MapQuest's web site.

In essence, a multimodal application, when executed on a desktop device, would be very similar to MapQuest but would also let the user talk and listen to the system for parts of the application's input and output; the starting and destination addresses, for instance, could be entered by voice. Now imagine this same application, with the same interface, on a wireless PDA, and we are talking about a true mobile, multimodal application. If we let our imaginations go a little wilder, we could easily extend the same application to the dashboard of our car or any other device we can imagine working with… that's really the vision, which given the current state of technology isn't far away. Yet another modality we could add to the example application is a pointing device, used to zoom the map in on a particular location.

So how does SALT fit in with all of this? Well, SALT has been built
upon the technology that is required for applications to be deployed in a
telephony and/or multimodal context.
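
To ground this, recall from the earlier parts of this series that a SALT page is just ordinary HTML extended with a handful of speech tags. A minimal sketch of a voice-only dialog turn might look like the following (the grammar file name and element ids are placeholders used for illustration):

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body onload="askAddress.Start()">

  <!-- speech output (TTS): ask the question, then start listening -->
  <salt:prompt id="askAddress" oncomplete="recoAddress.Start()">
    What is your starting address?
  </salt:prompt>

  <!-- speech input (ASR): recognize against a grammar -->
  <salt:listen id="recoAddress">
    <salt:grammar src="addresses.grxml" />
  </salt:listen>

</body>
</html>
```

The same markup can be interpreted by a desktop SALT browser or by a server-side SALT/telephony platform, which is what makes the single authoring model work across both contexts.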

SALT Architecture

Let's say we are all set to go and implement our next-generation, interactive, speech-driven SALT-based application. How should the architecture be designed? As the diagram below shows, the architecture for deploying SALT-based applications is similar to that of a web application, with two major differences: the web application is also capable of delivering SALT-based dynamic speech applications (provided the browser can handle SALT, e.g. through an add-on or natively), and there is an additional stack representing a set of technologies that broadly integrate speech recognition/synthesis and telephony platforms.

Note: this diagram is really a conceptual representation; where the SALT browser/interpreter and speech recognition/synthesis components specifically fit depends on the capabilities of the end-user device/browser, and the actual implementation of the SALT stack varies from vendor to vendor.

The speech recognition component (popularly referred to as Automatic Speech Recognition, or ASR) is focused on recognizing spoken user utterances and matching them against a list of possible interpretations using a specified grammar. The speech synthesis component (popularly referred to as Text-to-Speech, or TTS) is focused on dynamically converting text messages into voice output.
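
In SALT terms, the grammar that drives ASR can be supplied inline using the W3C grammar (SRGS) format, while TTS output is simply the content of a prompt element. A small sketch, assuming the surrounding SALT page is already in place:

```html
<!-- ASR: constrain recognition to a handful of utterances -->
<salt:listen id="recoConfirm">
  <salt:grammar>
    <grammar xmlns="http://www.w3.org/2001/06/grammar" root="confirm">
      <rule id="confirm">
        <one-of>
          <item>yes</item>
          <item>no</item>
        </one-of>
      </rule>
    </grammar>
  </salt:grammar>
</salt:listen>

<!-- TTS: the text content is synthesized into voice output -->
<salt:prompt id="sayResult">
  Your directions are ready.
</salt:prompt>
```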

The telephony integration component is focused on connecting the speech platform with the world of telephones, the Public Switched Telephone Network (PSTN). This is typically achieved using telephony cards from vendors such as Intel/Dialogic, connected via analog/digital telephony lines to your telephony provider (i.e. your phone company).

When multimodality is used, the regular web application delivery framework (based on TCP/IP, HTTP, HTML, JavaScript, etc.) delivers the web application, while the speech/telephony platform handles the "speech/voice" aspect of the interaction, depending on the nature of the connection and the location of the speech recognition/synthesis components. Of course, both of these interactions can happen together seamlessly, as part of the same user session, depending on the user's choice.
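
On a SALT-capable browser, this mixing of modalities can be as simple as wiring a listen element to a visible form field, the "tap-and-talk" pattern. In the sketch below, the grammar file and the XPath into the recognition result are illustrative placeholders:

```html
<!-- visual (GUI) modality: an ordinary HTML form field -->
<input name="txtStart" type="text" />
<input type="button" value="Speak" onclick="recoStart.Start()" />

<!-- voice modality: recognize and copy the result into the field -->
<salt:listen id="recoStart">
  <salt:grammar src="addresses.grxml" />
  <salt:bind targetelement="txtStart" value="//address" />
</salt:listen>
```

The user can either type into the field or tap the button and speak; either way the application sees the same form value.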

.NET Speech SDK

You might be wondering where the .NET Speech SDK fits in. The current preview, available from Microsoft's site, has two components: (a) an add-in for Microsoft Internet Explorer which recognizes SALT tags and allows the user to interact with the application using the desktop's microphone and speakers/headphones, and (b) a set of ASP.NET-based speech controls which allow developers using Microsoft Visual Studio .NET to create multimodal/telephony applications and/or add speech interactivity to existing web applications developed using the Microsoft .NET and ASP.NET frameworks.

I would like to point out that it is quite possible for a SALT-based application to be delivered using a non-ASP.NET web application framework (e.g. Perl or JavaServer Pages). What the .NET Speech SDK really provides is ease of development in adding speech to your existing web applications or creating new ones.

To be Continued

We will continue our exploration of SALT in the next article, where we will actually start developing a SALT-based multimodal and telephony application using the Microsoft .NET Speech SDK, an extension to Microsoft Visual Studio .NET focused on building dynamic speech applications based on the SALT specification. You might want to prepare by ordering the .NET Speech SDK Beta from the Microsoft site (a link is provided below).

Resources

About Hitesh Seth

A freelance author and known speaker, Hitesh is a columnist on VoiceXML technology in XML Journal and regularly writes for other technology publications on emerging technology topics such as J2EE, Microsoft .NET, XML, wireless computing, speech applications, web services and enterprise/B2B integration. He is the conference chair for the VoiceXML Planet Conference & Expo. Hitesh received his bachelor's degree from the Indian Institute of Technology Kanpur (IITK), India. Feel free to email any comments or suggestions about the articles featured in this column at [email protected].
