Comparing Microsoft Speech Server 2004 and IBM WebSphere Voice Server V4.2

Speech Markup Language

The IBM WebSphere Voice Server is a VoiceXML 2.0-enabled speech environment. VoiceXML is aimed at developing telephony-based applications and brings the advantages of Web-based application delivery to IVR applications. Microsoft, in contrast, uses SALT 1.0 within MS Speech Server. SALT is a set of lightweight extensions to XML that add speech-enabled telephony to Web-based applications and bring them into a multimodal model. SALT targets speech-enabled applications across a range of devices, such as telephones, PDAs, tablet PCs, and desktop PCs. In short, VoiceXML focuses on telephony application development, whereas SALT focuses on multimodal speech applications that can be accessed from many kinds of devices. These differences should help you choose between the two for your own speech-enabled applications. Note that IBM has also begun to provide a multimodal toolkit in related products.
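To make the contrast concrete, the following Python sketch generates a minimal VoiceXML 2.0 menu and a minimal SALT fragment embedded in an HTML body. It is an illustration only: the file names, element ids, grammar file, and the `//city` binding path are hypothetical, and neither vendor's toolkit is involved.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"         # VoiceXML 2.0 namespace
SALT_NS = "http://www.saltforum.org/2002/SALT"  # SALT 1.0 namespace

ET.register_namespace("", VXML_NS)      # serialize VoiceXML with a default namespace
ET.register_namespace("salt", SALT_NS)  # serialize SALT tags with the salt: prefix


def build_vxml_menu(prompt_text, choices):
    """Return a standalone VoiceXML 2.0 document containing one spoken menu.

    `choices` maps the phrase a caller may say to the URI fetched next.
    """
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    menu = ET.SubElement(vxml, f"{{{VXML_NS}}}menu", {"id": "main"})
    ET.SubElement(menu, f"{{{VXML_NS}}}prompt").text = prompt_text
    for phrase, next_uri in choices.items():
        choice = ET.SubElement(menu, f"{{{VXML_NS}}}choice", {"next": next_uri})
        choice.text = phrase
    return ET.tostring(vxml, encoding="unicode")


def build_salt_field(field_id, prompt_text, grammar_src):
    """Return an HTML body in which SALT tags speech-enable a visual text field.

    Page script (not shown) would start the prompt and listen objects.
    """
    body = ET.Element("body")
    ET.SubElement(body, "input", {"type": "text", "id": field_id})
    prompt = ET.SubElement(body, f"{{{SALT_NS}}}prompt", {"id": f"ask_{field_id}"})
    prompt.text = prompt_text
    listen = ET.SubElement(body, f"{{{SALT_NS}}}listen", {"id": f"reco_{field_id}"})
    ET.SubElement(listen, f"{{{SALT_NS}}}grammar", {"src": grammar_src})
    # Copy the recognized value into the visual field (hypothetical result path).
    ET.SubElement(listen, f"{{{SALT_NS}}}bind",
                  {"targetelement": field_id, "value": "//city"})
    return ET.tostring(body, encoding="unicode")


vxml_doc = build_vxml_menu("Say balance or agent.",
                           {"balance": "balance.vxml", "agent": "agent.vxml"})
salt_page = build_salt_field("txtCity", "Which city?", "city.grxml")
print(vxml_doc)
print(salt_page)
```

The VoiceXML output is a complete voice-only document, while the SALT output is ordinary page markup with a few speech tags attached to an existing visual control, which is the multimodal emphasis described above.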

Framework and Programming

The Microsoft Speech Server and Speech SDK are based on the Microsoft .NET Framework. You have to install the .NET Framework and the ASP.NET Speech Controls modules on the speech server (SES/TAS) machines and the development machines as well as on the Web server. The MS Speech Application SDK integrates seamlessly with Visual Studio .NET 2003; when you install the SDK, all of its controls appear on the Visual Studio development environment toolbar. When programming under ASP.NET, you can code in C# or VB.NET on the server side and use JScript or VBScript on the client side. You can also use ADO.NET to implement database access and transactions.

The IBM WebSphere Voice Server and its SDK/Toolkit are members of the IBM WebSphere software family, a Web application environment based on the Sun Java platform. You have to install the Sun Java platform on both the Voice Server and the development environment. The IBM Voice Toolkit for WebSphere Studio helps developers create voice applications in less time by providing a VoiceXML application development environment that includes a VoiceXML editor, a grammar editor, and a pronunciation builder, and it allows application developers to add voice technology to middleware applications easily. You can use Java 2, Servlets, JSP, CGI, JDBC, JavaScript, and so forth to develop your speech application. The IBM WebSphere Voice Toolkit is integrated with the IBM WebSphere Studio development platform.

Components of Server and SDK

The main components of MS Speech Server are Speech Engine Services (SES) and Telephony Application Services (TAS). SES includes a Speech Recognition Engine that handles users' spoken input, a Prompt Engine that plays prerecorded prompts back to users, and a Text-to-Speech Engine that synthesizes audio output from a text string. TAS contains three components. The SALT Interpreter handles all the speech interface and presentation logic (input and output) and mediates between the speech application and the telephony components of the architecture. The Media and Speech Manager handles requests made by SALT Interpreters to SES for speech recognition and prompt playback, and manages the interface with the third-party Telephony Interface Manager (TIM) to deliver audio to and from the telephone user. The SALT Interpreter Controller manages the creation, deletion, and resetting of the multiple SALT Interpreter instances that manage dialogs with individual callers.

The MS Speech Application SDK provides ASP.NET Speech Controls, a Speech Control Editor, a Speech Grammar Editor, a Speech Prompt Editor, speech debugging tools (the Telephony Application Simulator, the Speech Debugging Console, and the Speech Debugging Console Log Player), the Speech Add-in for Microsoft Internet Explorer, a speech application deployment service, and a broad set of grammar libraries.

The IBM WebSphere Voice Server for Multiplatforms V4.2 includes a VoiceXML voice browser, the IBM Speech Recognition Engine, the IBM TTS Engine, a telephony and media component, and so forth. It can connect with many telephony platforms, including WebSphere Voice Response for AIX/Windows, Intel Dialogic, Cisco or Siemens HiPath, and Voice Server Speech Technologies for Windows and Linux.

The IBM WebSphere Voice Toolkit V4.2 integrates seamlessly with the IBM WebSphere Studio visual development environment. Its components include a VoiceXML editor, a grammar editor, a pronunciation builder, a CCXML editor, numerous grammar libraries, and Natural Language Understanding (NLU) model tools. The NLU tools help developers classify data for generating several statistical models, and they allow multiple developers to work with the same set of data simultaneously. The IBM WebSphere Voice Toolkit V4.2 also provides a telephony simulator for development and testing.

Telephony Interface—Hardware and Software

For connectivity into the enterprise telephony infrastructure and for call-control functionality, both IBM WebSphere Voice Server and MS Speech Server need telephony interface software and hardware. Intel Corp. and Intervoice Inc. each provide a Telephony Interface Manager (TIM) for Microsoft Speech Server; a TIM is a required component of any voice-only MS Speech Server solution, whereas multimodal applications do not require one. The Intel version of the TIM is known as Intel NetMerge Call Manager. The TIM works in conjunction with MS Speech Server, providing management and control over Intel Dialogic telephony resources. With the Call Manager software, developers can focus on speech application design and flow independent of the underlying telephony infrastructure. The TIM also provides fast and easy integration of the speech server with Intel NetStructure voice boards, enabling deployment of robust speech processing applications.

Currently, the Intel Call Manager and the Intervoice TIM support Intel Dialogic D/41JCT, DM/V480, and DM/V960 telephony hardware, ranging from 4 ports to 96 ports, for use with MS Speech Server.

The IBM WebSphere Voice Server provides a software telephony and media component that manages the telephony interface. It also provides a set of C APIs for integrating speech into a telephony platform. The IBM WebSphere Voice Server for Multiplatforms V4.2 can connect to many telephony platforms, including WebSphere Voice Response (formerly IBM DirectTalk), Intel Dialogic voice boards, Cisco and Siemens HiPath VoIP gateways, and Voice Server Speech Technologies for Windows and Linux. For VoIP, you need to install the H.323 telephony component on the voice server.

The IBM WebSphere Voice Server is scalable, ranging from basic analog telephony boards to high-density digital solutions with a T1/E1 interface, including the Intel Dialogic D/120JCT, D/240JCT, D/480JCT, D/300JCT, and D/600JCT. When integrated with IBM DirectTalk, it also supports CAS, ISDN, and SS7 signaling connections. The voice server also supports the Cisco 2600 Gateway with 2T1 or E1.

Call Controls

Both IBM WebSphere Voice Server and MS Speech Server provide simple call controls such as transferring, making, and answering calls. To implement more complex call control functionality, you have to use CCXML (supported by the CCXML editor) with IBM WebSphere Voice Server and the CSTA data extension controls with MS Speech Server.

CCXML, the Call Control eXtensible Markup Language, provides telephony call control that can be used with VoiceXML-based or SALT-based speech-enabled applications. CCXML provides the call management, event processing, conferencing, and similar functions that VoiceXML and SALT lack. Currently, MS Speech Server 2004 does not support CCXML.
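As a rough illustration of the event-driven call control that CCXML adds, this Python sketch generates a small CCXML document that answers an incoming call, starts a VoiceXML dialog, and exits on hang-up. The element and event names follow the W3C CCXML 1.0 specification and are an assumption here; the draft level supported by the Voice Toolkit V4.2 editor may differ. The dialog file name is hypothetical.

```python
import xml.etree.ElementTree as ET

CCXML_NS = "http://www.w3.org/2002/09/ccxml"  # namespace used by W3C CCXML 1.0


def q(name):
    """Qualify an element name with the CCXML namespace."""
    return f"{{{CCXML_NS}}}{name}"


def build_ccxml(dialog_uri):
    """Return a CCXML document with three event handlers for one inbound call."""
    ET.register_namespace("", CCXML_NS)
    root = ET.Element(q("ccxml"), {"version": "1.0"})
    processor = ET.SubElement(root, q("eventprocessor"))

    # The call is ringing: answer it.
    alerting = ET.SubElement(processor, q("transition"),
                             {"event": "connection.alerting"})
    ET.SubElement(alerting, q("accept"))

    # The call is answered: hand the caller to a VoiceXML dialog.
    # The src attribute is an ECMAScript expression, hence the inner quotes.
    connected = ET.SubElement(processor, q("transition"),
                              {"event": "connection.connected"})
    ET.SubElement(connected, q("dialogstart"), {"src": f"'{dialog_uri}'"})

    # The caller hung up: end the CCXML session.
    disconnected = ET.SubElement(processor, q("transition"),
                                 {"event": "connection.disconnected"})
    ET.SubElement(disconnected, q("exit"))

    return ET.tostring(root, encoding="unicode")


ccxml_doc = build_ccxml("menu.vxml")
print(ccxml_doc)
```

The dialog itself stays in VoiceXML; the CCXML document only decides what happens to the call around it.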

The CCXML editor of IBM WebSphere Voice Server extends the base XML editor in WebSphere Studio to provide a development tool for creating and modifying CCXML call control documents. The editor provides a set of functions similar to those of the VoiceXML editor (preference management, formatting, validation, and so forth), except that it is based on the proposed CCXML standard.

CSTA, Computer Supported Telecommunications Applications, is a set of API calls that provides an international standard interface between network servers and telephone switches; it was established by the European Computer Manufacturers Association (ECMA). In MS Speech Server, the SALT interpreter's CSTA data extension establishes a communication channel to the TIM for implementing call controls. Typically, the speech application makes requests to the TIM and the TIM responds to them. The <smex> element of SALT is used to exchange these messages: XML messages are sent to the TIM by using the sent property of smex and are received from the TIM through the onreceive event. Each XML message consists of CSTA XML service requests and events as defined in CSTA Phase III. In this way, CSTA-compatible call control functionality can be implemented.
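The request/response pattern described above can be sketched in Python. This is a simulation under stated assumptions, not the Speech Server API: `SmexChannel` and `fake_tim` are hypothetical stand-ins for the `<smex>` element and the TIM, the namespace URI is assumed and should be checked against the TIM documentation, the response is simplified, and the exchange is synchronous here whereas a real `<smex>` delivers replies asynchronously.

```python
import xml.etree.ElementTree as ET

# Assumed CSTA XML (ECMA-323) namespace; verify the exact URI for your TIM.
CSTA_NS = "http://www.ecma-international.org/standards/ecma-323/csta/ed3"


def q(name):
    """Qualify an element name with the CSTA namespace."""
    return f"{{{CSTA_NS}}}{name}"


def make_call_request(calling_device, called_number):
    """Build a CSTA MakeCall service request as an XML string."""
    request = ET.Element(q("MakeCall"))
    ET.SubElement(request, q("callingDevice")).text = calling_device
    ET.SubElement(request, q("calledDirectoryNumber")).text = called_number
    return ET.tostring(request, encoding="unicode")


def fake_tim(message):
    """Hypothetical TIM: answer a MakeCall with a simplified MakeCallResponse."""
    request = ET.fromstring(message)
    if request.tag != q("MakeCall"):
        raise ValueError("unsupported CSTA request")
    response = ET.Element(q("MakeCallResponse"))
    calling = ET.SubElement(response, q("callingDevice"))
    ET.SubElement(calling, q("callID")).text = "1"
    ET.SubElement(calling, q("deviceID")).text = request.findtext(q("callingDevice"))
    return ET.tostring(response, encoding="unicode")


class SmexChannel:
    """Hypothetical model of <smex>: assigning `sent` transmits a message,
    and each reply is stored in `received` and passed to `onreceive`."""

    def __init__(self, peer):
        self._peer = peer
        self._sent = None
        self.received = None
        self.onreceive = None

    @property
    def sent(self):
        return self._sent

    @sent.setter
    def sent(self, message):
        self._sent = message
        self.received = self._peer(message)
        if self.onreceive is not None:
            self.onreceive(self.received)


replies = []
channel = SmexChannel(fake_tim)
channel.onreceive = replies.append            # event handler collects replies
channel.sent = make_call_request("4501", "5551234")
print(replies[0])
```

The extension numbers are placeholders. The point is only the shape of the exchange: the application never calls the telephony hardware directly, it posts CSTA XML through the channel and reacts to what comes back.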

Deployment Environments

Both IBM WebSphere Voice Server and MS Speech Server can be deployed in either a standalone or an enterprise architecture. The choice depends entirely on your application architecture and requirements.

Integration with Third-Party CTI and CRM

Neither IBM WebSphere Voice Server nor MS Speech Server provides CTI support directly, but both can be integrated with third-party CTI software such as Intel's NetMerge CPS (formerly CT Connect), Genesys CTI, and Cisco ICM CTI. You can implement and customize many CTI features, such as call routing, softphone, callback, screen pop, web chat, outbound calling, conferencing, and the like. Both can also be integrated easily with CRM platforms such as Siebel, PeopleSoft, MS CRM, SAP, and Oracle CRM.

OS Platform and Speech Recognition Language Support

The IBM WebSphere Voice Server V4.2 runs on the AIX, Windows, and Linux platforms, and supports a different set of languages on each. On AIX, it supports the most languages: Brazilian Portuguese, Canadian French, Cantonese, Dutch, French, German, Italian, Japanese, Korean, Simplified Chinese, Spanish, UK English, and US English. On Windows, it supports a smaller set, such as Australian English, Brazilian, French, Portuguese, Spanish, UK English, and US English. On Linux, it supports German and US English only.

So far, MS Speech Server 2004 runs only on the Windows 2000/XP/2003 platform and supports only US English for speech recognition. Multi-language support is still in beta.
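The support matrix described in this section can be captured as a small lookup table. The Python sketch below simply transcribes the lists given in the text (entries are copied as the article lists them, and "Windows" stands for Windows 2000/XP/2003 in the Microsoft entry); the function name is illustrative.

```python
# Speech recognition languages per product and OS, as listed in this article.
SPEECH_LANGUAGES = {
    "IBM WebSphere Voice Server V4.2": {
        "AIX": [
            "Brazilian Portuguese", "Canadian French", "Cantonese", "Dutch",
            "French", "German", "Italian", "Japanese", "Korean",
            "Simplified Chinese", "Spanish", "UK English", "US English",
        ],
        # Copied verbatim from the article's Windows list.
        "Windows": [
            "Australian English", "Brazilian", "French", "Portuguese",
            "Spanish", "UK English", "US English",
        ],
        "Linux": ["German", "US English"],
    },
    "MS Speech Server 2004": {
        "Windows": ["US English"],
    },
}


def platforms_for(language):
    """Return sorted (product, platform) pairs that list the given language."""
    return sorted(
        (product, platform)
        for product, platforms in SPEECH_LANGUAGES.items()
        for platform, languages in platforms.items()
        if language in languages
    )


for product, platform in platforms_for("German"):
    print(f"{product} on {platform}")
```

For example, `platforms_for("German")` returns only the AIX and Linux editions of the IBM server, which shows at a glance that a German-language deployment rules out MS Speech Server 2004 as described here.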


In the preceding sections, we described and compared the features of IBM WebSphere Voice Server and MS Speech Server. This comparison should help you make a suitable decision when you want to develop and deploy speech-enabled applications.

About the Author

Xiaole Song is a professional in designing, integrating, and consulting on CTI, contact center, IVR, IP telephony, CRM, and speech applications. He has performed various roles for Intel, Dialogic, Minacs, and so forth. Feel free to e-mail any comments about this article or consulting services to