Speech Authentication Strategies, Risk Mitigation, and Business Metrics

The telephone was invented more than 150 years ago, and continues to be a very important means for us to communicate with each other. The Web, by comparison, is very recent, but has rapidly become a competing communications channel. The convergence of telecommunications and the Web is now bringing the benefits of Web technology to the telephone, enabling Web developers to create applications that can be accessed via any telephone, and allowing people to interact with these applications via speech and telephone keypads; that is, via voice browsers. See Appendix 1 for a brief discussion of voice browsers, including a toll-free telephone number where you can try one out for yourself.

Economic benefits are driving the current high level of interest in the use of voice authentication to secure these converged applications. This technology is saving companies a good deal of money when, for example, call center agents don’t have to handle PIN reset requests or ask customers questions to confirm their identities. Password resets are the second most common reason workers call help desks, accounting for about one in four help desk requests, according to the Gartner Group, an IT research company. At an average cost of $22 USD per call, according to Gartner, that adds up fast, especially for large-scale enterprises and midsize organizations.

Now, applications can use the user’s voice itself for authentication. The problem is choosing the level of security that best fits your needs. This article will examine many of the performance and cost vs. level-of-security issues that you should consider before making this choice. And, as with any kind of application, this choice should be made after executive-level decision makers are polled on the worth to the organization of the information to be exposed by the application.

First, I’ll look at how voice authentication compares to traditional methods as well as to other biometric methods of authentication. Then, with many of the shortcomings of voice authentication in full view, I’ll outline some of the rather compelling arguments behind its widespread adoption.

I’ll conclude with a brief discussion of three offshoots of basic speech authentication:

  1. How Nuance’s Voice Platform, a complete VoiceXML enterprise speech solution, provides real-time and other reports for the determination of system performance, ROI, and so forth.
  2. How the RSA division of EMC, a pioneering and still solid purveyor of security technology, is using voice authentication, in combination with other security measures, for the mitigation of risk.
  3. How some investigators are combining speech authentication, business process management (BPM) and Service Oriented Architecture (SOA).


When passwords are guessed or stolen, you usually aren’t aware of the loss at the time. On the other hand, with smart cards and the like, the likelihood of a scam is reduced. To break into an account, someone would have to steal your smart card, but then you’d be more likely to know of the loss in time to do something about it.

Smart cards (plastic cards that carry password and identity data for digital authentication) and biometrics (identification by voice, fingerprint, iris, or facial features) have taken off, but most people still rely on passwords to gain access to bank accounts and computer databases.

Biometrics have an advantage over passwords and tokens in that they can’t be forgotten, although they too can be lost. (People can lose fingers in an accident, or temporarily lose their voices due to illness.) But, biometrics can’t be changed. If someone loses a key or an access code, it’s easy to change the lock or combination and regain security.

So, if someone steals your biometric—perhaps by surreptitiously recording your voice or copying the database with your electronic iris scan—you could be stuck. Your iris is your iris, period. The problem is, while a biometric might be a unique identifier, it is not a secret. You leave a fingerprint on everything you touch, and someone can easily photograph your face.

Within the aforesaid glut of options, each with its own obvious shortcomings, there are some very compelling arguments that favor voice authentication for certain applications.

In large-scale, mainstream consumer applications such as banking, brokerage, or telecommunication services, having an authentication method that can be used by all customers from their home, office or car is a critical requirement. One of the main problems with finger, iris, or facial biometrics is the inconvenience and significant investment required for scanners and other hardware devices.

Because voice authentication is completed over a telephone with no additional hardware or software required for the end user, it is the only biometric that can be implemented today for an entire customer base. Voice authentication can, of course, also be established by a user and his or her speech-enabled PC or handheld device.

Voice authentication is based on an analysis of the vibrations created in the human vocal tract. The shape of a person’s vocal tract determines the timbre and resonance of the voice, and everyone’s vocal tract is fairly distinct in shape and size. Thus, just as a B flat sounds different on a French horn than on a piano, the same sounds come out differently from different vocal tracts.

Critics of voice authentication point out that identical twins may pass for each other, but in most cases, impostors fail. The most variable factor in many voice authentication systems is the quality of the microphone and phone line.

Comparative Security

According to some university-based studies, a voice authentication solution can be set up to offer a false accept rate of less than 0.1% (in other words, fewer than 1 in 1,000 impostor attempts breaks into the system) even when the impostor has the correct password information. In a conventional password system, given that an impostor has the correct password, the impostor has a 100% chance of breaking into the system. Voice authentication, therefore, offers around 1,000 times more security than a conventional password system … in this particular comparison.
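
The arithmetic behind that comparison can be checked with a minimal Python sketch. The function names and the figure of 10,000 attempts are hypothetical, chosen only for illustration; the two rates are the ones quoted above, and the result holds only for the case where the impostor already knows the password.

```python
# Illustrative only: the two rates come from the comparison above; the
# attempt count and the function names are assumptions for this sketch.

PASSWORD_ONLY_FAR = 1.0   # an impostor who knows the password always gets in
VOICE_FAR = 0.001         # 0.1% false accept rate with voice verification


def expected_break_ins(attempts: int, false_accept_rate: float) -> float:
    """Expected number of impostor attempts that succeed."""
    return attempts * false_accept_rate


def security_ratio(baseline_far: float, improved_far: float) -> float:
    """How many times fewer impostors get through with the improved method."""
    return baseline_far / improved_far


if __name__ == "__main__":
    attempts = 10_000  # hypothetical number of impostor attempts
    print("password only:", expected_break_ins(attempts, PASSWORD_ONLY_FAR))
    print("with voice:   ", expected_break_ins(attempts, VOICE_FAR))
    print("ratio:        ", round(security_ratio(PASSWORD_ONLY_FAR, VOICE_FAR)))
```

The ratio says nothing about impostors who do not know the password, which is why the comparison is a narrow one.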

Although now a few years out of date, a fairly rigorous scientific evaluation conducted on speaker verification technologies on behalf of the Australian Government is available at Reference 2.


During a brief enrollment process, the typical voice-authentication package might require that callers speak their ID and password. Speaker verification systems capture and analyze the speech to create a voiceprint that is stored in the system database. Voiceprints are not audio samples, but a matrix of numbers that measures behavioral characteristics of the way the person speaks, as well as physical characteristics of the person’s vocal tract. During verification, the caller speaks a password, which is then compared and scored against the voiceprint database.
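
To make the enroll-then-score idea concrete, here is a minimal Python sketch. It is not how any commercial engine works: it assumes that some upstream step (not shown) has already turned each utterance into a list of numbers, it builds the voiceprint by simple averaging, and it scores with cosine similarity. The function names and the 0.95 threshold are hypothetical.

```python
import math
from typing import List, Sequence

# Assumption: each utterance has already been converted to a feature
# vector by a front end that is outside the scope of this sketch.


def enroll(samples: Sequence[Sequence[float]]) -> List[float]:
    """Build a voiceprint by averaging the enrollment feature vectors."""
    if not samples:
        raise ValueError("at least one enrollment sample is required")
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(voiceprint: Sequence[float], utterance: Sequence[float],
           threshold: float = 0.95) -> bool:
    """Accept the caller if the utterance scores at or above the threshold."""
    return cosine_similarity(voiceprint, utterance) >= threshold
```

Raising the threshold lowers the false accept rate at the cost of rejecting more genuine callers, which is the security versus convenience trade-off this article keeps returning to.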

Nuance speaker verification products, regarded by many industry analysts as the best, perform well even with transient voice changes (caused by colds, different emotional states, or background noise) and deliver consistent performance over wireline, wireless, or VoIP channels. And, these products also offer safeguards against recordings played over the telephone, or impressionists mimicking a user’s voice.

Nuance’s SpeechSecure can be used to deliver personalized service by verifying a caller based on an initial greeting. For example, if a caller says, “Hello, it’s me,” the system might identify the caller, greet her by name, and then deliver customized or preferred services without further prompting.

And, SpeechSecure can use secret pass phrases to deliver security. Pass phrases are user-chosen phrases, in any language, that the caller must repeat to gain access to information. In this way, SpeechSecure doesn’t simply calculate voice similarity; it also ensures that the caller knows what to say.

Choice of Verification Strategies—Adaptation

SpeechSecure supports multiple verification strategies, allowing developers to select their preferred balance between call security and transaction speed. This authentication package is now delivered with voice model adaptation where voiceprint models are optimized in deployment using data from successful verifications, enabling the application to adapt to possible changes in users’ voices over long periods of time.
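
One simple way to picture voice model adaptation is an exponential moving average that nudges the stored voiceprint toward each successfully verified utterance. The Python sketch below is an assumption made for illustration, not the vendor's algorithm; the function name `adapt` and the `rate` parameter are hypothetical.

```python
from typing import List, Sequence


def adapt(voiceprint: Sequence[float], utterance: Sequence[float],
          accepted: bool, rate: float = 0.1) -> List[float]:
    """Return an updated voiceprint.

    Only utterances that passed verification are used, so a rejected
    (possibly fraudulent) attempt never changes the stored model.
    """
    if not accepted:
        return list(voiceprint)
    return [(1.0 - rate) * old + rate * new
            for old, new in zip(voiceprint, utterance)]
```

A small `rate` lets the model follow slow changes in a caller's voice while limiting how far any single call can move it.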


SpeechSecure is language independent and does not require a grammar, allowing callers to speak any pass phrase in any language they choose—in fact, the pass phrase doesn’t even need to be a real word.

One-step Verification

The caller is prompted for a piece of known data (for example, his account number). Speech recognition and verification are performed on the utterance. One-step verification is more convenient than having the user separately identify himself through an account number and then verify his identity through verification of a separate pass phrase. However, because the account number that is uttered is not chosen by the caller—and therefore is not secret—the method is less secure than two-step verification.

Two-Step Verification

Two-step verification operates on two separate passwords. First, the one-step process described above is carried out (in other words, the user speaks her account number, and that is passed through recognition and verification processes). Then, the user is prompted for an additional secret password that she has defined. Two-step verification is more secure than one-step verification because the confidence from verification of two separate passwords is better than that from one password, plus the system is testing the caller’s knowledge as well as the voiceprint.
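
The difference between the one-step and two-step strategies can be sketched in a few lines of Python. Everything here is hypothetical: the `Utterance` type, the `enrolled` dictionary, and the 0.8 threshold are invented for illustration, and a real deployment would not compare or store pass phrases as plain text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Utterance:
    text: str           # what the speech recognizer heard
    voice_score: float  # similarity to the enrolled voiceprint, 0.0 to 1.0


def one_step(account: Utterance, enrolled: Dict[str, str],
             threshold: float = 0.8) -> bool:
    """Recognize the account number and verify the voice that spoke it."""
    return account.text in enrolled and account.voice_score >= threshold


def two_step(account: Utterance, phrase: Utterance, enrolled: Dict[str, str],
             threshold: float = 0.8) -> bool:
    """Run the one-step check, then test a secret phrase and its voice score."""
    if not one_step(account, enrolled, threshold):
        return False
    expected_phrase = enrolled[account.text]
    return phrase.text == expected_phrase and phrase.voice_score >= threshold
```

The second step adds both a second voice comparison and a knowledge test, which is where the extra security comes from.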

Two-Step Verification with Random Challenge

This is two-step verification where the second password is not defined by the caller; rather, it is a random phrase generated by the application or the developer. This could be a random string of digits or one of several enrolled secret phrases. This method usually requires the use of speech recognition in tandem with verification to verify that the correct phrase or digit string has been uttered. One potential drawback of the digits approach is that it does not require the caller to know a secret password. However, it does protect the system from so-called tape recorder attacks (where an impostor somehow captures the real user’s password on tape and plays it over the phone) by ensuring that the user is talking live on the phone.
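
A random digit challenge might look like the following Python sketch. The function names, the six-digit length, and the 0.8 threshold are assumptions; the recognized text and voice score are taken as inputs from a recognizer and verifier that are not shown.

```python
import secrets


def make_challenge(length: int = 6) -> str:
    """Generate an unpredictable digit string for the caller to repeat."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def check_response(challenge: str, recognized_text: str, voice_score: float,
                   threshold: float = 0.8) -> bool:
    """Accept only if the right digits were spoken in the right voice.

    A recording of an earlier call fails the first test, because the
    digits it contains will not match a freshly generated challenge.
    """
    spoken_digits = "".join(ch for ch in recognized_text if ch.isdigit())
    return spoken_digits == challenge and voice_score >= threshold
```

The standard-library `secrets` module is used instead of `random` so that the challenge cannot be predicted from earlier ones.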

Text-Independent Verification

Text-dependent speaker verification requires that the same password used for enrollment be used for verification. Text-independent speaker verification places no constraints on the verification utterance and verifies or rejects the caller regardless of what they say or which language they use. Text-independent speaker verification requires more speech data than text-dependent speaker verification. However, it requires less cooperation on the part of users, so it is useful for unobtrusive verification of repeat callers.


You can incorporate speaker verification into any system architecture using Web services-based integration because the SpeechSecure solution, available via a Web Services interface, includes:

  • SpeechSecure Authentication Engine: Biometric software that identifies callers based on unique voiceprints.
  • SpeechSecure Server: Web services for use with any speech application platform incorporating voiceprint database management.

Thus, a secure VoiceXML application using speech Web services, in which a user can speak a free sentence (in English, for instance) and receive a French translation on the same modality (phone) or on another one (a PC screen, for example), would be feasible in today’s enterprise IT ecosystem. See Reference 4 for a discussion of such an application and Appendix 3 for more on Business Process Management (BPM) and Service Oriented Architecture (SOA).

Identification, Authentication, and Authorization

Overall security involves identification, authentication, and authorization. Here’s the shorthand guide:

Identification: Who are you?

Authentication: Prove it.

Authorization: Here is what you are allowed to do.

The three concepts are closely related, but in a security system it’s critical that you tell them apart. Conflating the three (running them together, failing to distinguish each from the others) can lead to serious security problems. Fortunately, servers such as Nuance’s Verifier, which integrates with the company’s speech recognition system, can recognize callers, authenticate them, and give them the appropriate access. More on this subject is available at Reference 3.
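
One way to keep the three concepts apart in code is to give each its own function and have the request handler call them in order. The Python sketch below is hypothetical throughout: the `USERS` table, the action names, and the 0.8 threshold are invented for illustration.

```python
from typing import Optional

# Hypothetical user store: account number -> profile and permitted actions.
USERS = {
    "1001": {"name": "Alice", "allowed": {"check_balance"}},
    "1002": {"name": "Bob", "allowed": {"check_balance", "transfer"}},
}


def identify(claimed_account: str) -> Optional[dict]:
    """Identification: who does the caller claim to be?"""
    return USERS.get(claimed_account)


def authenticate(voice_score: float, threshold: float = 0.8) -> bool:
    """Authentication: does the voice back up the claim?"""
    return voice_score >= threshold


def authorize(user: dict, action: str) -> bool:
    """Authorization: is this user allowed to do what was asked?"""
    return action in user["allowed"]


def handle_request(claimed_account: str, voice_score: float, action: str) -> str:
    user = identify(claimed_account)
    if user is None:
        return "unknown caller"
    if not authenticate(voice_score):
        return "authentication failed"
    if not authorize(user, action):
        return "not permitted"
    return "ok"
```

Because each check is separate, a caller who is correctly authenticated still cannot perform an action that the account is not permitted to perform.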

Nuance Management Station

The Nuance Management Station offers systems management, administration, and analysis capabilities designed to address the unique requirements of voice-driven services. System administrators and operators can manage and maintain all aspects of their speech systems to help ensure high service availability. Business managers can assess how well the system is delivering on key company objectives. See Figure 1, one of many such reports.

Track and Analyze Business Metrics

Within Nuance Management Station, its ROI Tracker enables call center managers to automatically measure and report in real-time the system’s cost-saving or revenue-generating performance against specific success metrics defined during the early phases of the speech application planning process. Call center managers also are able to conduct comprehensive business analysis using historical, trend, and performance data.
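
The kind of cost-saving metric such a report tracks can be approximated with simple arithmetic. This Python sketch is not the ROI Tracker's calculation; the function names are hypothetical, and any call volumes or system costs you feed it are your own assumptions. The only figure taken from this article is the $22 average help desk call cost cited earlier.

```python
def automation_savings(automated_calls: int, cost_per_agent_call: float) -> float:
    """Money saved by calls that no longer need a human agent."""
    return automated_calls * cost_per_agent_call


def roi(savings: float, system_cost: float) -> float:
    """Return on investment as a fraction (1.0 means the cost was earned back twice)."""
    if system_cost <= 0:
        raise ValueError("system_cost must be positive")
    return (savings - system_cost) / system_cost


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 automated resets at $22 each, $11,000 system cost.
    saved = automation_savings(1_000, 22.0)
    print(saved, roi(saved, 11_000.0))
```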

Figure 1: A call volume report

Assess Systems Performance

Systems administrators can retrieve data and conduct analysis pertaining to provisioning requirements, service performance, and CPU and memory utilization in order to increase system uptime and operational efficiency.

Test and Tune the Speech Application

Speech application tuners can conduct in-depth usability assessments and take advantage of the system’s Listen & Learn feature, which can automatically tune the system based on the unique speech patterns of users interacting with the service. Standard and customizable reports deliver in-depth speech recognition related statistics, allowing the tuner to analyze, pinpoint and tune trouble spots in the application to increase automated call completion rates, and improve the customer experience.

Multiple Verification Strategies—Risk Mediation

RSA Adaptive Authentication for Phone is the industry’s first risk-based, multi-factor authentication (MFA) solution designed to protect a financial institution’s telephone banking customers. Utilizing several factors to authenticate telephone banking users, Adaptive Authentication for Phone helps financial institutions reduce fraud through increased security and audit trails, reduce costs through automation, and address regulators’ recommendation for stronger authentication—all without burdening the end-user experience.

This risk-based authentication technology considers a series of parameters and generates a unique risk score, based on the likelihood that a given telephone transaction or activity is fraudulent. Adaptive Authentication for Phone provides behind-the-scenes authentication, allowing the majority of telephone banking callers to continue uninterrupted with their transaction. Only callers or transactions flagged as high-risk by the RSA Risk Engine are challenged with secondary authentication in the form of voiceprint matching, content matching, challenge response questions, and one-time passwords.

Although other solutions are designed to monitor only one or two risk parameters (such as a voiceprint match), Adaptive Authentication for Phone measures several phone-specific parameters—in addition to a voiceprint—to authenticate telephone banking callers and transactions. Supported by the Risk Engine, Adaptive Authentication for Phone considers factors such as Automatic Number Identification (ANI) matching and user behavior profiling (“Is this typical behavior for this user?”) in assessing the risk associated with a transaction and generating a unique risk score for the financial institution to use.
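
The general shape of risk-based authentication can be illustrated with a toy scoring function. The Python sketch below is not RSA's Risk Engine: the weights, the 0.5 challenge threshold, and the function names are all assumptions, and a production system would weigh many more parameters.

```python
def risk_score(ani_matches: bool, typical_behavior: bool, voice_score: float) -> float:
    """Combine a few phone-specific signals into a score from 0.0 to 1.0.

    Weights are arbitrary illustrative values, not taken from any product.
    """
    score = 0.0
    if not ani_matches:        # call is not from a number on file
        score += 0.4
    if not typical_behavior:   # activity is unusual for this user
        score += 0.3
    score += 0.3 * (1.0 - voice_score)  # weaker voiceprint match, more risk
    return score


def next_step(score: float, challenge_threshold: float = 0.5) -> str:
    """Low-risk calls continue uninterrupted; high-risk calls are challenged."""
    return "challenge" if score >= challenge_threshold else "allow"
```

Most callers fall below the threshold and never see a challenge, which is how this approach keeps the impact on the end-user experience low.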

The RSA eFraudNetwork is a cross-institution, cross-platform repository of known fraud data gleaned from RSA’s extensive network of banks, credit unions, debit and credit card issuers, ISPs, and third-party contributors across the globe. When a suspicious or confirmed fraudulent phone number is identified, the information is entered into the shared centralized database. The information then is disseminated to eFraudNetwork members in real-time to prevent future attacks, thus providing proactive protection to financial institutions and their customers.

Cross-Channel Protection

Securing the online and telephone banking channels is only the first step in creating a comprehensive cross-channel strategy. If silos still exist and technologies are not engineered to work together, even the best security solutions will do little to protect against the threat of cross-channel fraud. Tracking and identifying suspicious or confirmed fraudulent transactions across both channels demonstrates the compelling strength of cross-channel protection. For instance, a fraudster might compromise an online banking account to reset the password and change the genuine customer’s address. The fraudster might then use the same compromised credentials within the financial institution’s telephone banking system to transfer a large sum of money. With cross-channel security measures in place, the financial institution in this example would immediately recognize the password and address change that occurred online the day before and deem the phone-based money transfer a high-risk transaction. In turn, the fraudster would be challenged with secondary authentication, such as a one-time password or an additional voiceprint sample, to complete the transaction.
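
The cross-channel check in that example can be sketched as a lookup over recent account events. This Python sketch rests on assumptions: the event dictionary layout, the event type names, and the two-day window are hypothetical, and a real system would draw on a shared event store rather than an in-memory list.

```python
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical names for account changes that make a later transfer suspicious.
SENSITIVE_EVENTS = {"password_reset", "address_change"}


def is_high_risk_transfer(events: Iterable[dict], now: datetime,
                          window: timedelta = timedelta(days=2)) -> bool:
    """Flag a phone transfer if sensitive online changes happened recently."""
    for event in events:
        age = now - event["time"]
        if (event["channel"] == "online"
                and event["type"] in SENSITIVE_EVENTS
                and timedelta(0) <= age <= window):
            return True
    return False
```

A transfer that returns `True` here would be routed to secondary authentication instead of being completed directly.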

In short, Adaptive Authentication for Phone:

  • Performs a behind-the-scenes risk assessment for callers and applies additional security as needed to ensure the lowest impact on end users
  • Constructs audit trails for compliance and tracking
  • Considers a series of risk parameters including an optional biometric voiceprint (leveraging the Nuance Verifier technology) to ensure the highest levels of security and accuracy


Security features do not imply security! Security is a process, not a product. Security results not from using a few security features in the design of a product, but from how that product is implemented, tested, maintained, and used. Reference 11 contains an in-depth discussion of this subject.

Appendix 1: Voice Browser

A voice browser is exactly analogous to an ordinary Web browser, except that instead of a keyboard, mouse, and monitor you use a microphone, keypad, and speaker. Instead of the visually-oriented HTML, a voice browser processes pages of VoiceXML. Both kinds of browsers use the same Web infrastructure: HTTP, cookies, Web caches, URLs, secure HTTP, and so on. Because VoiceXML is a standard from the World Wide Web Consortium, just like HTML is, voice applications complying with the standard will run on any compliant VoiceXML voice browser, making your investments safe and future-proof. Voice applications are no longer locked into proprietary systems.

To talk with a voice browser, simply dial its phone number. You’ll be connected to a voice server that runs scores, hundreds, or perhaps even thousands of voice browsers, one per caller. When your voice browser starts up, it fetches and evaluates an initial page of VoiceXML it obtains from an ordinary Web server. This page tells the voice browser what to say to you, and also what to expect you to say in return. As the conversation proceeds, the VoiceXML page will reach a stage where it needs to submit the information it has collected to the Web server. The Web server will process this information and generate the next VoiceXML page that you’ll listen to. Finally, the conversation is over when you hang up, or when the last VoiceXML page directs the voice browser to disconnect the call.
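
The fetch, prompt, submit, and fetch-again cycle described above can be simulated in a few lines of Python. This is a toy model, not a VoiceXML interpreter: plain dictionaries stand in for VoiceXML pages, the `web_server` function stands in for an HTTP request, and all names and URLs are hypothetical.

```python
from typing import List, Optional

# Stand-ins for VoiceXML pages: each has a prompt and a URL to submit to.
PAGES = {
    "/start": {"prompt": "Say your account number.", "submit_to": "/account"},
    "/account": {"prompt": "Thank you. Goodbye.", "submit_to": None},
}


def web_server(url: str, form_data: Optional[str] = None) -> dict:
    """Pretend Web server; a real one would use form_data to build the page."""
    return PAGES[url]


def run_call(caller_inputs: List[str]) -> List[str]:
    """Play prompts and submit caller input until the call ends."""
    spoken_prompts = []
    inputs = iter(caller_inputs)
    url: Optional[str] = "/start"
    data: Optional[str] = None
    while url is not None:
        page = web_server(url, data)
        spoken_prompts.append(page["prompt"])
        if page["submit_to"] is None:
            break                   # last page: disconnect the call
        data = next(inputs, None)
        if data is None:
            break                   # caller hung up
        url = page["submit_to"]
    return spoken_prompts
```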

Your voice browser has an audio playback system that plays pre-recorded audio to you. It has a text-to-speech system that renders ordinary textual information into audio for you as well. When you respond to audio prompts, the voice browser’s speech recognition system extracts meaning from what you say.

To experience all of this first-hand, dial 1-800-555-TELL and talk away.

Appendix 2: Multi-Factor Authentication

Multi-factor authentication is becoming increasingly important as a defense against growing threats of security attacks, especially security attacks based on obtaining an individual’s password via trickery.

The Federal Financial Institutions Examination Council (FFIEC), which provides guidance to examiners and financial institutions on the characteristics of an effective information technology (IT) audit function, recommends that financial institutions employ two of the following three factors to maximize security:

Factor                          Example
Something the user possesses    A token, ATM card, or USB device
Something the user knows        A shared secret, password, or account number
Something the user is           A fingerprint, iris scan, or voice print
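
A two-of-three rule can be checked by counting distinct factor categories, not distinct credentials. In this minimal Python sketch the category labels ("possession", "knowledge", "inherence") and the function name are assumptions made for illustration.

```python
from typing import Iterable, Tuple

FACTOR_TYPES = {"possession", "knowledge", "inherence"}


def satisfies_two_factor(checks: Iterable[Tuple[str, bool]]) -> bool:
    """True if passed checks cover at least two different factor categories.

    Each check is a (category, passed) pair. Two passed checks from the
    same category, such as a password plus an account number, count once.
    """
    passed_types = {kind for kind, passed in checks
                    if passed and kind in FACTOR_TYPES}
    return len(passed_types) >= 2
```

For example, a spoken account number plus a voiceprint match covers "knowledge" and "inherence" and so passes, while a password plus a security question covers only "knowledge" and does not.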

Appendix 3: SOA and Web Services

SOA and Web services are two different things, but Web services are the preferred standards-based way to realize SOA.

SOA is an architectural style for building software applications that use services available in a network such as the Web. It promotes loose coupling between software components so that they can be reused. Applications in SOA are built based on services. A service is an implementation of a well-defined business function, and such services can then be consumed by clients in different applications or business processes.

SOA allows for the reuse of existing assets where new services can be created from an existing IT infrastructure of systems. In other words, it enables businesses to leverage existing investments by allowing them to reuse existing applications, and promises interoperability between heterogeneous applications and technologies. SOA provides a level of flexibility that wasn’t possible before in the sense that:

  • Services are software components with well-defined interfaces that are implementation-independent. An important aspect of SOA is the separation of the service interface (the what) from its implementation (the how). Such services are consumed by clients that are not concerned with how these services will execute their requests.
  • Services are self-contained (perform predetermined tasks) and loosely coupled (for independence).
  • Services can be dynamically discovered.
  • Composite services can be built from aggregates of other services.
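
The separation of interface from implementation can be shown with a small Python sketch using `typing.Protocol`. The service name, the two implementations, and their thresholds are hypothetical; in a real SOA the interface would typically be a Web service contract instead of an in-process type.

```python
from typing import Protocol


class VerificationService(Protocol):
    """The 'what': any service that can verify a caller."""

    def verify(self, account: str, voice_score: float) -> bool: ...


class StrictVerifier:
    """One 'how': demands a very close voiceprint match."""

    def verify(self, account: str, voice_score: float) -> bool:
        return voice_score >= 0.9


class LenientVerifier:
    """Another 'how': accepts a looser match."""

    def verify(self, account: str, voice_score: float) -> bool:
        return voice_score >= 0.7


def grant_access(service: VerificationService, account: str,
                 voice_score: float) -> str:
    """Client code depends only on the interface, not on either implementation."""
    return "granted" if service.verify(account, voice_score) else "denied"
```

Swapping `StrictVerifier` for `LenientVerifier` changes the outcome without touching `grant_access`, which is the loose coupling the list above describes.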

See Reference 5 for further discussion of Business Process Management (BPM), Service Oriented Architecture (SOA), and Web services.


References

  5. Harvey, M. Essential Business Process Modeling, O’Reilly (2005)
  6. Dunn, M. Pro Microsoft Speech Server 2007: Developing Speech Enabled Applications with .NET, Apress (2007)
  7. Anderson, E. et al. Software Engineering for Internet Applications, MIT Press (2006)
  8. Kung, S. Y. et al. Biometric Authentication, Prentice Hall (2005)
  9. Carr, H., Snyder, C. Data Communication and Network Security, McGraw-Hill (2006)
  10. Peinado, A., Segura, J. Speech Recognition Over Digital Channels, Wiley (2006)
  11. Daswani, N. et al. Foundations of Security: What Every Programmer Needs to Know, Apress (2007)

About the Author

Marcia Gulesian is an IT strategist, hands-on practitioner, and advocate for business-driven architectures. Marcia has served as software developer, project manager, CTO, and CIO. She is the author of well more than 100 feature articles on IT, its economics, and its management.
