
Designing an Interactive Voice Response System Using VoiceXML and CCXML

March 28, 2008


Voice is by far the most natural way for humans to interact. When people communicate with computers, voice adds flexibility and ease of use: the caller does not need to type anything and can be on the move when requesting a service. A voice-interactive system also adds the personal touch so often missing from transactions over the Internet, which raises the user's comfort level.

Assuming an underlying system that interprets the interactions, an interactive voice response (IVR) system requires three basic technologies:

  1. Voice recognition: Recognition of a phrase against a finite set of possible matches
  2. Voice transcription: Speech to text
  3. Speech synthesis: Text to speech (TTS)

Aim of the Article

In this article, you will walk through a real-life implementation of a voice user interface in a travel portal. You will start with the basic design of the IVR, identify the challenges of the system and the shortcomings of the preliminary design, update the design accordingly, and conclude with the final architecture of the IVR.

Description of the Implementation

The implementation under discussion is a voice portal for travel planning. The caller is first authenticated by an automatic voice recognition system. After authentication, the caller is authorized, based on role, to use the travel-related business services offered by the portal, and should be able to search flights and plan a trip. The voice interactive system guides the caller throughout. Depending on the caller's consent and preferences, the caller can listen to advertisements while the request is being processed.

Designing the Voice User Interface (VUI)

The voice user interface is developed using VoiceXML 2.1 and CCXML 1.0 and hosted on Voxeo Prophecy server 8.0. The Nuance Grammar Specification Language (GSL) is used to specify the grammar of the communication.


VoiceXML is a W3C standard XML format for specifying interactive voice dialogues and is used here as the mechanism for building the VUI. VoiceXML handles synchronous communication: it interacts with one user at a time over an audio channel and handles the events thrown during that interaction. VoiceXML supports both application-directed and mixed-initiative interactions with the user.
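As a minimal sketch of what such a dialogue looks like, the following self-contained VoiceXML document asks one application-directed question and handles the two events a recognizer most commonly throws, `noinput` and `nomatch`. The form name `askCity`, the field name `city`, and the two-city inline grammar are illustrative assumptions, not part of the travel portal.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="askCity">
    <!-- One field models one question; the form is visited until it is filled -->
    <field name="city">
      <!-- Inline grammar, used only to keep the sketch self-contained -->
      <grammar mode="voice" version="1.0" root="cities">
        <rule id="cities" scope="public">
          <one-of>
            <item>boston</item>
            <item>new york</item>
          </one-of>
        </rule>
      </grammar>
      <prompt>Which city please?</prompt>
      <!-- Thrown when the caller says nothing before the timeout -->
      <noinput>
        <prompt>Sorry, I did not hear anything.</prompt>
        <reprompt/>
      </noinput>
      <!-- Thrown when the utterance does not match the grammar -->
      <nomatch>
        <prompt>Sorry, I did not understand.</prompt>
        <reprompt/>
      </nomatch>
      <!-- Runs once the field has a recognized value -->
      <filled>
        <prompt>Your city is <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Everything here happens in sequence on a single audio channel: the interpreter prompts, listens, and reacts to one event at a time.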

In this application, the basic VUI comprises several VoiceXML documents, as shown in Figure 1:

Figure 1: The Main VoiceXML Flow

<form id="Welcome">
<property name="bargein" value="false"/>
         Welcome to Travel Search Application
         <break time="1.5s"/>
<!--     <goto next="#getWebservice"/>     -->
         <goto next="#getUserCities"/>

<form id="getUserCities">
<property name="bargein" value="false"/>
<field name="srcCity" expr="undefined" cond="true" modal="true">
   <clear namelist="srcCity"/>
   <grammar src="grammar/city.grammar#CITY"
            type="text/gsl" mode="voice" />

<prompt bargein="false">Which city please?
   <break time="1.5s"/>
   <filled namelist="srcCity" mode="all">
      <assign name="departurecity"
      <assign name="departureCityCode"
      <prompt>Your city is <value expr="departurecity"/>
         <break time="1.5s"/>
      <goto next="#getUserListing"/>

<form id="getUserListing">
<property name="bargein" value="false"/>
<field name="listing1"
   <clear namelist="listing1"/>
   <grammar src="grammar/Airlines.grammar#AIRLINES"
            type="text/gsl" mode="voice" />

<prompt bargein="false">What Listing please?
   <break time="1.5s"/>
   <filled namelist="listing1" mode="all">
      <assign name="listing"
      <prompt>You have selected <value expr="listing"/>
         <break time="1.5s"/>
      <goto next="#connectUser"/>

<form id="connectUser">
<property name="bargein" value="false"/>
   <field name="answer">
      <prompt bargein="false">
         Would you like me to help you find the lowest fare
         before I connect you to <value expr="listing"/>
      <break time="1.5s"/>
         <grammar src="grammar/yesno1.grammar#YES_NO"/>
         <if cond="answer=='yes'">
            <goto next="#getWillDep"/>
         <elseif cond="answer=='no'"/>
            <goto next="#getAdvertise"/>

Listing 1: Snippet from userDetails.vxml

<form id="getWebservice">
   <property name="bargein" value="false"/>
         <log expr="' travelsearch: departureDate =
         <log expr="' travelsearch: returnDate =
         <log expr="' travelsearch: departureCity =
         <log expr="' travelsearch: destinationCity =

      <data name="MyData"

            departureCity='   + departureCityCode + '  &&
            destinationCity=' + destinationCityCode + '&&
            departureDate='   + departureDate + '      &&
            returnDate='      + returnDate"
         <assign name="document.MyData"
         <assign name="response"
         <log expr="' travelsearch: response = ['+response+']'"/>

      <exit namelist="response"/>

Listing 2: Snippet from userTravelSearchDetails.vxml

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="">

   <var name="resp"

   <form id="getWebservice">
         <log expr="'THE SEARCH VoiceXML ' + resp"/>
            <value expr="resp"/>
            <break time="1.5s"/>

Listing 3: userTravelSearchResults.vxml


Grammars specify the constraints on the expected utterance, whether it is initiated by the user or given to the system as a response to a directed dialogue. A sample grammar used by the VoiceXML documents in the application is shown in Listing 4:

(new [york yok yak])        {<city "new york ">      <code "NYC">}
(san [jose hose hoze])      {<city "sanjose ">       <code "SJC">}
[boston bostan bostun]      {<city "boston ">        <code "BOS">}
(san [francisco fransisco]) {<city "sanfrancisco ">  <code "SFO">}

Listing 4: Sample grammar file for Cities
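Each grammar entry returns two slots, `city` and `code`. The sketch below shows one way a field could consume them. It assumes that, because no slot is named after the field, the platform assigns the whole slot structure to the field variable, so the slots can be read as `srcCity.city` and `srcCity.code`. The variable names follow Listing 1; the expressions are an assumption about how they were populated.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-scoped variables, so later forms can read them -->
  <var name="departurecity"/>
  <var name="departureCityCode"/>

  <form id="getUserCities">
    <field name="srcCity">
      <grammar src="grammar/city.grammar#CITY"
               type="text/gsl" mode="voice"/>
      <prompt>Which city please?</prompt>
      <filled>
        <!-- Assumption: srcCity holds the structure {city, code}
             returned by the matching grammar entry -->
        <assign name="departurecity" expr="srcCity.city"/>
        <assign name="departureCityCode" expr="srcCity.code"/>
        <prompt>Your city is <value expr="departurecity"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Keeping the spoken name and the airport code in separate slots lets the prompt read back a natural name while the search request carries the code.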

Challenges of Voice User Interaction using VoiceXML

Often, the main challenge in a VUI is its dependence on the speaker. Every speaker's voice differs in qualities of speech such as amplitude, frequency, and context. In addition, the speaker's locale affects the language and dialect used.

To avoid this problem here, the VUI is designed to be a semi-automatic user interface. If the system fails to comprehend the caller's utterances, the call will be forwarded to a trained human operator without loss of any data. The operator will be in conference with the caller, record the information of the user session, and forward the request to the business service. The rest of the workflow will remain the same.
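The VoiceXML side of such an escalation might look like the following sketch. The flag name `needsOperator` and the threshold of three attempts are assumptions made for illustration; the field and grammar names follow Listing 1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Hypothetical flag returned to the calling session when the dialog gives up -->
  <var name="needsOperator" expr="false"/>

  <form id="getUserCities">
    <field name="srcCity">
      <grammar src="grammar/city.grammar#CITY"
               type="text/gsl" mode="voice"/>
      <prompt>Which city please?</prompt>

      <!-- Early failures: apologize and ask again -->
      <catch event="nomatch noinput">
        <prompt>Sorry, I did not get that.</prompt>
        <reprompt/>
      </catch>

      <!-- Third occurrence of the same event: stop trying and
           return control together with the flag -->
      <catch event="nomatch noinput" count="3">
        <assign name="needsOperator" expr="true"/>
        <exit namelist="needsOperator"/>
      </catch>

      <filled>
        <!-- Recognition succeeded; the normal flow continues from here -->
        <prompt>Thank you.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Event counters are kept per event name, so the escalation fires on the third `nomatch` or the third `noinput`, not on a mix of the two.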

Because the current implementation of VUI is in a real-life commercial application, there also is a need to introduce advertising while the caller waits for the search results.

However, VoiceXML cannot handle multiple simultaneous events, asynchronous communication, or call conferencing. Hence, CCXML is introduced to modify the workflow.


Call Control XML (CCXML) is the W3C standard that complements VoiceXML with advanced telephony capabilities: handling asynchronous and multiple events, conferencing, switching between audio channels, and so forth. CCXML is used in the application in the following scenarios:

  1. Handling multiple events: In the travel search call flow, certain spots were selected for playing advertisements. One such spot is the search wait time, during which two dialogs are executed in parallel:
    1. Fetch the advertisement and play it until the travel search completes
    2. Search for travel details
  2. Call conferencing:
    1. Conferencing with an external human operator when the user's utterance is not recognized by the voice platform.
    2. In travel search, fetching the number for the listing from the database and placing the call to the dealer.
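The conferencing scenario can be sketched in CCXML roughly as follows. This is a simplified illustration, not the application's actual file: the operator number `tel:+15555550100`, the state names, and the `needsOperator` value returned by the dialog are all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentstate" expr="'initial'"/>
  <var name="in_connectionid"/>
  <var name="out_connectionid"/>

  <eventprocessor statevariable="currentstate">
    <!-- Incoming call: remember the caller's connection and answer -->
    <transition event="connection.alerting" state="initial">
      <assign name="in_connectionid" expr="event$.connectionid"/>
      <accept/>
    </transition>

    <transition event="connection.connected" state="initial">
      <assign name="currentstate" expr="'in_dialog'"/>
      <dialogstart src="'userDetails.vxml'"/>
    </transition>

    <!-- The dialog gave up: place an outbound call to the operator.
         needsOperator is a hypothetical value returned by the dialog. -->
    <transition event="dialog.exit" state="in_dialog"
                cond="event$.values &amp;&amp; event$.values.needsOperator == true">
      <assign name="currentstate" expr="'calling_operator'"/>
      <createcall dest="'tel:+15555550100'"
                  connectionid="out_connectionid"/>
    </transition>

    <!-- Operator answered: bridge the two call legs -->
    <transition event="connection.connected" state="calling_operator">
      <assign name="currentstate" expr="'conferenced'"/>
      <join id1="in_connectionid" id2="out_connectionid"/>
    </transition>

    <!-- Operator unreachable, or either party hung up: end the session -->
    <transition event="connection.failed" state="calling_operator">
      <exit/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

The same `createcall` and `join` pattern would apply to calling the dealer, with the destination taken from the database lookup instead of a fixed number.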

Using CCXML, the workflow is modified as shown in Figure 2:


Figure 2: State Diagram for CCXML and Modified VoiceXML Flow Using Advertising

Legend for Figure 2:

1. Gets the user details, including user preferences.
2A. Gets the advertisement information according to the user preferences, i.e., the ad link.
2B. Gets the travel related details.
3A. Plays the advertisement.
3B. Gets the travel search related details and performs the travel search.
4. Gets the search results.
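The fork and join in steps 2A to 4 can be sketched as a small CCXML session. It is a simplified illustration: step 1 is omitted, the `src` values for the search and results dialogs are assumed from the listing names in this article, and passing `response` through `namelist` is an assumption about how the result reaches the final dialog. How the two parallel dialogs share the caller's audio depends on the platform and is not shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentstate" expr="'initial'"/>
  <var name="advertisingService"/>
  <var name="businessService"/>
  <var name="searchResult"/>
  <var name="response"/>
  <var name="playedAd" expr="false"/>
  <var name="completedSearch" expr="false"/>

  <eventprocessor statevariable="currentstate">
    <transition event="connection.alerting" state="initial">
      <accept/>
    </transition>

    <!-- Fork (2A and 2B): start both dialogs without waiting for either -->
    <transition event="connection.connected" state="initial">
      <assign name="currentstate" expr="'travelsearch'"/>
      <dialogstart dialogid="advertisingService"
                   src="'advertisement.vxml'"/>
      <dialogstart dialogid="businessService"
                   src="'userTravelSearchDetails.vxml'"/>
    </transition>

    <!-- Join (3A, 3B, 4): each dialog.exit sets one flag;
         the results dialog starts only when both flags are set -->
    <transition event="dialog.exit" state="travelsearch">
      <if cond="event$.dialogid == businessService">
        <assign name="response" expr="event$.values.response"/>
        <assign name="completedSearch" expr="true"/>
      <elseif cond="event$.dialogid == advertisingService"/>
        <assign name="playedAd" expr="true"/>
      </if>
      <if cond="completedSearch &amp;&amp; playedAd">
        <assign name="currentstate" expr="'completed'"/>
        <dialogstart dialogid="searchResult"
                     src="'userTravelSearchResults.vxml'"
                     namelist="response"/>
      </if>
    </transition>

    <!-- Results have been played, or the caller hung up -->
    <transition event="dialog.exit" state="completed">
      <exit/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Because each `dialog.exit` arrives as a separate asynchronous event, the two flags are what turn two independent completions into a single join point.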

A sample CCXML file from this application is shown in Listing 5:

<?xml version="1.0"?>
<ccxml version="1.0">

<!-- Miscellaneous variables-->
<var expr="'initial'" name="currentstate" />
<var name="in_connectionid" />
<var name="out_connectionid" />
<var name="accepted" />

<var name="searchResult"/>
<var name="call_id"/>
<var name="dialog_id"/>
<var name="advertisingService"/>
<var name="businessService" />

<!-- Variables to calculate call length -->
<var name="callStart"/>
<var name="callEnd"/>
<var name="callLength"/>

<!-- Variables to hold the values entered by the caller -->
<var name="departureCityCode" />
<var name="destinationCityCode" />
<var name="departureDate" />
<var name="returnDate"/>
<var name="response"/>

<!-- Variable flags to check which services returned -->
<var name="playedAd" expr="false"/>
<var name="completedSearch" expr="false"/>


<assign name="callStart" expr="new Date().getTime()"/>
<log expr="'*** creating another dialog. ***'"/>


<eventprocessor statevariable="currentstate">

   <transition event="connection.alerting" state="initial">
      <assign expr="event$.connectionid"
              name="in_connectionid" />
      <accept />

   <transition event="connection.connected" state="initial">
      <log expr="'*** Started main vxml . ***'"/>
      <dialogstart dialogid="main" src="'userDetails.vxml'" />
      <log expr="'*** executed main vxml . ***'"/>
      <assign expr="'welcoming_caller'" name="currentstate"/>

   <transition event="dialog.exit"
               state="welcoming_caller" name="evt">
      <!-- place the caller on hold -->

      <log expr="'*** in dialog.exit . ***'"/>

      <!-- Get the values from userDetails.vxml -->
      <assign name="departureCityCode"
      <assign name="destinationCityCode"
      <assign name="departureDate"
      <assign name="returnDate"

      <log expr="'DC =     ['+departureCityCode+']'"/>
      <log expr="'DestDC = ['+destinationCityCode+']'"/>
      <log expr="'DD =     ['+departureDate+']'"/>
      <log expr="'RD =     ['+returnDate+']'"/>

      <!-- Invoke advertising service and play the ad -->
      <dialogstart dialogid="advertisingService"
                   src="'advertisement.vxml'" />

      <!-- Invoke flight business service -->
      <dialogstart dialogid="businessService"
                       departureDate returnDate"/>

      <assign expr="'travelsearch'" name="currentstate" />


   <transition event="dialog.exit"
      <!-- get the response from travel search, pass it to
           another vxml to play -->
      <log expr="'MyEventName =   ['+evt.dialogid+']'"/>
      <log expr="'MyEventName2 is ['+businessService+']'"/>
      <log expr="'MyEventName2 is ['+advertisingService+']'"/>
      <log expr="'Checking dialog exits.'"/>

   <if cond="evt.dialogid == businessService">
      <log expr="'MyData = ['+evt.values.response+']'"/>
      <log expr="'exiting from travelsearch'"/>
      <assign name="response" expr="evt.values.response"/>
      <assign name="completedSearch" expr="true"/>
      <log expr="'response = '+response+''"/>
   <if cond="evt.dialogid == advertisingService">
      <!-- the advertisement dialog has finished; record that
           the ad was played -->
      <log expr="'exiting from advertisement'"/>
      <assign name="playedAd" expr="true"/>
   <log expr="'FLAG AD = ' + playedAd"/>
   <log expr="'FLAG Search = ' + completedSearch"/>
   <if cond="completedSearch == true">
      <if cond="playedAd == true">
         <log expr="'ALL DONE. CAN TERMINATE!'"/>
         <dialogstart dialogid="searchResult"
         <assign expr="'completed'" name="currentstate"/>


   <transition event="dialog.exit" state="completed" name="evt">

<!-- Clean up the call -->

   <transition state="completed" event="connection.disconnected" >
      <log expr="'Call has been disconnected.
                  Ending CCXML Session.'"/>



Listing 5: Sample CCXML file

IVR Architecture

The final architecture of the IVR is shown in Figure 3:


Figure 3: Architecture of the IVR


To create an effective IVR system, you need to understand both the physical qualities of speech and the capabilities of the VUI technologies available today.

About the Authors

Ponkumar is a Senior Software Engineer in the SOA Technology group in Photon Infotech, a next generation, high-end Internet consulting company. He has four years' experience in developing enterprise Java applications using open source Java frameworks and is currently working on a SOA implementation of an automated IVR system using ESB, Web Services, VXML, and CCXML. He can be reached at

Sujata De is a Director of Technology in Photon Infotech, a next generation, high-end Internet consulting company. She holds a Bachelor's degree in engineering from the Indian Institute of Technology, Kharagpur, followed by an M.S. from the Indian Institute of Science, Bangalore. She has worked extensively on web services, SOA, and J2EE technologies. She can be reached at
