The Longhorn speech API builds voice command support into the operating system (OS) itself, a significant step forward in what an OS provides. With an alternate user interface available, future products may offer a very different mode of interaction from the one in use today. This article covers the basics of voice input and output, the construction of a grammar, and how to act on recognition.
The System.Speech.Synthesis namespace provides an easy way to read text output back to the user. By simply declaring a voice object, the developer can pass string output to the voice synthesis engine to be read back as speech.
Voice v = new Voice();
v.Speak("Hello world, I am a spoken Longhorn application.");
One notable annoyance can be seen in the provided sample code. For some non-alphanumeric characters, the engine reads the name of the character aloud. For example, the following shows the difference between the input string and what is read aloud to the user:
Input: Today's game resulted in a 5 to 4 win (The Sporting News)
Read aloud: Today's game resulted in a 5 to 4 win left parenthesis the sporting news right parenthesis
I imagine this is done so that those who rely on text-to-speech for their output get as much of the full experience of the document as those who simply read the text.
A few simple properties of the voice object can be set to adjust it to your liking, such as priority, rate (speed), and volume, but the greater functionality comes through the supplied methods and especially the events. In addition to the Speak method shown in the previous examples, the Voice object also provides SpeakFile, as well as asynchronous versions of both methods for the developer who does not wish to handle the threading manually.
For the developer who does wish to handle threading manually, events such as SpeakStarting, SpeakProgressChanged, and SpeakCompleted are provided. One could wire up a delegate to tell the calling thread that spoken output has finished and that it is safe to listen for voice input again.
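Returning to the punctuation issue described earlier: an application that would rather not hear symbol names read aloud could clean the string before handing it to Speak. The following is only a minimal sketch of that idea. SpeechText and StripBrackets are hypothetical names, not part of the speech API, and the sketch assumes that dropping bracket characters is acceptable for the text being read.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Longhorn speech API): removes
// bracketing characters so the synthesizer has no symbol names to read.
public static class SpeechText
{
    public static string StripBrackets(string text)
    {
        // Replace each bracket character with a space...
        string noBrackets = Regex.Replace(text, @"[()\[\]{}]", " ");
        // ...then collapse runs of whitespace and trim the ends.
        return Regex.Replace(noBrackets, @"\s+", " ").Trim();
    }
}

public static class SpeechTextDemo
{
    public static void Main()
    {
        string raw = "Today's game resulted in a 5 to 4 win (The Sporting News)";
        // Prints: Today's game resulted in a 5 to 4 win The Sporting News
        Console.WriteLine(SpeechText.StripBrackets(raw));
        // The cleaned string could then be passed to v.Speak(...).
    }
}
```

Whether to strip such characters is a judgment call. As noted above, reading them aloud may be exactly what a listener who depends on text-to-speech wants.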
Speech recognition through the .NET Framework is just as simple as speech synthesis, and even more powerful. All of the classes, methods, and events shown here may be found in the System.Speech.Recognition namespace. Recognition relies on grammar definition files, which the Microsoft Speech SDK's tools for Visual Studio can create. These files, with the extension “.grxml”, hold collections of rules and items for the recognizer to act on.
Grammar g = new Grammar();
g.Load("CarCommands.grxml");
This code will construct a grammar that will recognize any commands contained within the GRXML file. As a side note, the Visual Studio interface to the Speech SDK will only install on Visual Studio 2003 under Windows Server 2003 or Windows XP. Unfortunately, this currently prevents a developer from installing the SDK under Whidbey or Longhorn, so the GRXML files must be coded by hand or edited as plain XML files inside Visual Studio.
<grammar root="News" xml:lang="en-US" version="1.0">
  <rule id="MyCarStartCommand" scope="public">
    <one-of>
      <item>start my car</item>
      <item>car start</item>
      <item>go vrooom</item>
    </one-of>
    ...
The fragment above is part of a valid GRXML file. It defines a rule named MyCarStartCommand, which is fulfilled whenever any of the items in the “one-of” node is recognized. To actually listen for those items, however, events must be wired up to the grammar object. The grammar has two essential events for this, Recognition and NoRecognition, which represent the pass and fail conditions for the recognized text.
g.Recognition += new RecognitionEventHandler(g_Recognition);
g.NoRecognition += new RecognitionEventHandler(g_NoRecognition);
The recognition handlers are now wired up, and the only thing left to do is to turn on the grammar so that it listens, which is done by setting the grammar’s IsActive property to true. The important thing to keep in mind here is that, at least in the current public version of Longhorn, speech recognition and synthesis consume a large amount of processor time. This cost can be minimized by turning listeners on and off at the right moments and by using asynchronous calls.
Consider the following scenario. The code snippets above set up a grammar that recognizes three phrases for starting a car. Whenever the user says any of the three phrases, the g_Recognition method will fire, and the grammar should immediately be set to inactive until that input has been handled. This spares the user a poorly performing application that is still listening for input while it works. If, as in the sample application, the response to recognized speech is synthesized speech, and the user's speakers are within range of the microphone, there is also the risk of the recognition handler acting on the synthesized speech as if it were input.
I prefer to use the synthesis event, SpeakCompleted, to signal through a delegate that the speaking thread has finished sending its output, and to set the grammar’s IsActive property back to true at that point. In this way, the application performs well because its main part is never blocked waiting for more user input or for speech synthesis to complete.
There is a plethora of other applications for this technology, and I’m certain that a new breed of applications will appear with the forthcoming release of Longhorn.
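The listen, pause, and resume pattern described above can be modeled without the speech engine. The following is a rough sketch only: StandInGrammar and StandInVoice are hypothetical stand-ins written for this illustration, not the Longhorn classes, and their Hear and FinishSpeaking methods simply simulate microphone input and the end of playback. Only the IsActive, Recognition, NoRecognition, and SpeakCompleted names mirror the members discussed in the article, and the delegate signatures are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the grammar object: NOT the Longhorn class, only a model of
// the IsActive / Recognition / NoRecognition behavior described in the text.
public class StandInGrammar
{
    private readonly List<string> phrases;
    public bool IsActive;
    public event Action<string> Recognition;
    public event Action<string> NoRecognition;

    public StandInGrammar(params string[] phrases)
    {
        this.phrases = new List<string>(phrases);
    }

    // Simulates the microphone picking up a phrase.
    // Returns false when the grammar is inactive and the input is ignored.
    public bool Hear(string spoken)
    {
        if (!IsActive) return false;
        if (phrases.Contains(spoken))
        {
            if (Recognition != null) Recognition(spoken);
        }
        else
        {
            if (NoRecognition != null) NoRecognition(spoken);
        }
        return true;
    }
}

// Stand-in for the voice object: records output instead of producing audio.
public class StandInVoice
{
    public readonly List<string> Spoken = new List<string>();
    public event Action SpeakCompleted;

    public void Speak(string text)
    {
        Spoken.Add(text);
    }

    // Simulates the engine reporting that playback has ended.
    public void FinishSpeaking()
    {
        if (SpeakCompleted != null) SpeakCompleted();
    }
}

public static class GatingDemo
{
    public static void Main()
    {
        StandInGrammar g = new StandInGrammar("start my car", "car start", "go vrooom");
        StandInVoice v = new StandInVoice();

        g.Recognition += delegate(string phrase)
        {
            g.IsActive = false;            // stop listening while responding
            v.Speak("Starting the car.");
        };
        v.SpeakCompleted += delegate { g.IsActive = true; };  // listen again

        g.IsActive = true;
        Console.WriteLine(g.Hear("car start"));          // True: recognized and handled
        Console.WriteLine(g.Hear("Starting the car."));  // False: ignored while speaking
        v.FinishSpeaking();
        Console.WriteLine(g.Hear("go vrooom"));          // True: listening again
    }
}
```

The second call shows why the gating matters: while the grammar is inactive, the application's own synthesized output cannot be mistaken for a user command.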
A Look Ahead
Next month, there will be even more practical Speech applications, as well as a look at how to integrate application functionality into Notification Tiles, nicknamed “toast” because of the way they pop up.
One Last Note
All of the provided code compiles under PDC builds of Longhorn (4051), Whidbey (m2.030828-1205), .NET Framework (1.2.30703), and the Longhorn SDK. The sample application provided, GetMyNews, shows what can be done by implementing the techniques mentioned in the article.