Semantic Interpretation of Speech

A couple of weeks ago, the W3C released the Semantic Interpretation for Speech Recognition (SISR) specification as a candidate recommendation. This is an important step forward in the development of the Speech Interface Framework, and will help further define the differences between voice applications built using VoiceXML and its sister technologies, and other ways of building voice applications.

To fully understand Semantic Interpretation, a little background is helpful. VoiceXML applications use grammars to define allowable utterances for caller interactions. The job of a speech recognition platform is to match the spoken input of a caller as closely as possible against the list of allowable utterances defined in a grammar, or to determine that there is no good match (in which case a “nomatch” event is thrown). When a successful match is made, a grammar can return to a VoiceXML application the exact sequence of words that was spoken (e.g., to fill a <field> variable), or it can return a representation of that information in another format.

Semantic Interpretation is the process of converting the raw text of spoken utterances into a representation of data that is more easily processed by computer programming languages.

In an application, knowing the exact sequence of words that was uttered is sometimes interesting, but it is often not the most practical way of handling the information contained in the utterance. What an application usually needs is a computer-processable representation of the information (the Semantic Result) rather than a natural language transcript.

To some extent, the facility to do this already exists within VoiceXML and the Speech Recognition Grammar Specification (SRGS). Most seasoned developers know that a grammar can use the <tag> element to return a specified value to a voice application when a match occurs. The value returned by the <tag> can then be processed by the application in place of the raw string of text that was matched from the grammar.

An example of a grammar that uses this approach can be viewed here. Note that when a caller speaks a name that is matched in this grammar, the value defined within the <tag> element is returned to the voice application for further processing (in this case, the value is returned to a <field> designated with a slot value of “department”).
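For readers who can’t follow the link, a minimal sketch of such a grammar is shown below. It uses the SISR “string literal” tag format; the names and department values are hypothetical, and real platforms may expect a different tag-format identifier.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="employee"
         tag-format="semantics/1.0-literals">
  <rule id="employee" scope="public">
    <one-of>
      <!-- the caller speaks a name; the tag value is what the app receives -->
      <item>John Smith <tag>sales</tag></item>
      <item>Mary Jones <tag>engineering</tag></item>
    </one-of>
  </rule>
</grammar>
```

In the VoiceXML document, a <field> with slot="department" would then be filled with “sales” or “engineering” rather than the spoken name itself.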

The purpose of the SISR specification is to define an expanded syntax for the Semantic Interpretation result. Understanding how these results will be structured is critical for making use of them within the larger context of a voice application.
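Under the full SISR script syntax (tag-format="semantics/1.0"), a <tag> contains ECMAScript that can build a structured result object rather than a single string. A hedged sketch, with hypothetical rule names, might look like this:

```xml
<rule id="order" scope="public">
  I want a
  <ruleref uri="#size"/>
  pizza with
  <ruleref uri="#topping"/>
  <tag>
    out.size = rules.size;       // result of the "size" ruleref
    out.topping = rules.topping; // result of the "topping" ruleref
  </tag>
</rule>
```

A match against “I want a large pizza with mushrooms” would then return an object along the lines of { size: "large", topping: "mushrooms" } to the voice application.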

Developers interested in Semantic Interpretation should also check out information on this topic in the VoiceXML 2.0 specification and the SRGS 1.0 specification.

Open Says Google

Google Talk — the IM and VoIP client from the Internet search giant — can now interoperate with a wider range of clients, thanks to recent enhancements.

Google Talk is built on the open XMPP specification, but when it was originally released several months back, it was restricted to use with other Google Talk clients. The recently announced enhancements make it possible for Google Talk to connect with other Jabber-based clients.

This is a major step forward, not only for Google Talk but for open standards that promote interoperability. Nice!

Also, for anyone with an interest in building IM apps that will now be able to connect with Google Talk, check out the PHP Jabber Class. It’s still a relatively young project, but I’ve found it useful for banging out prototypes of VoiceXML / IM applications. Worth a look.

Voice as the “Killer App”

The Churchill Club’s 8th annual “Top Ten Tech Trends Debate,” which took place in Palo Alto last week, was the scene for an interesting discussion about the future of voice technologies.

Steve Jurvetson, managing director of Draper, Fisher, Jurvetson, forecast that voice calls would become free within data networks and that voice would become the killer app for metropolitan wireless data networks.

“Voice runs as a free incremental service over data networks,” said Jurvetson. “It’s just like the transition to e-mail, you don’t meter it. Communication is the fundamental lubricant of the economy.”

<myPrediction> With the proliferation of wireless data networks, and the increased takeup of voice as the communications mode of choice, look for voice applications to play an increasingly important role in how commerce is conducted. </myPrediction>

Read more in-depth coverage of the event here.

Visual AJAX != Voice AJAX

There is an interesting article in the latest installment of Speech Technology Magazine, in which Jim Larson offers his thoughts on using the AJAX development technique in VoiceXML applications.

Larson points to a notable example of “AJAX-like” programming in VoiceXML. In fact, it’s one I have commented on before. My view of this technique hasn’t changed: it’s powerful, but it isn’t AJAX.

The way I view AJAX programming, there are two key hallmarks: processing of XML on the client using JavaScript and the DOM, and asynchronous communication with a back-end server over HTTP. The approach referenced by Larson includes the former but misses the latter.

Make no mistake, using the new VoiceXML 2.1 <data> element to fetch an XML document that can be used from within a VoiceXML script without a page transition is indeed handy. But there isn’t any asynchronous communication happening here – the external XML document is fetched only once, typically at the beginning of script execution. For asynchronous behavior, we need to look to one of VoiceXML’s sister technologies – the Call Control eXtensible Markup Language (CCXML).
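To make the <data> pattern concrete, here is a minimal sketch. The URL and element layout are hypothetical; VoiceXML 2.1 exposes the fetched document as a read-only DOM that can be navigated with ECMAScript.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- the XML document is fetched once, when the page loads -->
  <data name="quoteDoc" src="http://example.com/quotes.xml"/>
  <form>
    <block>
      <!-- navigate the read-only DOM to pull out a value -->
      <prompt>
        The current price is
        <value expr="quoteDoc.documentElement.firstChild.firstChild.data"/>
      </prompt>
    </block>
  </form>
</vxml>
```

Note that the fetch happens at load time; nothing here updates the document asynchronously while the dialog is running, which is exactly the point of the distinction above.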

The CCXML 1.0 specification (a last call working draft from the W3C) includes the <send> element, which allows a developer to throw a user-defined event that can be caught by the initiating (or another) CCXML document. The classic <send> example involves throwing a “timesUp” event, which lets a developer designate when a call should be terminated.
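A sketch of that classic “timesUp” pattern is below. It is based on the last call draft; attribute and event names have shifted between CCXML drafts, so treat the details as illustrative rather than definitive.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- when the call connects, schedule a user-defined event
         to be delivered back to this session in 30 seconds -->
    <transition event="connection.connected">
      <send name="'timesUp'" target="session.id" delay="'30s'"/>
    </transition>
    <!-- when the delayed event arrives, end the call -->
    <transition event="timesUp">
      <disconnect/>
    </transition>
  </eventprocessor>
</ccxml>
```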

The good folks at Voxeo (leaders in the effort to formalize the CCXML standard) have added functionality to their platform that allows the <send> element to asynchronously send and receive information via HTTP. So it’s theoretically possible to construct a conference call application that sends a message to a back-end script to kick off an IM message to the call organizer when there are 30 seconds left on the call. (Actually, that sounds pretty cool – may have to try that one soon.)

When it comes to AJAX in VoiceXML, the same functionality is generally available, but the needs of developers are somewhat different. Callers to a VoiceXML application probably don’t perceive page transitions in the same way that users of a visual web application do. VoiceXML has fetching and caching functionality built into it that can be used to manage the delay between page transitions. It’s probably a better use of developer time to understand fetching and caching more intimately than to try to replicate AJAX in voice applications.