Make the Cloud Listen (and Understand)

Yesterday I wrote a post about the changing cloud telephony landscape, and highlighted some key factors that will dictate which cloud telephony providers are around for the long haul and deliver the next innovations.

One of those factors – support for speech recognition – is a good differentiator for developers to use when choosing a cloud telephony platform.

Speech recognition is becoming increasingly important in our everyday lives. Smartphones and powerful handheld devices enable multimodality, and there are more and more restrictions placed on our use of phones while doing other things (like driving).

Plus, I can’t think of a more deflating concept than a cloud telephony provider that allows developers to build sophisticated apps and mashups in the language of their choice but that chains users of those apps to a telephone keypad. No fun.

To show how powerful speech recognition can be, and how easy it is to use with a cloud telephony provider that supports it, I worked up a small demo. The sample code for this demo is on GitHub, and we’ll dive into it in more detail below.

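This demo uses two PHP libraries that are designed to work with the Tropo platform (one of the few cloud telephony providers to support speech recognition):

  • The PHP WebAPI Library, used to build the Tropo script itself.
  • The PHPGrammar library, used to generate the grammar file.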

If you’ve read any of my previous posts on building applications for the Tropo platform, you’ll see lots of similarities between this and previous sample apps. Here I continue my use of the insanely awesome Limonade Framework for PHP.

Let’s take the example of a company directory that allows callers to dial a single number, select a person or department at the company and then be transferred to the person they select.

With cloud telephony, there is no need to have such a system live on a machine in the server room – it can be hosted externally in the cloud, making it easier to manage and to scale. In addition, with the Tropo Platform, it doesn’t have to be the same tired old DTMF-based menu telling callers to press an extension number or to “dial by name…”.

Using the PHP WebAPI Library and Limonade, we can construct a simple yet powerful script that looks like this:

This script is pretty self-explanatory, but there are some key points I want to emphasize. First, note the $options array that holds the reference to an external grammar file (more on that in a bit). Tropo seems to need this reference to be absolute rather than relative (not hard to do with PHP; you just need to be aware of it).

Also, the file reference needs to include a trailing parameter indicating that this is an XML grammar (;type=application/grammar-xml). This seems to be true even if the grammar file is served with the correct MIME type by whatever is serving it.
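To make those two requirements concrete, here is a minimal sketch of how the grammar reference could be assembled. The helper name grammar_url(), the host and path values, and the 'choices' key are all assumptions made for illustration; they are not taken from the demo code.

```php
<?php
// Hypothetical helper: build an absolute grammar URL with the trailing
// type parameter described above.
function grammar_url(string $scheme, string $host, string $scriptDir, string $file): string
{
    // Trim slashes so the path never contains an accidental "//".
    $dir  = trim($scriptDir, '/');
    $path = ($dir === '' ? '' : $dir . '/') . ltrim($file, '/');

    // The ";type=..." suffix marks the reference as an XML grammar.
    return $scheme . '://' . $host . '/' . $path . ';type=application/grammar-xml';
}

// Example values only; a real script would likely derive these from $_SERVER.
// The 'choices' key is an assumed name for the option that carries the grammar.
$options = array(
    'choices' => grammar_url('http', 'example.com', '/directory/', 'grammar.php'),
);

echo $options['choices'], PHP_EOL;
```

Building the URL in one place keeps the absolute path and the type suffix from being forgotten when the script moves to a different server.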

Now let’s have a look at this grammar file.

This simplistic example demonstrates how to use the PHPGrammar library. Note the simple array structure that is being used to hold the details of employees for our fictitious company. This could very easily be replaced with a dip into a data source of pretty much any kind, like an LDAP directory or database holding employee details.

Also note in this example that we want to do something referred to as Semantic Interpretation. Our grammar file is a set of rules that will be applied to what the caller says – Semantic Interpretation (SI) dictates the value that is given to our application from the grammar when a successful match occurs.

In this example, we want the caller to be able to say the name of the person they want to be transferred to. We make the first name optional, so they may say either just the last name or the full name. Obviously this may need to change depending on the size of the directory being rendered into the grammar file (e.g., multiple employees with the same last name).
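For readers who want to see what such a grammar might look like on the wire, here is a rough sketch that builds one by hand from an employee array. It deliberately does not use the PHPGrammar library (whose API is not shown here), the employee names and numbers are invented, and the function name build_directory_grammar() is hypothetical. The tag-format attribute follows the W3C convention for script-style tags; whether Tropo requires it is an assumption.

```php
<?php
// Hypothetical directory data; the names and numbers are invented for illustration.
$employees = array(
    array('first' => 'Alice', 'last' => 'Example', 'phone' => '2025550101'),
    array('first' => 'Bob',   'last' => 'Sample',  'phone' => '2025550102'),
);

// Hand-rolled SRGS builder; this is NOT the PHPGrammar API.
function build_directory_grammar(array $employees): string
{
    // The "?" and ">" are split so the string never resembles a PHP close tag.
    $xml  = '<?xml version="1.0" encoding="UTF-8"?' . ">\n";
    $xml .= '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"' . "\n";
    $xml .= '         xml:lang="en-US" mode="voice" root="employee"' . "\n";
    $xml .= '         tag-format="semantics/1.0">' . "\n";
    $xml .= '  <rule id="employee" scope="public">' . "\n";
    $xml .= "    <one-of>\n";

    foreach ($employees as $e) {
        // Escape names for XML; keep only the digits of the phone number.
        $first = htmlspecialchars($e['first'], ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $last  = htmlspecialchars($e['last'], ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $phone = preg_replace('/\D+/', '', $e['phone']);

        // repeat="0-1" makes the first name optional; the tag returns the number.
        $xml .= '      <item><item repeat="0-1">' . $first . '</item> ' . $last
              . '<tag>out="' . $phone . '";</tag></item>' . "\n";
    }

    $xml .= "    </one-of>\n";
    $xml .= "  </rule>\n";
    $xml .= "</grammar>\n";
    return $xml;
}

$grammar = build_directory_grammar($employees);
echo $grammar;
```

Each employee becomes one item in a one-of list, so swapping the array for rows from a database or an LDAP query would only change how $employees is filled.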

Do note that the Tropo platform seems to require the “Script” syntax for returning SI values on a successful match as opposed to the “String Literal” syntax. (More on these alternatives here.)

Works on Tropo (Script syntax):
<item>foo<tag>out="bar";</tag></item>

Does not work on Tropo (String Literal syntax):
<item>foo<tag>bar</tag></item>

So, when a caller says the name of a person in our company directory we want to return the number for that person to our Tropo script so we can transfer the call to them. This can clearly be seen when we examine the Result object that is delivered by the Tropo platform.

Tropo’s Result object includes the full grammar engine output, and lots of very detailed information about the recognition. As you can see, the utterance that the speech recognition engine heard was the name of one of our faux employees. The value that was returned is the number of that person.
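As a rough illustration of pulling those two pieces of information out of a result, here is a sketch that decodes a JSON payload directly. The payload is a made-up stand-in, not captured Tropo output, and its field names (actions, disposition, utterance, value) are assumptions; the helper extract_selection() is hypothetical and is not part of the PHP WebAPI library.

```php
<?php
// Made-up stand-in for a recognition result; NOT captured Tropo output.
// The field names used here are assumptions for illustration.
$json = '{"result":{"actions":{"name":"directory","disposition":"SUCCESS",'
      . '"utterance":"alice example","value":"2025550101"}}}';

// Hypothetical helper: return array(utterance, value), or null when nothing matched.
function extract_selection(string $json): ?array
{
    $data   = json_decode($json, true);
    $action = $data['result']['actions'] ?? null;

    // Treat anything other than a successful match as "no selection".
    if (!is_array($action) || ($action['disposition'] ?? '') !== 'SUCCESS') {
        return null;
    }
    return array($action['utterance'] ?? '', $action['value'] ?? '');
}

$selection = extract_selection($json);
if ($selection !== null) {
    echo 'Heard "', $selection[0], '", returning ', $selection[1], PHP_EOL;
}
```

In the demo itself the library’s Result class does this work; the sketch only shows the idea that the utterance is what the caller said and the value is what the grammar returned.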

We use this value in the transfer_call() method of our Tropo script.


// Create a new instance of the Result object.
$result = new Result();

// Get the value of the selection the caller made.
$phone = $result->getValue();

// Create a new instance of the Tropo object and transfer the call.
$tropo = new Tropo();
$tropo->transfer('+1'.$phone);

// Write out the JSON for Tropo to consume.
$tropo->RenderJson();

Using the PHP WebAPI library, it takes just 5 lines of code (excluding comments) to get the value of the grammar result and transfer the call. How cool is that?!

Obviously there are lots of things that can be done to enhance this script and make it more robust, but it illustrates the essential concepts of speech recognition in the cloud.
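One example of such hardening is checking the value that comes back from the grammar before handing it to transfer(). The sketch below assumes the directory holds 10-digit North American numbers, which matches the '+1' prefix used above; the helper name to_nanp_number() is hypothetical.

```php
<?php
// Hypothetical helper: turn a grammar value into a "+1" number,
// or return null if it does not look like a 10-digit number.
function to_nanp_number(?string $value): ?string
{
    if ($value === null) {
        return null;
    }

    // Drop everything except digits (spaces, dashes, parentheses).
    $digits = preg_replace('/\D+/', '', $value);

    // Accept a leading country code of 1 and strip it.
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }

    return strlen($digits) === 10 ? '+1' . $digits : null;
}

// Example values only.
var_dump(to_nanp_number('(202) 555-0101'));
var_dump(to_nanp_number('12345'));
```

With a check like this, transfer_call() could transfer only when the result is not null and otherwise fall back to re-prompting the caller, instead of trying to dial an empty or malformed number.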

What’s more, because of all of the great functionality provided by the Tropo cloud platform we can really push the envelope on the tired old company directory:

  • We could take an inbound call from a Skype user and transfer to a cell phone (or a SIP endpoint).
  • We could let our caller select a department in our company and then ring several different numbers at once, transferring the call to the first one answered (sort of a “hunt group in the cloud”).
  • We could use Tropo’s built-in IM capabilities to send a screen pop to the person receiving the call.

The sky is the limit. Which I guess is the point of cloud telephony…

How Will Technology Change by 2020?

It’s a good question — one that the people at the Pew Internet and American Life Project posed to a number of technology leaders. You can read their thoughts on where technology is going here.

Many of the respondents are optimistic about the future of speech recognition, predicting its wide adoption and broad use in our daily lives by 2020. Others are not so optimistic:

“Voice will continue to be the most over-sold, over-hyped, but unused interface,” noted Walt Dickie, executive vice president and CTO for C&R Research.

“Voice recognition has been a holy grail of computing since ‘Star Trek’ in the 1960s,” wrote Charles Ess, a researcher on online culture and ethics based at Drury University in Springfield, Missouri, a leader of the Association of Internet Researchers. “Like the artificial intelligence that was supposed to make it happen…it has faltered for a host of reasons, beginning with technical ones. Perhaps there will be some sort of technological breakthrough in the next few years that will make voice-recognition workable and affordable – but I’m not optimistic.”

Perhaps the issue is less about speech recognition technology advancing to the point where we can carry on conversations with helpful machines just as we do with other humans (a la Star Trek) and more about adjusting our expectations for the role that speech recognition can play in how we interact with devices.

I’ve always thought that speech could most effectively play a complementary role to other types of user interfaces (keyboard, mouse, stylus, touch screen).

I predict that by 2020 we will have better aligned our expectations of how speech recognition technology can be used, and that we will have more fully embraced the notion of multimodality.