Confidence-Based Recognition in VoiceXML

Introduction

In the February 2005 issue of VoiceXML Review (the “Speak & Listen” column), Matt Oshry of Tellme Networks provides an excellent overview of a technique referred to as “confidence-based confirmation,” or “confidence-based recognition.” This approach is highly valuable in applications that have unique or sophisticated grammars, or where a high level of recognition confidence is essential (e.g., telephone-based voting systems).

This article takes a slightly different approach to the technique by structuring the dialog used to confirm user input as a subdialog, making it (in my opinion) a bit easier to reuse in different parts of an application. I’m also going to demonstrate a new piece of functionality in the developing VoiceXML 2.1 specification that lets developers capture the audio of a caller utterance while a recognition is performed. If you’ve read my paper on telephone-based voting systems, you’ll understand why I believe that this functionality is central to VoiceXML-enabled telephone voting.

By way of a little background, there are other ways to raise the required level of confidence for recognitions in VoiceXML applications. For example, the “confidencelevel” property allows developers to specify the degree of accuracy needed for recognitions in a VoiceXML application. Changing the value of this property from its default setting of 0.5 (50 percent confidence) to 0.8 requires a higher degree of accuracy before a successful recognition will occur:

<property name="confidencelevel" value="0.8"/>

However, this is a rather blunt instrument to use if you require a higher degree of accuracy only for very specific recognitions. Changing the value of this property raises the required confidence level for all recognitions within the scope where the property is set. If it is set within the application root document, it has application-level scope and applies to all recognitions in the application (probably not a good thing). Additionally, simply changing the value of this property does nothing to change the way a platform handles a “nomatch” event. You’ll still need to craft some custom event handlers for this, unless you want some frustrated callers.
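
To make the scoping point concrete, here is a minimal sketch of the property set inside a single field, paired with field-level nomatch handlers. The form and field names and the prompt wording are placeholders of my own; placing <property> inside <field> and using the “count” attribute on <nomatch> are both part of VoiceXML 2.0, but treat the listing as an illustration rather than tested platform code.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

<form id="ballot">

  <field name="candidate">
    <!-- Applies only to recognitions in this field -->
    <property name="confidencelevel" value="0.8"/>

    <grammar src="voteGrammar.grxml"/>
    <prompt>Please say the name of the candidate you want to vote for.</prompt>

    <!-- First rejection: apologize and replay the field prompt -->
    <nomatch count="1">
      I'm sorry, I didn't catch that.
      <reprompt/>
    </nomatch>

    <!-- Second and later rejections: give more specific guidance -->
    <nomatch count="2">
      I still didn't understand. Please say just the candidate's last name.
    </nomatch>
  </field>

</form>

</vxml>
```

Other fields in the same application keep the platform default, so only this one question becomes stricter.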

Platform Support for VoiceXML 2.1

While support for the VoiceXML 2.1 specification is growing, I could identify only one major platform that supports capturing caller utterances during a recognition: BeVocal. Section 7 of the VoiceXML 2.1 specification states that in order to enable recording during recognition, the value of the “recordutterance” property must be set to ‘true’. Several other platforms have something close (and may have changes in the pipeline to support this part of the 2.1 specification), but they won’t fit our needs here:

Tellme – the Tellme platform has a proprietary extension called “tellme.field.recordutterance”. This extension does allow for the capturing of utterances when a recognition is performed. However, the Tellme documentation states that “The utterance is recorded regardless of whether the recognition succeeded or resulted in a nomatch.” While this is great if you are trying to fine-tune a grammar, it won’t do for our example here.
Voxeo – the Voxeo platform has a proprietary extension element that allows developers to designate a specific percentage of calls to be recorded for quality purposes. According to the Voxeo documentation, this element “allows the developer to record both sides of a call, recording the human and the application interaction to a wav file.” Again, this is probably very useful for analyzing application performance and the accuracy of recognitions, but it won’t work in this example.

A Subdialog for Confidence-Based Recognition

Structuring the confidence-based recognition example as a subdialog looks something like the example below.

<?xml version="1.0" ?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

<!-- Event handler for noinput and nomatch events -->
<catch event="noinput nomatch">
I'm sorry. I didn't get that.
<reprompt/>
</catch>

<!-- Threshold for confidence-based recognition -->
<var name="thresh" expr="0.75"/>

<!-- Enable audio capture -->
<property name="recordutterance" value="true"/>

<form id="choose">

<field name="main">
<prompt>Please select the candidate you want to vote for.</prompt>

<grammar src="voteGrammar.grxml"/>

<filled>
<!-- Ask for confirmation only when confidence is below the threshold -->
<if cond="main$.confidence &lt; thresh">

<subdialog name="sure" src="makeSure.vxml">
  <param name="vote" expr="main$.utterance"/>
</subdialog>
 <if cond="sure.isRight=='true'">
 <log>Voter confirmed selection for recognition less than
 <value expr="100*(thresh)"/> percent.</log>
 <else/>
 <clear namelist="main"/>
 <goto nextitem="main"/>
 </if>
</if>
<prompt>Thank you.</prompt>
<submit next="process.php" namelist="main" method="post" enctype="multipart/form-data"/>
</filled>

</field>
</form>

</vxml>

There are several important elements here. First, note that I use a standalone variable to declare the level of confidence that I’ll be requiring on certain fields. This will make it easier to change the value across multiple fields if testing shows that a different confidence level is needed.

Within the <filled> element (which lets us inspect the result of a successful recognition on a field variable) we test whether the confidence level of the recognition meets our minimum threshold. If it does not, we call the subdialog contained in the file “makeSure.vxml” and pass it a string representing the recognized utterance. (Keep in mind, the platform’s default recognition level is also at work here: if the confidence level of the recognition falls below this default threshold, a nomatch event will be thrown. For our test to be applied, the confidence level of the recognition must be at least this default threshold level.)

Without looking at the specifics of the subdialog just yet, we can see that if a value of ‘true’ is returned, we’ll log that the voter has confirmed their selection and then send their vote on for further processing. If any other value is returned from our subdialog, we’ll clear the field variable and start again. (The <goto> element in this part of the application may be a bit redundant, since the Form Interpretation Algorithm should ensure another visit to the field called “main” after its value has been cleared.)
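
For platforms that only accept <subdialog> as a form item (which is where VoiceXML 2.0 places it), the same logic can be sketched with a “cond” attribute instead of an <if> inside the field's filled handler. This is a variation under my own assumptions, not the listing tested above: it relies on the Form Interpretation Algorithm to return to “main” once both items are cleared, and it moves the submit into a trailing <block>.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

<!-- Threshold for confidence-based recognition -->
<var name="thresh" expr="0.75"/>

<form id="choose">

  <field name="main">
    <grammar src="voteGrammar.grxml"/>
    <prompt>Please select the candidate you want to vote for.</prompt>
  </field>

  <!-- Visited only when the recognition confidence is below the threshold -->
  <subdialog name="sure" src="makeSure.vxml"
             cond="main$.confidence &lt; thresh">
    <param name="vote" expr="main$.utterance"/>
    <filled>
      <if cond="sure.isRight != 'true'">
        <!-- Clearing both items sends the FIA back to "main" -->
        <clear namelist="main sure"/>
      </if>
    </filled>
  </subdialog>

  <!-- Reached once "main" is filled and no confirmation is pending -->
  <block>
    <prompt>Thank you.</prompt>
    <submit next="process.php" namelist="main"/>
  </block>

</form>

</vxml>
```

Because “main” comes first in document order, the FIA selects it whenever it is undefined, so no explicit <goto> is needed after the clear.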

Now let’s have a look at the subdialog that confirms a voter’s selection:

<?xml version="1.0" ?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

<form id="makeSure">
<!-- "vote" is filled in by the caller's <param>; "isRight" is returned -->
<var name="vote"/>
<var name="isRight"/>

 <field name="confirm" type="boolean">
 <noinput>I'm sorry.  I did not hear what you said.<reprompt/></noinput>
 <nomatch><reprompt/></nomatch>
   <prompt>I think I heard you vote for, <value expr="vote"/>.  Is this correct?</prompt>

   <filled>
    <if cond="confirm">
    <assign name="isRight" expr="'true'"/>
    <else/>
    <assign name="isRight" expr="'false'"/>
    </if>
    <return namelist="isRight"/>
   </filled>

 </field>
</form>

</vxml>

As you can see, this is a simple form that takes in the parameter passed from the previous dialog (“vote”). It uses a very simple boolean grammar (one of the built-in grammar types in VoiceXML) to confirm a voter’s selection. If a voter confirms the value passed to the subdialog, the variable “isRight” is set to ‘true’ and returned to the VoiceXML page that referenced the subdialog. Otherwise, the value of this variable is set to ‘false’.

What I like about this approach is that it makes it pretty straightforward to apply a higher level of confidence to specific fields within an application, simply by applying a quick test on each successful recognition and making a reference to the subdialog. You don’t have to rewrite or copy and paste a bunch of lines in different fields within the application.

Now let’s look at our simple processing script. This is the document referenced in our main dialog; it receives the value of the voter’s input after a sufficiently confident recognition has occurred.

<?php

// Get posted vote value
$_vote = htmlspecialchars($_REQUEST['main']);

// Emit the XML declaration from PHP so "<?xml" is not parsed as a PHP open tag
echo '<?xml version="1.0"?>';
echo "\n";

?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml application="confidenceTest.vxml" version="2.0" xmlns="http://www.w3.org/2001/vxml">

<form id="playBack">
<block>
<prompt>
You voted for <?php echo $_vote ?>, <audio expr="application.lastaudio$"/>.
<break size="small"/>
Goodbye.
</prompt>
<disconnect/>
</block>
</form>

</vxml>

What’s important to note here is that I did not submit the audio of the captured utterance to the processing script. In order to do this, you need to use the [VoiceXML 2.1 <data> tag](http://www.w3.org/TR/2005/CR-voicexml21-20050613/#sec-data). However, submitting captured audio through the <data> tag does not perform a transition to a new dialog, which is something we want in order to tell the voter who they selected (or, alternatively, that their vote was processed). So I’ll take a slightly different approach here and reference the variable “application.lastaudio$”. I also reference our original dialog in the “application” attribute of the <vxml> element.

The “application.lastaudio$” variable is a BeVocal extension that “contains an audio capture of the user’s speech” when it “matches a field grammar, a form grammar, a menu choice, or a link grammar.” Obviously this won’t work on other platforms, but as support for the VoiceXML 2.1 specification improves, this approach can be modified to be platform independent.
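
For comparison, here is a rough sketch of how the platform-independent version might look on a platform that implements both “recordutterance” and <data> as described in the VoiceXML 2.1 Candidate Recommendation. I have not run this against a 2.1 platform. The “main$.recording” shadow variable comes from Section 7 of that specification, while “saveAudio.php”, “voteAudio” and “saveResult” are hypothetical names, and the script is assumed to answer with a small well-formed XML document, since <data> expects XML back.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">

<!-- Enable audio capture during recognition -->
<property name="recordutterance" value="true"/>

<form id="choose">
  <field name="main">
    <grammar src="voteGrammar.grxml"/>
    <prompt>Please select the candidate you want to vote for.</prompt>

    <filled>
      <!-- Copy the recording into a plain variable so it can be named in a namelist -->
      <var name="voteAudio" expr="main$.recording"/>

      <!-- POST the vote and its audio without leaving this dialog (hypothetical URL) -->
      <data name="saveResult" src="saveAudio.php" namelist="main voteAudio"
            method="post" enctype="multipart/form-data"/>

      <!-- The dialog transition still happens through an ordinary submit -->
      <prompt>Thank you.</prompt>
      <submit next="process.php" namelist="main"/>
    </filled>
  </field>
</form>

</vxml>
```

Splitting the work this way keeps the audio upload and the dialog transition separate: <data> stores the recording, and <submit> moves the caller on to the playback page.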

Conclusion

Confidence-based recognition, along with some of the new features of the VoiceXML 2.1 specification, provides a very powerful mechanism for confirming user selections in voice applications. This type of approach is essential for applications like telephone-based voting systems.

If you would like all of the files in this example (along with the grammar file used, but not discussed, here), you can get them all here. Any suggestions for improvement are welcome.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.