Audio Control in VoiceXML

One of the questions I hear most from VoiceXML developers relates to audio control features in VoiceXML. VoiceXML does not natively support the ability to “rewind” and “fast forward” through audio files, but some vendors provide this functionality as extensions on their platforms.

While this is nice, I like to keep my code as portable as possible. Given that VoiceXML platforms are required to support the WAV audio format and this format is generally well understood, it should be relatively easy to construct server-side logic to simulate “rewind” and “fast forward” functionality.

For this example I use PHP, but I’m confident the approach could be replicated in many other languages. There are two key ingredients:

  • An understanding of the layout of a WAV file. We are most interested in two portions: the Subchunk2Size field (which holds the size of the audio data in bytes) and the audio data itself.
  • A platform that supports the VoiceXML 2.1 specification. If we’re going to implement these features, we need to be able to determine how far into the playback of an audio file a caller was when they issued a command. The <mark> element does this nicely, and it is supported on the Voxeo Prophecy platform.
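To make the WAV layout concrete, here is a minimal sketch of locating Subchunk2Size and the start of the audio data. It is written in Python purely for illustration (the scripts discussed in this post are PHP), and the function name read_wav_layout and the dictionary keys it returns are hypothetical. It assumes an ordinary little-endian RIFF/WAVE file and walks the chunks instead of assuming a fixed header size.

```python
import struct


def read_wav_layout(stream):
    """Walk the RIFF chunks of a WAV stream and describe where the audio lives.

    Hypothetical helper for illustration only; assumes a little-endian
    RIFF/WAVE stream that supports read(), seek() and tell().
    """
    riff, _chunk_size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    layout = {}
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no data chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)

        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            (layout["audio_format"], layout["channels"],
             layout["sample_rate"], layout["byte_rate"],
             layout["block_align"], layout["bits_per_sample"]) = struct.unpack(
                "<HHIIHH", fmt[:16])
            stream.seek(chunk_size & 1, 1)  # chunks are padded to even sizes
        elif chunk_id == b"data":
            if "byte_rate" not in layout:
                raise ValueError("data chunk appeared before fmt chunk")
            layout["data_offset"] = stream.tell()  # first byte of audio data
            layout["data_size"] = chunk_size       # this is Subchunk2Size
            return layout
        else:
            # Skip chunks we do not care about (LIST, fact, and so on).
            stream.seek(chunk_size + (chunk_size & 1), 1)
```

The byte_rate value is what lets a time offset in seconds be converted into a byte offset inside the audio data.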

The approach is simple. Using the <mark> element and a preset interval for moving backwards and forwards (I use 5 seconds in this example), we “rebuild” the audio file each time a user issues a command. A PHP script takes the offset as a parameter, seeks to that point in the source file, and outputs the audio from there onward.
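The “rebuild” step might look roughly like the sketch below. It is in Python for illustration, not the PHP of the actual scripts, and the name wav_from_offset is hypothetical. It makes one loud assumption: the source file has the canonical 44-byte header (a 16-byte fmt chunk followed directly by the data chunk). Files with extra chunks would need the header parsed properly first.

```python
import struct

HEADER_SIZE = 44  # assumption: canonical header, no extra chunks before "data"


def wav_from_offset(wav_bytes, offset_seconds):
    """Return a new WAV whose audio starts offset_seconds into the source.

    Hypothetical helper for illustration; assumes a canonical 44-byte header.
    """
    header = wav_bytes[:HEADER_SIZE]
    if (len(header) < HEADER_SIZE or header[:4] != b"RIFF"
            or header[8:12] != b"WAVE" or header[36:40] != b"data"):
        raise ValueError("expected a canonical 44-byte WAV header")

    # ByteRate (bytes of audio per second) and BlockAlign (bytes per frame).
    byte_rate, block_align = struct.unpack("<IH", header[28:34])
    (data_size,) = struct.unpack("<I", header[40:44])  # Subchunk2Size
    data_size = min(data_size, len(wav_bytes) - HEADER_SIZE)

    # Convert seconds to bytes, snap to a whole frame, and clamp to the file.
    skip = int(offset_seconds * byte_rate)
    skip -= skip % block_align
    skip = max(0, min(skip, data_size))

    remaining = wav_bytes[HEADER_SIZE + skip:HEADER_SIZE + data_size]

    # Both size fields must describe the shortened audio, not the original.
    new_header = bytearray(header)
    struct.pack_into("<I", new_header, 4, 36 + len(remaining))  # ChunkSize
    struct.pack_into("<I", new_header, 40, len(remaining))      # Subchunk2Size
    return bytes(new_header) + remaining
```

Snapping the byte offset to a multiple of BlockAlign keeps playback from starting in the middle of a sample, which would otherwise turn 16-bit or stereo audio into noise.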

For example, if a caller listens to an audio file for 10 seconds and then issues a command to “fast forward” we need to have our audio file begin playing at a point 15 seconds from the beginning of the file (10 + 5 = 15).

If the caller then listens to the file for 10 more seconds, and issues a command to rewind, we need to play the file starting at an offset of 20 seconds (15 + 10 = 25; 25 – 5 = 20). Using the PHP “fread” and “unpack” functions, we can read in the contents of an audio file on the fly, based on how long the caller has listened to the file and the command they have issued.
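The offset bookkeeping in the two examples above can be sketched as a small function. This is a Python illustration rather than the PHP used in the scripts; the function name, the command strings, and the decision to clamp at zero when a rewind would go past the start of the file are assumptions of mine.

```python
SKIP_SECONDS = 5  # preset interval for rewind and fast forward


def next_offset(current_offset, seconds_listened, command, skip=SKIP_SECONDS):
    """Return the offset (in seconds) at which the rebuilt file should start.

    current_offset   -- where the file being played started, in seconds
    seconds_listened -- how long the caller listened before the command
    command          -- "fast forward" or "rewind" (hypothetical labels)
    """
    position = current_offset + seconds_listened  # where the caller is now
    if command == "fast forward":
        position += skip
    elif command == "rewind":
        position -= skip
    else:
        raise ValueError("unknown command: %r" % (command,))
    return max(0, position)  # assumption: never seek before the start


print(next_offset(0, 10, "fast forward"))  # 15
print(next_offset(15, 10, "rewind"))       # 20
```

Note that the offset is always relative to the original source file, so the offset of the file currently playing has to be carried along from one request to the next.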

A pair of very simple scripts can be downloaded here. Please bear in mind that these are still very rough, but they do demonstrate the basic idea.

Any comments or suggestions are welcome.

Note — this example is built on ideas and code originally obtained from Snippler. The example that I used can be found here.