5.3. Text-to-Speech (TTS)

Text-to-speech is the conversion of written text into spoken words by means of speech synthesis. In our Asterisk system, this means that an external program generates a sound file from a given text (usually in ASCII format). The resulting sound file is played back like any other sound file, and the caller hears the text spoken aloud.
The quality of text-to-speech engines varies widely. As a rule of thumb, the open source engines are not as sophisticated as the commercial ones.

Tip

Sometimes you can test high-quality engines through web portals. IBM offers a test portal for its TTS engine at http://www.ibm.com/software/pervasive/tech/demos/tts.shtml.
The TTS engine Festival (http://www.cstr.ed.ac.uk/projects/festival/) is a widely used open source engine, but the voices included with it often lack the quality needed for professional use, particularly if you need voices in languages other than English. Many Asterisk developers use the engine and voices sold commercially by Cepstral (http://www.cepstral.com/). As of this writing, the pricing is reasonable.[26] The solution described here builds on the Cepstral engine.[27]

Installing Cepstral Text-to-Speech

Download the voice from http://www.cepstral.com/downloads/. The file (Cepstral_David_i386-linux_4.2.0.tar.gz in this example) is installed with the following commands:
tar xvzf Cepstral_David_i386-linux_4.2.0.tar.gz
cd Cepstral_David_i386-linux_4.2.0
./install
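The directory to change into is the archive name without its .tar.gz suffix. A minimal shell sketch of that step, assuming the tarball unpacks into a single top-level directory of that name (the variable names are illustrative only):

```shell
# Illustrative variable names; the archive is the one used in this example.
archive="Cepstral_David_i386-linux_4.2.0.tar.gz"

# Strip the .tar.gz suffix to get the directory name
# (assumption: the tarball contains one top-level directory named this way).
dir="${archive%.tar.gz}"

echo "$dir"
# prints: Cepstral_David_i386-linux_4.2.0
```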

Examples and tests

The engine installs to /opt/swift/bin/swift unless otherwise specified. You can test the installation from the command line as follows:
/opt/swift/bin/swift -o /tmp/test.wav -p audio/sampling-rate=8000,audio/channels=1 "This is a test."
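Before playing the file, it can be useful to confirm that the engine really wrote a WAV file. The following is a rough sanity check, not part of Cepstral or Asterisk: the helper name is_wav is hypothetical, and it only looks for the RIFF and WAVE markers at the start of the file, not at the sampling rate or channel count.

```shell
# is_wav FILE: succeed if FILE is non-empty and starts with a RIFF/WAVE header.
# (Hypothetical helper; it does not inspect the audio format itself.)
is_wav() {
    [ -s "$1" ] || return 1                              # missing or empty file
    riff=$(dd if="$1" bs=1 count=4 2>/dev/null)          # bytes 0-3
    wave=$(dd if="$1" bs=1 skip=8 count=4 2>/dev/null)   # bytes 8-11
    [ "$riff" = "RIFF" ] && [ "$wave" = "WAVE" ]
}

# Example call after running the swift command:
#   is_wav /tmp/test.wav && echo "looks like a WAV file"
```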
You can play the resulting file with any audio player, or through Asterisk. For the latter, add a few lines to extensions.conf. Note that Playback() takes the file name without its extension:
exten => 1234,1,Answer()
exten => 1234,2,Playback(/tmp/test)
exten => 1234,3,Hangup()
To generate some speech output from within Asterisk, we use the System() application in the dialplan. Here is an example:
exten => 1222,1,Answer()
exten => 1222,2,System(rm -f /tmp/test.wav)
exten => 1222,3,System(/opt/swift/bin/swift -o /tmp/test.wav -p audio/sampling-rate=8000,audio/channels=1 "Another test.")
exten => 1222,4,Playback(/tmp/test)
exten => 1222,5,Hangup()
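The two System() steps (remove the old file, then synthesize a new one) can also be collected into a small shell helper. This is only a sketch, assuming the default install path given above; the function name tts_to_wav and the SWIFT_BIN override are hypothetical and not part of Cepstral or Asterisk.

```shell
# Path to the Cepstral binary; defaults to the install location named above.
# SWIFT_BIN is a hypothetical override, useful for testing.
SWIFT_BIN="${SWIFT_BIN:-/opt/swift/bin/swift}"

# tts_to_wav OUTFILE TEXT
# Remove any stale OUTFILE, then synthesize TEXT as 8 kHz mono audio.
tts_to_wav() {
    out="$1"
    text="$2"
    rm -f "$out"
    "$SWIFT_BIN" -o "$out" -p audio/sampling-rate=8000,audio/channels=1 "$text"
}

# Example call (requires swift to be installed):
#   tts_to_wav /tmp/test.wav "Another test."
```

Passing the output path as an argument also makes it easy to use a different file per call, which avoids two simultaneous callers overwriting the same /tmp/test.wav.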

Pauses in text

Cepstral uses SSML (Speech Synthesis Markup Language) in its engine. You can add speech pauses to the output by specifying them as in this example:
exten => 1222,1,Answer()
exten => 1222,2,System(rm -f /tmp/test.wav)
exten => 1222,3,System(/opt/swift/bin/swift -o /tmp/test.wav -p audio/sampling-rate=8000,audio/channels=1 "Another test. <break time='2500ms'/> Done!")
exten => 1222,4,Playback(/tmp/test)
exten => 1222,5,Hangup()
Learn more about the SSML standard at http://www.w3.org/TR/speech-synthesis/.
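When the pause length varies, the break element can be generated rather than typed by hand. A small shell sketch follows; the helper name ssml_break is hypothetical, and the tag it prints has the same form as in the example above.

```shell
# ssml_break MS: print an SSML <break> element for a pause of MS milliseconds.
# (Hypothetical helper name.)
ssml_break() {
    printf "<break time='%sms'/>" "$1"
}

# Build the text that would be passed to swift as its last argument.
text="Another test. $(ssml_break 2500) Done!"
echo "$text"
# prints: Another test. <break time='2500ms'/> Done!
```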


[26] Like IBM, Cepstral has a demo portal at http://www.cepstral.com/demos/.

[27] For those who have worked with Festival before, these instructions are easily modified to work with it. This applies to other TTS engines as well. The implementation model is the same.