Use TTS when it is difficult or impossible to predict what the application will need to say
In IVR design, there is a distinction between bounded and unbounded text. Text is bounded when it is predictable, making it possible to play it with pre-recorded voice segments. Unbounded text is unpredictable, e.g., future book or movie titles, or tomorrow's headlines and email messages.
For commercial IVR applications, when speech is bounded, designers should plan to use professionally recorded voice segments; when it is unbounded, there is no choice but to use TTS.
Another use of TTS is during early development, e.g., when building VoiceXML prototypes for initial usability testing. This is really another example of unbounded text, because the final wording of the prompts and messages has not yet been determined.
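The bounded/unbounded distinction maps directly onto playback logic. A minimal sketch (the directory layout, file naming, and `synthesize` callback are hypothetical, not a specific IVR platform's API): play the studio recording when one exists, and fall back to TTS otherwise.

```python
from pathlib import Path

# Hypothetical location for professionally recorded prompt segments.
RECORDINGS_DIR = Path("prompts/recorded")

def audio_for(segment_id: str, text: str, synthesize) -> bytes:
    """Return audio for a prompt segment.

    Bounded text has a studio recording on disk; unbounded text
    (headlines, titles, email) falls back to runtime TTS.
    """
    recording = RECORDINGS_DIR / f"{segment_id}.wav"
    if recording.exists():
        return recording.read_bytes()   # professional recording (bounded)
    return synthesize(text)             # TTS fallback (unbounded)
```

During prototyping, when wording is still in flux, the same function works with no recordings on disk at all, so every prompt is rendered with TTS until the scripts are frozen and recorded.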
Avoid using TTS to refer to callers by name
Modern TTS synthesizers tend to have high pronunciation accuracy (e.g., ranging from 98.9% to 99.6% in Evans et al., 2006). An exception to this is the pronunciation of proper names.
Just as many call centers do in their scripting, there is often a desire to use customers’ names in IVR dialogs. In the United States, however, there are more than 2,000,000 surnames and more than 100,000 first names (Spiegel, 2003a), so it is attractive to produce proper names with TTS rather than recording them all. Unfortunately, English TTS engines have well-documented difficulties interpreting proper names (Damper & Soonklang, 2007; Henton, 2003; Spiegel, 2003a, 2003b).
“I know of no research on the magnitude of the adverse social effects of mispronouncing customer names, but mispronunciation would certainly do nothing to improve the relationship between a customer and an enterprise” (Lewis, 2011). Estimates of pronunciation accuracy for proper names produced by general (untuned) TTS engines range from about 50% (Henton, 2003) to 60−70% (Damper & Soonklang, 2007) to 70−80% (Spiegel, 2003b).
As an aside, an automated system that addresses a caller by name runs another risk: the person whose name is on the account may not be the one calling. For instance, a family's utilities may have been set up by one person, but a different member of the household might be making the call. John Smith probably wouldn't be thrilled about being greeted as Mary. Even though he would know what caused the mistake, it would only remind him that he's talking to a “stupid machine.”
Names have uses other than addressing the caller. For insurance benefits, for example, a family may share a single plan, and while going through claims it makes sense to state the name on each claim, as well as the name of the doctor or provider. In these situations, callers' tolerance for mispronunciation will be higher than when the system addresses them directly.
If you must address the caller by name, be prepared to make a substantial investment in crafting this part of the design. Spiegel (2003a, 2003b) reported that after a 15-year research effort to improve proper name pronunciation, their specially tuned system achieved 99% correct pronunciation for common names and 92−94% for uncommon names, evidence that correctly pronouncing proper names is a difficult but solvable design problem.
Generally prefer concatenative to formant TTS
There are two distinct TTS technologies: formant (speech generation by rule) and concatenative (speech generation through the combination of snippets of recorded speech). Each has its advantages and corresponding disadvantages (Aaron, Eide, & Pitrelli, 2005; Klatt, 1987; Németh, Kiss, Zainkó, Olaszy, & Tóth, 2008).
Advantages of formant TTS include faster responsiveness, smaller footprint, and more independent control over speech characteristics such as speed and pitch.
The primary advantage of modern concatenative TTS is its more natural sound.
To achieve a high degree of pronunciation accuracy, TTS technologies require maintenance of exception dictionaries and cleaning of input text (e.g., should “Dr.” be read as “Doctor” or “Drive”?); note that the specific exception entries and cleaning rules will vary as a function of language.
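A sketch of both maintenance tasks, assuming English-specific rules and invented dictionary entries (the respellings shown are illustrative, not authoritative phonetics): a regex pass disambiguates “Dr.” by context, and an exception dictionary substitutes respellings for names the engine would otherwise mangle.

```python
import re

# Hypothetical exception dictionary: spellings whose default TTS
# pronunciation is wrong, mapped to phonetic respellings.
EXCEPTIONS = {
    "Nguyen": "nwin",
    "Siobhan": "shih-VAWN",
}

def clean_input(text: str) -> str:
    """Expand 'Dr.' as 'Doctor' (before a name) or 'Drive' (after one).

    These rules are English-specific; other languages need their own.
    """
    # "Dr. Smith" -> "Doctor Smith" (followed by a capitalized word)
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # "1200 Elm Dr." -> "1200 Elm Drive" (preceded by a lowercase letter)
    text = re.sub(r"(?<=[a-z])\s+Dr\.", " Drive", text)
    return text

def apply_exceptions(text: str) -> str:
    """Replace known-problem spellings before handing text to the engine."""
    for spelling, respelling in EXCEPTIONS.items():
        text = re.sub(rf"\b{spelling}\b", respelling, text)
    return text
```

Real deployments layer many more rules (numbers, currency, addresses), and the regexes above break on sentences like “Elm Dr. Smith”; the point is that both the dictionary and the rules are living artifacts that need ongoing, per-language maintenance.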
Minimize the number of juxtapositions between TTS and recorded speech
The research on the effects of mixing TTS and recorded speech is difficult to translate into specific guidelines (Lewis, 2011). It is common to use professionally recorded voice segments for all bounded text and TTS as a fallback for unbounded text. It can be jarring, however, if there are a large number of immediate juxtapositions between recorded speech and TTS (Lewis, Commarford, & Kotan, 2006; McInnes, Attwater, Edgington, Schmidt, & Jack, 1999; Spiegel, 1997). To keep the playback as smooth as possible, try to minimize the number of juxtapositions and to place them at pause points, even if that means playing a little more text using TTS than the absolute minimum.
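One way to operationalize this guideline is a small playback planner. The sketch below is an illustrative heuristic, not a published algorithm: segments are tagged by source, adjacent same-source runs are merged, and a short recorded fragment sandwiched between two TTS spans is absorbed into TTS, trading a little extra synthesized text for fewer voice switches (the two-word threshold is an arbitrary assumption).

```python
def plan_playback(segments):
    """Reduce recorded/TTS juxtapositions in a prompt.

    segments: list of (text, source) tuples, source in {"rec", "tts"}.
    Returns a shorter list with fewer switches between sources.
    """
    # Pass 1: merge adjacent segments from the same source.
    merged = []
    for text, source in segments:
        if merged and merged[-1][1] == source:
            merged[-1] = (merged[-1][0] + " " + text, source)
        else:
            merged.append((text, source))
    # Pass 2: absorb a short recorded fragment between two TTS spans.
    out, i = [], 0
    while i < len(merged):
        text, source = merged[i]
        if (source == "rec" and len(text.split()) <= 2
                and out and out[-1][1] == "tts"
                and i + 1 < len(merged) and merged[i + 1][1] == "tts"):
            out[-1] = (out[-1][0] + " " + text + " " + merged[i + 1][0], "tts")
            i += 2
        else:
            out.append((text, source))
            i += 1
    return out
```

For a prompt like “Your flight to / Ljubljana / departs at / 9:40” (recorded, TTS, recorded, TTS), this reduces three voice switches to one, and the remaining switch can be placed at a natural pause point.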
Consider pre-generation of TTS files on a regular basis
Depending on the architecture and products involved, there can be a noticeable lag when generating TTS on the fly. If you have unbounded text that changes regularly but not in real time, it may be worthwhile to automate a process that periodically (say, nightly or weekly, depending on the nature of the data) generates ready-to-play audio files for those messages.
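Such a batch job can be a few lines of scripting. In this sketch the cache directory, file naming, and `synthesize` callback are all assumptions; a content hash in the file name means unchanged messages are skipped on subsequent runs, so only new or edited text is re-synthesized.

```python
import hashlib
from pathlib import Path

# Hypothetical output directory served to the IVR at runtime.
CACHE = Path("tts-cache")

def pregenerate(messages, synthesize):
    """Batch-generate TTS audio for messages that changed since last run.

    messages: dict of message_id -> text; synthesize(text) -> audio bytes.
    Intended to run from a scheduler (nightly/weekly), so callers
    never wait on synthesis.
    """
    CACHE.mkdir(exist_ok=True)
    generated = []
    for msg_id, text in messages.items():
        digest = hashlib.sha256(text.encode()).hexdigest()[:16]
        wav = CACHE / f"{msg_id}-{digest}.wav"
        if not wav.exists():                 # new message or changed text
            wav.write_bytes(synthesize(text))
            generated.append(msg_id)
    return generated
```

A cleanup step that deletes files whose hash no longer matches any current message would complete the cycle; the runtime simply plays the cached file, falling back to live TTS only if the batch has not caught up yet.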
Aaron, A., Eide, E., & Pitrelli, J. F. (2005). Conversational computers. Scientific American, 292(6), 64–69.
Damper, R. I., & Soonklang, T. (2007). Subjective evaluation of techniques for proper name pronunciation. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2213–2221.
Evans, D. G., Draffan, E. A., James, A., & Blenkhorn, P. (2006). Do text-to-speech synthesizers pronounce correctly? A preliminary study. In K. Miesenberger et al. (Eds.), Proceedings of ICCHP (pp. 855–862). Berlin, Germany: Springer-Verlag.
Henton, C. (2003). The name game: Pronunciation puzzles for TTS. Speech Technology, 8(5), 32–35.
Klatt, D. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737–793.
Lewis, J. R. (2011). Practical speech user interface design. Boca Raton, FL: CRC Press, Taylor & Francis Group.
Lewis, J. R., Commarford, P. M., & Kotan, C. (2006). Web-based comparison of two styles of auditory presentation: All TTS versus rapidly mixed TTS and recordings. In Proceedings of the Human Factors and Ergonomics Society 50th annual meeting (pp. 723–727). Santa Monica, CA: Human Factors and Ergonomics Society.
McInnes, F., Attwater, D., Edgington, M. D., Schmidt, M. S., & Jack, M. A. (1999). User attitudes to concatenated natural speech and text-to-speech synthesis in an automated information service. In Proceedings of Eurospeech99 (pp. 831–834). Budapest, Hungary: ESCA.
Németh, G., Kiss, G., Zainkó, C., Olaszy, G., & Tóth, B. (2008). Speech generation in mobile phones. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems (2nd ed.) (pp. 163–191). New York, NY: Springer.
Spiegel, M. F. (1997). Advanced database preprocessing and preparations that enable telecommunication services based on speech synthesis. Speech Communication, 23, 51–62.
Spiegel, M. F. (2003a). Proper name pronunciations for speech technology applications. International Journal of Speech Technology, 6, 419–427.
Spiegel, M. F. (2003b). The difficulties with names: Overcoming barriers to personal voice services. Speech Technology, 8(3), 12–15.