tts [2019/08/07 15:22]
lisa.illgen_concentrix.com Added Anchor Links
tts [2019/08/07 16:26] (current)
lisa.illgen_concentrix.com
Line 9: Line 9:
  
**//Avoid using TTS to refer to callers by name//**\\
Modern TTS synthesizers tend to have high pronunciation accuracy (e.g., ranging from 98.9% to 99.6% in [[references#evans|Evans et al., 2006]]). An exception to this is the pronunciation of proper names.
  
Many call centers use customers’ names in their scripting, and there is often a desire to do the same in IVR dialogs. In the United States, however, there are more than 2,000,000 surnames and more than 100,000 first names ([[references#spiegel2003a|Spiegel, 2003a]]), so it is attractive to plan to produce proper names using TTS rather than recording them all. English TTS engines have some difficulty pronouncing proper names ([[references#dampers2007|Damper & Soonklang, 2007]]; [[references#henton|Henton, 2003]]; Spiegel, [[references#spiegel2003a|2003a]], [[references#spiegel2003b|2003b]]), for example, due to:
  
    * Uncommon patterns of English letter sequences
Line 18: Line 18:
    * Product names that contain symbols other than English letters

"I know of no research on the magnitude of the adverse social effects of mispronouncing customer names, but mispronunciation would certainly do nothing to improve the relationship between a customer and an enterprise" ([[references#lewis2011|Lewis, 2011]]). Estimates of pronunciation accuracy for proper names produced by general (untuned) TTS engines range from about 50% ([[references#henton|Henton, 2003]]) to 60−70% ([[references#dampers2007|Damper & Soonklang, 2007]]) to 70−80% ([[references#spiegel2003b|Spiegel, 2003b]]).
  
As an aside, an automated system that addresses a caller by name runs another risk: the person whose name is on the account may not be the one calling. For instance, a family's utilities may have been set up by one person, but a different person in the household might make the call. John Smith probably wouldn't be thrilled about being greeted as Mary. Although he would know what caused the problem, it would highlight the fact that he's talking to a "stupid machine."
Line 24: Line 24:
There are uses for names other than addressing the caller. For insurance benefits, a family is often on a single plan, so while going through claims it makes sense to state the name on each claim, as it does to state a doctor's or provider's name. In these situations, the caller's tolerance for mispronunciation will be higher than when you address the caller directly.
  
If you must address the caller by name, be prepared to make a substantial investment in crafting this part of the design. Spiegel ([[references#spiegel2003a|2003a]], [[references#spiegel2003b|2003b]]) reported that after a 15-year research effort to improve proper name pronunciation, their specially tuned system had 99% correct pronunciation for common names and 92−94% for uncommon names. This shows that it is difficult but possible to solve the design problem of correctly pronouncing proper names.
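
To get a feel for what the accuracy figures cited above mean in practice, the following is a rough sketch of the arithmetic. The daily volume of 10,000 greetings and the assumption of exactly one name per greeting are hypothetical and chosen only for illustration; the function name is likewise made up for this example.

```python
def expected_mispronunciations(greetings: int, accuracy: float) -> int:
    """Expected number of mispronounced names, assuming one name per greeting."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(greetings * (1.0 - accuracy))


DAILY_GREETINGS = 10_000  # hypothetical volume, not taken from the cited studies

# Accuracy figures as cited in the text (endpoints of the reported ranges).
scenarios = {
    "untuned, low estimate (50%)": 0.50,
    "untuned, high estimate (80%)": 0.80,
    "tuned, uncommon names (92%)": 0.92,
    "tuned, common names (99%)": 0.99,
}

for label, accuracy in scenarios.items():
    print(label, expected_mispronunciations(DAILY_GREETINGS, accuracy))
```

Under these assumptions, an untuned engine would mispronounce between 2,000 and 5,000 names a day, and even a heavily tuned one would still mispronounce roughly 100 to 800.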
  
==== How to use TTS ====
**//Generally prefer concatenative to formant TTS//**\\
There are two distinct TTS technologies: formant (speech generation by rule) and concatenative (speech generation through the combination of snippets of recorded speech). Each has its advantages and corresponding disadvantages ([[references#aaron|Aaron, Eide, & Pitrelli, 2005]]; [[references#klatt|Klatt, 1987]]; [[references#németh|Németh, Kiss, Zainkó, Olaszy, & Tóth, 2008]]).
  
Advantages of formant TTS include faster responsiveness, smaller footprint, and more independent control over speech characteristics such as speed and pitch.
Line 37: Line 37:
  
**//Minimize the number of juxtapositions between TTS and recorded speech//**\\
The research on the effects of mixing TTS and recorded speech is difficult to translate into specific guidelines ([[references#lewis2011|Lewis, 2011]]). It is common to use professionally recorded voice segments for all bounded text and TTS as a fallback for unbounded text. It can be jarring, however, if there are many immediate juxtapositions between recorded speech and TTS ([[references#lewisc2006|Lewis, Commarford, & Kotan, 2006]]; [[references#mcinnesa1999|McInnes, Attwater, Edgington, Schmidt, & Jack, 1999]]; [[references#spiegel1997|Spiegel, 1997]]). To keep the playback as smooth as possible, try to minimize the number of juxtapositions and to place them at pause points, even if that means playing a little more text using TTS than the absolute minimum.
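
One way to apply this guideline when assembling a prompt is sketched below. It is a minimal illustration, not a method taken from the cited research: the ''Segment'' type, the ''pause_after'' flag, and both function names are assumptions made for this example. The idea is to split the prompt into phrases at pause points and, if any part of a phrase needs TTS, to render the whole phrase with TTS, so that every switch between recorded speech and TTS falls on a pause.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str
    source: str        # "recorded" or "tts"
    pause_after: bool  # True if a natural pause follows this segment


def count_juxtapositions(segments: List[Segment]) -> int:
    """Count the boundaries where playback switches between recorded speech and TTS."""
    return sum(1 for a, b in zip(segments, segments[1:]) if a.source != b.source)


def align_to_pauses(segments: List[Segment]) -> List[Segment]:
    """Regroup segments into phrases so that source switches occur only at pauses."""
    # Split the prompt into phrases, each ending at a pause point.
    phrases: List[List[Segment]] = []
    current: List[Segment] = []
    for seg in segments:
        current.append(seg)
        if seg.pause_after:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)

    # A phrase containing any TTS segment is rendered entirely with TTS.
    planned: List[Segment] = []
    for phrase in phrases:
        source = "tts" if any(s.source == "tts" for s in phrase) else "recorded"
        text = " ".join(s.text for s in phrase)
        planned.append(Segment(text, source, phrase[-1].pause_after))
    return planned


prompt = [
    Segment("Your next appointment is with", "recorded", False),
    Segment("Doctor Nemeth", "tts", False),
    Segment("on", "recorded", False),
    Segment("March third,", "tts", True),
    Segment("at the", "recorded", False),
    Segment("Main Street clinic.", "tts", True),
    Segment("To hear that again, press 1.", "recorded", True),
]

planned = align_to_pauses(prompt)
print(count_juxtapositions(prompt), "->", count_juxtapositions(planned))
```

In this hypothetical prompt, the original plan switches sources six times, often mid-phrase; the regrouped plan switches once, at a sentence boundary. The cost is that carrier phrases such as "Your next appointment is with" are now spoken by TTS, which is the trade-off the guideline describes.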
  
**//Consider pre-generation of TTS files on a regular basis//**\\