Differences

This shows you the differences between two versions of the page.

tts [2018/06/19 19:29]
miket_forty7ronin.com created
tts [2019/08/07 16:26] (current)
lisa.illgen_concentrix.com
Line 9:
  
 **//Avoid using TTS to refer to callers by name //**\\
-Modern TTS synthesizers tend to have high pronunciation accuracy (e.g., ranging from 98.9% to 99.6% in Evans et al., 2006). An exception to this is the pronunciation of proper names.
+Modern TTS synthesizers tend to have high pronunciation accuracy (e.g., ranging from 98.9% to 99.6% in [[references#evans|Evans et al., 2006]]). An exception to this is the pronunciation of proper names.
  
-Just as many call centers do in their scripting, there is often a desire to use customers’ names in IVR dialogs. In the United States, however, there are more than 2,000,000 surnames and more than 100,000 first names (Spiegel, 2003a), so it's attractive to plan to produce proper names using TTS rather than recording them all. English TTS engines have some difficulties interpreting proper names (Damper & Soonklang, 2007; Henton, 2003; Spiegel, 2003a, 2003b), for example, due to:
+Just as many call centers do in their scripting, there is often a desire to use customers’ names in IVR dialogs. In the United States, however, there are more than 2,000,000 surnames and more than 100,000 first names ([[references#spiegel2003a|Spiegel, 2003a]]), so it's attractive to plan to produce proper names using TTS rather than recording them all. English TTS engines have some difficulties interpreting proper names ([[references#dampers2007|Damper & Soonklang, 2007]][[references#henton|Henton, 2003]]; Spiegel, [[references#spiegel2003a|2003a]][[references#spiegel2003b|2003b]]), for example, due to:
  
     * Uncommon patterns of English letter sequences
Line 18:
     * Product names that contain symbols other than English letters
  
-"I know of no research on the magnitude of the adverse social effects of mispronouncing customer names, but mispronunciation would certainly do nothing to improve the relationship between a customer and an enterprise" (Lewis, 2011). Estimates of pronunciation accuracy for proper names produced by general (untuned) TTS engines range from about 50% (Henton, 2003) to 60−70% (Damper & Soonklang, 2007) to 70−80% (Spiegel, 2003b).
+"I know of no research on the magnitude of the adverse social effects of mispronouncing customer names, but mispronunciation would certainly do nothing to improve the relationship between a customer and an enterprise" ([[references#lewis2011|Lewis, 2011]]). Estimates of pronunciation accuracy for proper names produced by general (untuned) TTS engines range from about 50% ([[references#henton|Henton, 2003]]) to 60−70% ([[references#dampers2007|Damper & Soonklang, 2007]]) to 70−80% ([[references#spiegel2003b|Spiegel, 2003b]]).
  
 As an aside, an automated system addressing a caller by name has another risk, that the person whose name is on the account isn't the one calling. For instance, a family's utilities may have been set up by one person, but it might be a different person in the household making the call. John Smith probably wouldn't be thrilled about being greeted as Mary. Although he would know what caused the problem, it just highlights the fact that he's talking to a "stupid machine."
Line 24:
 There are other uses for names other than addressing the caller. For insurance benefits, a family on a single plan makes sense, and while going through claims, stating the name on the claim makes sense. As does stating a doctor or provider's name. In these situations the caller's tolerance for mispronunciation will be higher than if you are addressing them directly.
  
-If you must address the caller by name, be prepared to make a substantial investment in crafting this part of the design. Spiegel (2003a, 2003b) reported that after a 15-year research effort to improve proper name pronunciation, their specially tuned system had 99% correct pronunciation for common names and 92−94% for uncommon names. This shows that it is difficult but possible to solve the design problem of correctly pronouncing proper names.
+If you must address the caller by name, be prepared to make a substantial investment in crafting this part of the design. Spiegel ([[references#spiegel2003a|2003a]][[references#spiegel2003b|2003b]]) reported that after a 15-year research effort to improve proper name pronunciation, their specially tuned system had 99% correct pronunciation for common names and 92−94% for uncommon names. This shows that it is difficult but possible to solve the design problem of correctly pronouncing proper names.
  
 ==== How to use TTS ====
 **// Generally prefer concatenative to formant TTS //**\\
-There are two distinct TTS technologies: formant (speech generation by rule) and concatenative (speech generation through the combination of snippets of recorded speech). Each has its advantages and corresponding disadvantages (Aaron, Eide, & Pitrelli, 2005; Klatt, 1987; Németh, Kiss, Zainkó, Olaszy, & Tóth, 2008).
+There are two distinct TTS technologies: formant (speech generation by rule) and concatenative (speech generation through the combination of snippets of recorded speech). Each has its advantages and corresponding disadvantages ([[references#aaron|Aaron, Eide, & Pitrelli, 2005]][[references#klatt|Klatt, 1987]][[references#németh|Németh, Kiss, Zainkó, Olaszy, & Tóth, 2008]]).
  
 Advantages of formant TTS include faster responsiveness, smaller footprint, and more independent control over speech characteristics such as speed and pitch.
Line 37:
  
 **// Minimize the number of juxtapositions between TTS and recorded speech //**\\
-The research on the effects of mixing TTS and recorded speech is difficult to translate into specific guidelines (Lewis, 2011). It is common to use professionally recorded voice segments for all bounded text and TTS as a fallback for unbounded text. It can be jarring, however, if there are a large number of immediate juxtapositions between recorded speech and TTS (Lewis, Commarford, & Kotan, 2006; McInnes, Attwater, Edgington, Schmidt, & Jack, 1999; Spiegel, 1997). To keep the playback as smooth as possible, try to minimize the number of juxtapositions and to place them at pause points, even if that means playing a little more text using TTS than the absolute minimum.
+The research on the effects of mixing TTS and recorded speech is difficult to translate into specific guidelines ([[references#lewis2011|Lewis, 2011]]). It is common to use professionally recorded voice segments for all bounded text and TTS as a fallback for unbounded text. It can be jarring, however, if there are a large number of immediate juxtapositions between recorded speech and TTS ([[references#lewisc2006|Lewis, Commarford, & Kotan, 2006]][[references#mcinnesa1999| McInnes, Attwater, Edgington, Schmidt, & Jack, 1999]][[references#spiegel1997|Spiegel, 1997]]). To keep the playback as smooth as possible, try to minimize the number of juxtapositions and to place them at pause points, even if that means playing a little more text using TTS than the absolute minimum.
  
 **// Consider pre-generation of TTS files on a regular basis //**\\