Differences

This shows you the differences between two versions of the page.

prosody [2019/08/07 14:21]
lisa.illgen_concentrix.com Added Anchor Links
prosody [2019/08/08 10:29] (current)
lisa.illgen_concentrix.com
Line 8: Line 8:
 Pauses are the "white space" of auditory design and guide turn-taking. "The right word may be effective, but no word was ever as effective as a rightly timed pause" (M. Twain).
  
-Effective turn-taking drives effective conversation. In many cultures, an important cue that one person in a dialog has finished talking and expects a response from the other conversant is for that person to stop talking ([[references# BAL2|Balentine, 2006]]; [[references#BEA1|Beattie & Barnard, 1979]]; [[references#HEI1|Heins et al., 1997]]; [[references#JOH1|Johnstone et al., 1994]]; [[references#MAR1|Margulies, 2005]]; [[references#ROB1|Roberts et al., 2006]]; [[references#STI1|Stivers et al., 2009]]; [[references#WIL3|Wilson & Zimmerman, 1986]]). That isn’t to say that one speaker never barges into another, but if someone who has been talking stops and waits, it’s pretty clear that they’ve given up the conversational floor. Research on the timing of turn-taking during American English service-based conversations over the phone indicates that pauses shorter than 250 ms will rarely trigger turn-taking, whereas pauses longer than 1300 ms will almost certainly induce the non-speaking conversant to take the floor ([[references#BEA1|Beattie & Barnard, 1979]]; [[references#COM1|Commarford & Lewis, 2005]]). If this does not happen, it is indicative of a conversational problem requiring repair ([[references#ROB1|Roberts et al., 2006]]).
+Effective turn-taking drives effective conversation. In many cultures, an important cue that one person in a dialog has finished talking and expects a response from the other conversant is for that person to stop talking ([[references#balentine2006|Balentine, 2006]]; [[references#beattie|Beattie & Barnard, 1979]]; [[references#heins|Heins et al., 1997]]; [[references#johnstone|Johnstone et al., 1994]]; [[references#margulies2005|Margulies, 2005]]; [[references#roberts|Roberts et al., 2006]]; [[references#stivers|Stivers et al., 2009]]; [[references#wilson|Wilson & Zimmerman, 1986]]).
+That isn’t to say that one speaker never barges into another, but if someone who has been talking stops and waits, it’s pretty clear that they’ve given up the conversational floor. Research on the timing of turn-taking during American English service-based conversations over the phone indicates that pauses shorter than 250 ms will rarely trigger turn-taking, whereas pauses longer than 1300 ms will almost certainly induce the non-speaking conversant to take the floor ([[references#beattie|Beattie & Barnard, 1979]]; [[references#commarford|Commarford & Lewis, 2005]]). If this does not happen, it is indicative of a conversational problem requiring repair ([[references#roberts|Roberts et al., 2006]]).
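The two timing thresholds quoted above (250 ms and 1300 ms) can be read as a rough three-way classification of a pause. The following is only an illustrative sketch: the function name, the labels, and the treatment of the middle band as "ambiguous" are assumptions made for this example and are not taken from the cited studies.

```python
# Thresholds quoted in the text above (American English, service calls by phone).
RARELY_TRIGGERS_MS = 250        # shorter pauses rarely trigger turn-taking
ALMOST_ALWAYS_TRIGGERS_MS = 1300  # longer pauses almost certainly do


def classify_pause(duration_ms: float) -> str:
    """Label a pause by how likely it is to hand over the conversational floor.

    The labels "hold", "ambiguous", and "yield" are hypothetical names
    chosen for this sketch.
    """
    if duration_ms < 0:
        raise ValueError("pause duration cannot be negative")
    if duration_ms < RARELY_TRIGGERS_MS:
        return "hold"       # listener will rarely take the turn
    if duration_ms > ALMOST_ALWAYS_TRIGGERS_MS:
        return "yield"      # listener will almost certainly take the turn
    return "ambiguous"      # between the two thresholds


if __name__ == "__main__":
    for ms in (100, 750, 1500):
        print(ms, classify_pause(ms))
```

Under this reading, a turn-taking pause of about 750 ms falls in the middle band, between the two thresholds.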
  
 **// Turn-taking pauses should be about 750 ms //**\\
Line 17: Line 17:
  
 **// Set standard no input timeouts to 3-7 seconds //**\\
-For no input events (the system has stopped talking and is waiting for the user to pick up the conversation), the VoiceXML default of 7 seconds seems to work well in practice – shortening it to as little as 5 or 3 seconds also appears to work ([[references#MAR1|Margulies, 2005]]; [[references#YUS1|Yuschik, 2008]]). For special populations (non-native speakers, older adults) or tasks (getting a credit card number, performing steps to activate a cell phone), it’s reasonable to provide a longer timeout, anywhere from 10-30 seconds, ([[references#DUL1|Dulude, 2002]]) or to provide a pause/resume capability.
+For no input events (the system has stopped talking and is waiting for the user to pick up the conversation), the VoiceXML default of 7 seconds seems to work well in practice – shortening it to as little as 5 or 3 seconds also appears to work ([[references#margulies2005|Margulies, 2005]]; [[references#yuschik|Yuschik, 2008]]). For special populations (non-native speakers, older adults) or tasks (getting a credit card number, performing steps to activate a cell phone), it’s reasonable to provide a longer timeout, anywhere from 10-30 seconds, ([[references#dulude|Dulude, 2002]]) or to provide a pause/resume capability.
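As a minimal sketch of the guideline above, the helper below picks a no input timeout and renders it as a VoiceXML ''timeout'' property. The function names and the 15-second extended default are assumptions made for this example; only the 3-7 second and 10-30 second ranges come from the text.

```python
# Ranges taken from the guideline text; everything else is illustrative.
STANDARD_RANGE_S = (3.0, 7.0)    # standard no input timeout
EXTENDED_RANGE_S = (10.0, 30.0)  # special populations or demanding tasks


def noinput_timeout_s(needs_extra_time: bool = False,
                      standard_s: float = 7.0,
                      extended_s: float = 15.0) -> float:
    """Return a timeout in seconds, checked against the recommended range.

    extended_s=15.0 is an arbitrary pick inside the 10-30 s range,
    not a value recommended by the source.
    """
    if needs_extra_time:
        chosen, (low, high) = extended_s, EXTENDED_RANGE_S
    else:
        chosen, (low, high) = standard_s, STANDARD_RANGE_S
    if not low <= chosen <= high:
        raise ValueError(f"{chosen} s is outside the {low}-{high} s range")
    return chosen


def vxml_timeout_property(timeout_s: float) -> str:
    """Render the timeout as a VoiceXML <property> element."""
    return f'<property name="timeout" value="{timeout_s:g}s"/>'


if __name__ == "__main__":
    print(vxml_timeout_property(noinput_timeout_s()))
    print(vxml_timeout_property(noinput_timeout_s(needs_extra_time=True)))
```

Rejecting out-of-range values, rather than clamping them silently, is a design choice for this sketch; a real application may prefer to log and clamp.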
  
 ==== Pauses within prompts ====
Line 25: Line 25:
  
 **// Pause at least 500 ms between options in a menu //**\\
-Pauses should be inserted between menu items to allow for that processing. 500 ms between options is a good place to start. Menu items that are super short, clear, and distinct may not require 500 ms in between. Anecdotal evidence shows that menu items that are longer and more complex may take more time, say 750-1000 ms. A controlled study with the same menus executed with 250, 500, 750, and 1000 ms ([[references#MCK2|McKienzie, 2009]]) showed that 250 was definitely too short (more error conditions) but that the other three were more or less a toss-up. A DTMF system may actually require more processing time, as for each menu item the caller has to store both the description of the choice and its corresponding number.
+Pauses should be inserted between menu items to allow for that processing. 500 ms between options is a good place to start. Menu items that are super short, clear, and distinct may not require 500 ms in between. Anecdotal evidence shows that menu items that are longer and more complex may take more time, say 750-1000 ms. A controlled study with the same menus executed with 250, 500, 750, and 1000 ms ([[references#mckienzie|McKienzie, 2009]]) showed that 250 was definitely too short (more error conditions) but that the other three were more or less a toss-up. A DTMF system may actually require more processing time, as for each menu item the caller has to store both the description of the choice and its corresponding number.
  
 These pauses should be done as post-processing of the recorded prompts. Let the voice talent speak with natural pauses (which are often 150-250 ms), then insert extra silence to achieve the desired pause duration.
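The post-processing step described above comes down to simple arithmetic: the silence to insert is the target pause minus the pause the voice talent already left. The sketch below shows that calculation and generates the padding as raw audio. It assumes 16-bit signed linear PCM, mono, at 8000 Hz, where silence is all zero bytes; other encodings (8-bit unsigned PCM, mu-law) represent silence differently. All function names are hypothetical.

```python
def extra_silence_ms(natural_pause_ms: float,
                     target_pause_ms: float = 500) -> float:
    """Silence to add so the total pause reaches the target (never negative)."""
    return max(0, target_pause_ms - natural_pause_ms)


def silence_bytes(duration_ms: float,
                  sample_rate_hz: int = 8000,
                  sample_width_bytes: int = 2) -> bytes:
    """Zero-valued samples for the given duration.

    Assumes signed linear PCM, mono; zero bytes are only silence
    in that encoding.
    """
    n_samples = round(sample_rate_hz * duration_ms / 1000)
    return b"\x00" * (n_samples * sample_width_bytes)


if __name__ == "__main__":
    # A natural pause of 200 ms needs 300 ms more to reach 500 ms.
    pad_ms = extra_silence_ms(200)
    pad = silence_bytes(pad_ms)
    print(pad_ms, len(pad))
```

For longer, more complex menu items, the same helper can be called with ''target_pause_ms'' set to 750 or 1000.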
Line 54: Line 54:
 Why can't you just record each word that you plan to use in an application, then just join them (concatenate) as needed? Because coarticulation is the enemy of concatenation.
  
-In natural, continuous speech, the tongue and lips (articulators) approach but do not reach the final positions necessary to produce "perfect" speech. If the articulators did reach their target positions, however, the resulting speech would sound unnatural (hyperarticulated) and would be much slower than natural speech. For this reason, the actual sound of any given phoneme depends on the phonemes that surround it, resulting in [[https://en.wikipedia.org/wiki/Coarticulation | coarticulation]] . One of the amazing processes of language is how our brains unravel this and hear phonemes as discrete categories of sound ([[references#LIB1|Liberman et al., 1957]]). A consequence of how our brains have evolved to untangle coarticulation is that when snippets of audio that were recorded in different contexts are concatenated, they sound unnatural and jarring. An understanding of the concept of coarticulation is very important when planning the recorded output of a speech application.
+In natural, continuous speech, the tongue and lips (articulators) approach but do not reach the final positions necessary to produce "perfect" speech. If the articulators did reach their target positions, however, the resulting speech would sound unnatural (hyperarticulated) and would be much slower than natural speech. For this reason, the actual sound of any given phoneme depends on the phonemes that surround it, resulting in [[https://en.wikipedia.org/wiki/Coarticulation | coarticulation]] . One of the amazing processes of language is how our brains unravel this and hear phonemes as discrete categories of sound ([[references#liberman|Liberman et al., 1957]]). A consequence of how our brains have evolved to untangle coarticulation is that when snippets of audio that were recorded in different contexts are concatenated, they sound unnatural and jarring. An understanding of the concept of coarticulation is very important when planning the recorded output of a speech application.
  
 **// Use natural pause points to minimize coarticulation effects //**\\