Differences

This shows you the differences between two versions of the page.

--- prosody [2018/08/21 11:31]
127.0.0.1 external edit
+++ prosody [2019/08/08 10:29] (current)
lisa.illgen_concentrix.com
@@ Line 2: / Line 2: @@
 In linguistics, prosody (pronounced /ˈprɒsədi/ pross-ə-dee, from Greek προσῳδία, prosōidía, [prosɔːdía], “song sung to music; pronunciation of syllable”) is the rhythm, stress, and intonation of speech. Prosody may reflect various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus; or other elements of language that may not be encoded by grammar or choice of vocabulary.
+{{tag>Editing}}
 ==== Turn-taking considerations ====
 **// Time pauses to encourage turn-taking at the appropriate times //**\\
 Pauses are the "white space" of auditory design and guide turn-taking. "The right word may be effective, but no word was ever as effective as a rightly timed pause" (M. Twain).
-Effective turn-taking drives effective conversation. In many cultures, an important cue that one person in a dialog has finished talking and expects a response from the other conversant is for that person to stop talking (Balentine, 2006; Beattie & Barnard, 1979; Heins et al., 1997; Johnstone et al., 1994; Margulies, 2005; Roberts et al., 2006; Stivers et al., 2009; Wilson & Zimmerman, 1986). That isn’t to say that one speaker never barges into another, but if someone who has been talking stops and waits, it’s pretty clear that they’ve given up the conversational floor. Research on the timing of turn-taking during American English service-based conversations over the phone indicates that pauses shorter than 250 ms will rarely trigger turn-taking, whereas pauses longer than 1300 ms will almost certainly induce the non-speaking conversant to take the floor (Beattie & Barnard, 1979; Commarford & Lewis, 2005). If this does not happen, it is indicative of a conversational problem requiring repair (Roberts et al., 2006).
+Effective turn-taking drives effective conversation. In many cultures, an important cue that one person in a dialog has finished talking and expects a response from the other conversant is for that person to stop talking ([[references#balentine2006|Balentine, 2006]]; [[references#beattie|Beattie & Barnard, 1979]]; [[references#heins|Heins et al., 1997]]; [[references#johnstone|Johnstone et al., 1994]]; [[references#margulies2005|Margulies, 2005]]; [[references#roberts|Roberts et al., 2006]]; [[references#stivers|Stivers et al., 2009]]; [[references#wilson|Wilson & Zimmerman, 1986]]). That isn’t to say that one speaker never barges into another, but if someone who has been talking stops and waits, it’s pretty clear that they’ve given up the conversational floor. Research on the timing of turn-taking during American English service-based conversations over the phone indicates that pauses shorter than 250 ms will rarely trigger turn-taking, whereas pauses longer than 1300 ms will almost certainly induce the non-speaking conversant to take the floor ([[references#beattie|Beattie & Barnard, 1979]]; [[references#commarford|Commarford & Lewis, 2005]]). If this does not happen, it is indicative of a conversational problem requiring repair ([[references#roberts|Roberts et al., 2006]]).
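The pause thresholds cited above can be sketched as a small classifier. This is only an illustration: the function name and the three labels are assumptions made for this example, and the 250 ms and 1300 ms cut-offs are the values reported for American English service calls, so they may not carry over to other languages or settings.

```python
# Thresholds taken from the paragraph above (American English phone service calls).
RARELY_TRIGGERS_MS = 250    # shorter pauses rarely trigger turn-taking
ALMOST_CERTAIN_MS = 1300    # longer pauses almost certainly trigger it


def turn_taking_likelihood(pause_ms: float) -> str:
    """Return a rough label for how likely a pause is to hand over the floor.

    The labels are hypothetical names chosen for this sketch.
    """
    if pause_ms < 0:
        raise ValueError("pause duration cannot be negative")
    if pause_ms < RARELY_TRIGGERS_MS:
        return "unlikely"
    if pause_ms > ALMOST_CERTAIN_MS:
        return "almost certain"
    return "possible"


if __name__ == "__main__":
    for pause in (100, 750, 1500):
        print(pause, "ms:", turn_taking_likelihood(pause))
```

A 750 ms pause falls in the middle band, between the two thresholds.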
 **// Turn-taking pauses should be about 750 ms //**\\
@@ Line 15: / Line 17: @@
 **// Set standard no input timeouts to 3-7 seconds //**\\
-For no input events (the system has stopped talking and is waiting for the user to pick up the conversation), the VoiceXML default of 7 seconds seems to work well in practice – shortening it to as little as 5 or 3 seconds also appears to work (Margulies, 2005; Yuschik, 2008). For special populations (non-native speakers, older adults) or tasks (getting a credit card number, performing steps to activate a cell phone), it’s reasonable to provide a longer timeout, anywhere from 10-30 seconds (Dulude, 2002), or to provide a pause/resume capability.
+For no input events (the system has stopped talking and is waiting for the user to pick up the conversation), the VoiceXML default of 7 seconds seems to work well in practice – shortening it to as little as 5 or 3 seconds also appears to work ([[references#margulies2005|Margulies, 2005]]; [[references#yuschik|Yuschik, 2008]]). For special populations (non-native speakers, older adults) or tasks (getting a credit card number, performing steps to activate a cell phone), it’s reasonable to provide a longer timeout, anywhere from 10-30 seconds ([[references#dulude|Dulude, 2002]]), or to provide a pause/resume capability.
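As a rough sketch of the timeout guidance above, the helper below clamps a requested no input timeout to the recommended range. The function name, the `extended` flag, and the choice to fall back to 10 seconds for the extended case are assumptions made for this example; only the 3-7 second range, the 7 second default, and the 10-30 second range come from the text.

```python
from typing import Optional

STANDARD_RANGE_S = (3.0, 7.0)    # standard no input timeout range
EXTENDED_RANGE_S = (10.0, 30.0)  # special populations or slow tasks
DEFAULT_TIMEOUT_S = 7.0          # default cited in the text


def no_input_timeout(requested_s: Optional[float] = None,
                     extended: bool = False) -> float:
    """Clamp a requested timeout (seconds) to the recommended range.

    `extended` is a hypothetical flag for callers or tasks that need more time.
    """
    low, high = EXTENDED_RANGE_S if extended else STANDARD_RANGE_S
    if requested_s is None:
        # Assumption: with no request, use the low end of the extended range.
        return low if extended else DEFAULT_TIMEOUT_S
    return min(max(requested_s, low), high)


if __name__ == "__main__":
    print(no_input_timeout())                    # standard default
    print(no_input_timeout(2))                   # raised to the lower bound
    print(no_input_timeout(45, extended=True))   # capped at the upper bound
```

A pause/resume capability, the alternative mentioned above, is a dialog design choice and is not modeled here.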
 ==== Pauses within prompts ====
@@ Line 23: / Line 25: @@
 **// Pause at least 500 ms between options in a menu //**\\
-Pauses should be inserted between menu items to allow for that processing. 500 ms between options is a good place to start. Menu items that are super short, clear, and distinct may not require 500 ms in between. Anecdotal evidence shows that menu items that are longer and more complex may take more time, say 750-1000 ms. A controlled study with the same menus executed with 250, 500, 750, and 1000 ms (McKienzie, 2009) showed that 250 was definitely too short (more error conditions) but that the other three were more or less a toss-up. A DTMF system may actually require more processing time, as for each menu item the caller has to store both the description of the choice and its corresponding number.
+Pauses should be inserted between menu items to allow for that processing. 500 ms between options is a good place to start. Menu items that are super short, clear, and distinct may not require 500 ms in between. Anecdotal evidence shows that menu items that are longer and more complex may take more time, say 750-1000 ms. A controlled study with the same menus executed with 250, 500, 750, and 1000 ms ([[references#mckienzie|McKienzie, 2009]]) showed that 250 was definitely too short (more error conditions) but that the other three were more or less a toss-up. A DTMF system may actually require more processing time, as for each menu item the caller has to store both the description of the choice and its corresponding number.
 These pauses should be done as post-processing of the recorded prompts. Let the voice talent speak with natural pauses (which are often 150-250 ms), then insert extra silence to achieve the desired pause duration.
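The post-processing step can be sketched as follows: measure the natural pause the voice talent left, then pad with just enough silence to reach the target. The function names are hypothetical, and the audio format (8 kHz, 16-bit signed mono PCM, where silence is all zero bytes) is an assumption for this example, not something the guidance specifies.

```python
def extra_silence_ms(natural_pause_ms: float, target_pause_ms: float = 500) -> float:
    """Silence to add so the total pause reaches the target (never negative)."""
    return max(0, target_pause_ms - natural_pause_ms)


def silence_bytes(duration_ms: float,
                  sample_rate_hz: int = 8000,
                  sample_width_bytes: int = 2) -> bytes:
    """Raw silence for signed mono PCM; zero-valued samples are silent."""
    n_samples = round(sample_rate_hz * duration_ms / 1000)
    return b"\x00" * (n_samples * sample_width_bytes)


if __name__ == "__main__":
    pad_ms = extra_silence_ms(natural_pause_ms=200)  # talent paused 200 ms
    padding = silence_bytes(pad_ms)
    print(pad_ms, "ms of padding,", len(padding), "bytes")
```

The padding would be appended to the end of a menu item's recording before the next item is played. A natural pause that already meets the target needs no padding.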
@@ Line 52: / Line 54: @@
 Why can't you just record each word that you plan to use in an application, then just join them (concatenate) as needed? Because coarticulation is the enemy of concatenation.
-In natural, continuous speech, the tongue and lips (articulators) approach but do not reach the final positions necessary to produce "perfect" speech. If the articulators did reach their target positions, however, the resulting speech would sound unnatural (hyperarticulated) and would be much slower than natural speech. For this reason, the actual sound of any given phoneme depends on the phonemes that surround it, resulting in [[https://en.wikipedia.org/wiki/Coarticulation | coarticulation]] . One of the amazing processes of language is how our brains unravel this and hear phonemes as discrete categories of sound (Liberman et al., 1957). A consequence of how our brains have evolved to untangle coarticulation is that when snippets of audio that were recorded in different contexts are concatenated, they sound unnatural and jarring. An understanding of the concept of coarticulation is very important when planning the recorded output of a speech application.
+In natural, continuous speech, the tongue and lips (articulators) approach but do not reach the final positions necessary to produce "perfect" speech. If the articulators did reach their target positions, however, the resulting speech would sound unnatural (hyperarticulated) and would be much slower than natural speech. For this reason, the actual sound of any given phoneme depends on the phonemes that surround it, resulting in [[https://en.wikipedia.org/wiki/Coarticulation | coarticulation]] . One of the amazing processes of language is how our brains unravel this and hear phonemes as discrete categories of sound ([[references#liberman|Liberman et al., 1957]]). A consequence of how our brains have evolved to untangle coarticulation is that when snippets of audio that were recorded in different contexts are concatenated, they sound unnatural and jarring. An understanding of the concept of coarticulation is very important when planning the recorded output of a speech application.
 **// Use natural pause points to minimize coarticulation effects //**\\
 Coarticulation does not persist over pauses, so you can use natural pause points to help define appropriate audio segments. But don't forget about inflection. In the account example above, even when a segment is bounded by a pause, its position in the sentence can call for a different inflection.
+See [[multilingual_applications|Multilingual Applications]] for more information on concatenation and coarticulation in multilingual applications.
 ==== How to record legal/non-barge-in-able messaging ====
 Designers rarely have the only say - or sometimes any say - when it comes to legal messages. Therefore, all of these recommendations come with a caveat of "as much as possible."
@@ Line 98: / Line 101: @@
 Yuschik, M. (2008). Silence locations and durations in dialog management. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems, 2nd edition (pp. 231-253). New York, NY: Springer.
+{{tag>Recordings}}
