Voice Talent

Note: This page assumes you have chosen to use professionally recorded voice talent. If you are using Text-to-Speech, see that page for more information. For more on choosing between recordings and Text-to-Speech, see the page Recordings vs. Text to Speech.

Recommendations and Considerations

Using a Voice Talent Agency

Clients may ask whether they can save money by having a friend or employee record the audio. Avoid this if at all possible: it is difficult to get high-quality audio from untrained talent and engineers. A voice talent agency is preferred for all professional projects.

Voice talent agencies (two prominent ones in the industry are GM Voices and Walsh Media) have extensive experience in voice recording and audio production. Because they rely heavily on repeat business, they are highly motivated to work with VUI designers and clients to select a high-quality voice talent consistent with enterprise branding (Graham, 2005, 2010). Their talents tend to stay with the agency for a long time, which is critical for ongoing changes unless you want to re-record the entire system. Some voice talent agencies may also provide a guarantee: if the talent leaves the agency, they will re-record your prompts at low or no cost.

Good voice talent agencies also provide a high degree of quality control. For example, typically their staff will read the message listing prior to the session and make suggestions about prompt content. Their staff may also take the message listing and put it into the format preferred by the talent. The sessions are recorded with the help of an audio engineer, who is responsible for maintaining a consistent acoustic quality across sessions, taking notes on the various “takes” that occur during the sessions, and producing properly formatted files. An experienced voice talent agency will also allow a VUI designer to dial into the session via a phone bridge, in order to listen to the session and provide live coaching to the voice talent as the session is being recorded. When selecting a talent agency to work with for the first time, asking them if they provide these types of services is a good way to gauge exactly how “full-service” the agency will be.

It is also possible to go directly to an independent voice talent. Many of them maintain their own studios, do their own audio engineering, and will enter into contracts directly with customers. This may allow a client to have a voice which is exclusive to their brand. However, the quality and technical know-how of independent talents will vary widely. For example, you may need a recorded audio file in a very specific format; not all independent talents are trained as audio engineers (although some are), and the talent you are working with may or may not be able to understand your requirements and deliver them correctly.

Recently, talent warehouse websites have arisen on the internet; one prominent one is The Voice Realm. These talent warehouse sites should not be mistaken for full-service voice talent agencies. Talent warehouses are simply aggregators. They allow independent talents to list themselves on the website. Using a talent warehouse, you can put out a “casting call” to a wide variety of independent talents who will then record samples of your audio. These talent warehouses are usually fairly inexpensive. However, when you are using one, you are simply using the aggregator's website to find and pay an independent talent who is often running their own studio and doing their own audio engineering. Therefore, even when you are using the same aggregator website, the quality, skill and technical ability will vary widely depending on the individual talent you are working with.

Your choice of which type of talent studio to work with should be driven by your business needs and the level of expertise you have available. If you are embarking on a large project for an important customer, an experienced voice talent agency will be a reliable partner who can consistently deliver high-quality audio, saving you project time. For a small demo or a concept clip, an independent talent or warehouse could be feasible, but you should expect to spend more time engaging with the talent/engineer and QAing the audio.
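When you do have to QA delivered audio yourself, a format check is one of the easier parts to automate. The following is a minimal sketch, assuming the deliverables are uncompressed PCM WAV files (which is what Python's standard `wave` module reads). The target format in `REQUIRED` (8 kHz, 16-bit, mono) and the function names are hypothetical placeholders; substitute whatever your platform actually requires.

```python
import io
import wave

# Hypothetical delivery spec (an assumption for illustration only):
# 8 kHz, 16-bit, mono PCM. Replace with your platform's real requirement.
REQUIRED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 8000}


def check_wav_format(source, required=REQUIRED):
    """Return a list of mismatches between a WAV file and the required format.

    `source` may be a file path or a binary file-like object.
    An empty list means the file matches the spec.
    """
    with wave.open(source, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }
    return [
        f"{name}: expected {expected}, got {actual[name]}"
        for name, expected in required.items()
        if actual[name] != expected
    ]


def make_test_wav(channels, sample_width, sample_rate, n_frames=800):
    """Build a silent in-memory WAV so the check can be tried without real audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (n_frames * channels * sample_width))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # A stereo 44.1 kHz file fails the hypothetical mono 8 kHz spec.
    for problem in check_wav_format(make_test_wav(2, 2, 44100)):
        print(problem)
```

A check like this only confirms the container format. It says nothing about noise, levels, or consistency between sessions, which still need a listening pass.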

Maintain consistency across the brand

If a voice talent already represents the brand in other media, for example, advertising, you should consider using that talent for your voice-enabled application. However, this is generally not practical if current branding employs a celebrity voice. Celebrities are often pricey and not readily available.

Also, consider the purpose of the voice-enabled application you are creating. Alexa Skills and Google Actions are often more heavily branded because they may be part of sales or marketing projects. IVRs, which usually sit in the customer service area of an organization, may be less heavily branded because customers may be calling for a different reason. So, consider whether the marketing campaign that features the celebrity is consistent with the goals of your voice-enabled application. For example, a prominent insurance company has a strong existing audio brand using a duck. When considering whether to extend this audio brand to the IVR, however, the team decided that the duck is a “happy duck” and therefore would not be appropriate for a phone line often used by customers calling about claims, which may be sad events.

If there is professional voice talent doing branding across several media channels, then there's a much stronger case for using the same talent for the IVR for consistency.

Tying voice talent selection to the desired system persona

The user persona developed by the business should drive the system persona designed for the voice application. (For more on user persona and system persona, see the page Persona and Brand.) This system persona should then drive the selection of voice talent. For example, consider an application developed by a cosmetics company to remind their independent beauty consultants of the birthdays and anniversaries of their colleagues and coworkers. In order to resonate with their consultants, the brand wishes to convey a system persona that is upbeat, fun, and feminine. The

Vocal characteristics

Register – High pitch or low pitch

Soprano / alto / tenor / bass

Open-ness, resonance, sonority

Associated with depth, size, seriousness

Chest-voiced quality

“James Earl Jones”

Closed-ness, nasalness

Associated with quickness, lightness

May be irritating

Head-voiced quality

“Mr. Burns”

“Read” or delivery

How the talent reads the lines

Speed, pacing

Spaces

Intonation

Good talents can change a lot of their characteristics through different “reads”

Give the client choices

If seeking a new voice talent, keep the client in the loop. It usually works well to provide clients with samples of three or four voices so they can choose the voice they feel best represents their company. Letting the stakeholders vote privately makes for a fun reveal and discussion of why voices were chosen.

What you will often find is that there isn't a clear favorite. This is OK. There are lots of voices that can pull off any given design. Always give your stakeholders choices where you'd be happy with any outcome. Equally important to the voice itself is the talent's ability to respond to coaching and deliver the messages the way they were intended.

Consider gender
See the Gender section below for more details. The bottom line is that it doesn't matter much. Ask the client if they have a preference. If you as the designer feel that something about the corporate culture lends itself to one or the other, make that recommendation.

Involve the right stakeholders in the decision
Make sure the highest-level executive who cares about the IVR voice is engaged in the selection process. “Trust me—you do not want to be in a meeting where you’re presenting the working version of the application (including all professional recordings) to the senior vice-president in charge of customer care who, upon hearing the voice for the first time, says, 'I hate it. We need a different voice'” (Lewis, 2011, p. 103).

Gender

Do not overemphasize gender
There is no compelling research to indicate an advantage based solely on the gender of the voice talent (Couper, Singer, & Tourangeau, 2004; Lewis, 2011). For average listeners in normal channels, “…there is little evidence to suggest that one sex of speaker is more intelligible than another, if other factors are ruled out. For example, males may typically have louder voices than females, and female voices may be more high-pitched than males, but if these factors are controlled for, any sex differences usually disappear” (Edworthy & Hellier, 2006).

There is a general tendency in the US to use a female voice for IVRs (likely due to their service-provider orientation – for a historical perspective, see Yellin, 2009), but there are numerous examples of successful use of male voices in IVRs. Find out if your client cares and, if so, take that into account when selecting a voice or set of voices to review.

There is no question that we all carry conscious and unconscious stereotypes in our heads. In recent years, the psychologist most strongly associated with research in how these stereotypes affect human-computer interaction is Clifford Nass (Nass & Brave, 2005; Nass & Yen, 2010; Reeves & Nass, 2003), most notably in the book, “Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship”. In that book, Nass and Brave (2005) described experiments in which different types of people used speech applications (notably, with most of the experiments using TTS rather than professional voice talents for their audio). In most of the studies they replicated classic social psychology studies of interactions between humans, replacing one of the humans with a speech-enabled computer, a variation of the “computers as social actors” (CASA) paradigm.

For example, they replicated the “similarity attraction” effect, the finding that people are attracted to other people who are similar to themselves. In these laboratory experiments, extroverts preferred an extroverted user interface and males preferred to hear a male voice. It turns out, however, that it is difficult to apply many of these findings to user interface design (e.g., how would you know in advance if a caller were male or female, introvert or extrovert). Additionally, they reported that people tend to rate male voices as more trustworthy (especially male listeners), and to expect females to be more nurturing.

Despite the reliability with which these social effects appear in replications of social psychology experiments, they are not as reliable when assessed in real-world systems that are otherwise usable, that is, efficient, effective, and pleasant (Balentine, 2007). Lewis (2011), in an analysis of data from studies of the perception of the quality of TTS voices (both male and female) rated by both males and females, did not find any significant Voice Gender by Listener Gender interaction, an interaction that the similarity attraction hypothesis would have predicted (and an effect replicated by Machado et al., 2012). Couper, Singer, & Tourangeau (2004) studied the influence of male and female artificial voices on more than 1000 respondents to an IVR survey on sensitive topics. They measured respondents’ reactions to the different voices and abandoned call rates, and found no statistically significant results related to the gender of the voices. In particular, there were no significant Voice Gender by Respondent Gender interactions.

“Why such strong effects of humanizing cues are produced in laboratory studies but not in the field is an issue for further investigation. … Across these studies, little evidence is found to support the ‘computers as social actors’ thesis, at least insofar as it is operationalized in a survey setting” (Couper et al., 2004, p. 567).

Coaching, inflection

Use a coach during the recording session
The coach must be familiar with the design
Even though you have hired an agency to do the recordings with professional voice talents in a professional recording studio, it's unrealistic to expect the voice talent to understand all the nuances of expression when reading from a written recording manifest (the list of audio segments to record). You need to have someone present during the recording session(s) to coach the voice talent regarding context, emphasis, and appropriate tone. Note that most agencies have the capability for coaches to phone into the session. If they don't, you probably want to find another agency. Coaches will sometimes travel to the talent for a long session for the initial release of a system, but over-the-phone coaching is more the norm, especially for subsequent sessions. Not coaching is not an option.

The coach needs to be able to decide on the fly when it's appropriate and necessary to interrupt the voice talent. The coach needs to foster a relaxed environment and to listen attentively throughout the session. Coaches must be familiar with the target voice attributes, have a good ear for subtle voice differences, and be able to guide the voice talent without being offensive. The coach also has to be familiar with the system being recorded and the context of each prompt. In a lot of cases the coach and the designer are one and the same because of that familiarity with the system. Whoever wrote it knows what it's supposed to sound like. However, not all designers make good coaches.

Add coaching notes to the design
Note in the recording manifest when it is important to emphasize a word or phrase. This can make all the difference between a prompt that guides the caller to say something the grammar can understand or misleads the caller into saying something out of grammar.

For any given sentence or phrase, there are many ways to speak it, only one or a few of which will be appropriate in a given context. For example, what is the correct way to record the question (appearing in a list of frequently asked questions), “What happens after I apply for cash assistance?” Should the speaker emphasize “What,” “happens,” “after,” “apply,” or “cash assistance”?

The answer depends on the question's context. If the surrounding items concern other aspects of applying for and getting cash assistance, then plan to emphasize “after,” contrasting it with the things that happen before applying. If the surrounding items have to do with other types of assistance such as food stamps or health benefits, then plan to emphasize “cash assistance.” It’s critical to get the prosodic element of contrastive stress correct (Cohen, Giangola, & Balogh, 2004; Lewis, 2011).

Usage notes are also helpful, especially when recording small pieces that will later be concatenated together. Knowing that something will be an element in a list or the last thing in a fill-in-the-blank sentence makes all the difference in the world in how it's recorded.

These notes in the manifest are all the more important if the designer is not the coach. They will be invaluable to the coach and voice talent.
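As an illustration of how emphasis and usage notes might be carried in a recording manifest, here is a minimal sketch. The field names, prompt IDs, the “June” entry, and the convention of marking contrastive stress in capitals are all assumptions made for this example; agencies and talents often have their own preferred manifest format.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    """One line of a recording manifest (hypothetical structure)."""
    prompt_id: str       # hypothetical identifier, e.g. the audio file name
    text: str            # exact wording for the talent to read
    emphasis: str = ""   # word or phrase that carries contrastive stress
    context: str = ""    # coaching note: what surrounds this prompt
    usage: str = ""      # usage note: how the recording will be concatenated


def format_for_talent(entry):
    """Render an entry as script text, with the stressed phrase in capitals."""
    text = entry.text
    if entry.emphasis:
        if entry.emphasis not in text:
            raise ValueError(
                f"{entry.prompt_id}: emphasis {entry.emphasis!r} not found in text"
            )
        # Capitalize only the first occurrence of the stressed phrase.
        text = text.replace(entry.emphasis, entry.emphasis.upper(), 1)
    lines = [f"{entry.prompt_id}: {text}"]
    if entry.context:
        lines.append(f"  Context: {entry.context}")
    if entry.usage:
        lines.append(f"  Usage: {entry.usage}")
    return "\n".join(lines)


manifest = [
    # Same wording, different stress, depending on the surrounding FAQ items.
    ManifestEntry(
        prompt_id="faq_cash_03",
        text="What happens after I apply for cash assistance?",
        emphasis="after",
        context="Other items in this list cover steps before applying.",
    ),
    ManifestEntry(
        prompt_id="faq_benefits_03",
        text="What happens after I apply for cash assistance?",
        emphasis="cash assistance",
        context="Other items in this list cover food stamps and health benefits.",
    ),
    # A small piece recorded for later concatenation.
    ManifestEntry(
        prompt_id="month_june_final",
        text="June",
        usage="Last item in a fill-in-the-blank sentence; read with sentence-final intonation.",
    ),
]

if __name__ == "__main__":
    for entry in manifest:
        print(format_for_talent(entry))
```

Keeping the notes as separate fields, rather than embedding them in the prompt text, means the same manifest can produce both the script for the talent and a checklist for the coach.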

For considerations to make when selecting a voice talent for a multilingual application, see Multilingual Applications.

References

Balentine, B. (2007). It’s better to be a good machine than a bad person. Annapolis, MD: ICMI Press.

Cohen, M. H., Giangola, J. P., & Balogh, J. (2004). Voice user interface design. Boston, MA: Addison-Wesley.

Couper, M. P., Singer, E., & Tourangeau, R. (2004). Does voice matter? An interactive voice response (IVR) experiment. Journal of Official Statistics, 20(3), 551–570.

Edworthy, J., & Hellier, E. (2006). Complex nonverbal auditory signals and speech warnings. In M. S. Wogalter (Ed.), Handbook of warnings (pp. 199–220). Mahwah, NJ: Lawrence Erlbaum.

Graham, G. M. (2005). Voice branding in America. Alpharetta, GA: Vivid Voices.

Graham, G. M. (2010). Speech recognition, the brand and the voice: How to choose a voice for your application. In W. Meisel (Ed.), Speech in the user interface: Lessons from experience (pp. 93–98). Victoria, Canada: TMA Associates.

Lewis, J. R. (2011). Practical speech user interface design. Boca Raton, FL: CRC Press, Taylor & Francis Group.

Machado, S., Duarte, E., Teles, J., Reis, L., & Rebelo, F. (2012). Selection of a voice for a speech signal for personalized warnings: The effect of speaker's gender and voice pitch. Work, 41, 3592–3598.

Nass, C., & Brave, S. (2005). Wired for speech: How voice activates and advances the human-computer relationship. Cambridge, MA: MIT Press.

Nass, C., & Yen, C. (2010). The man who lied to his laptop: What machines teach us about human relationships. New York, NY: Penguin Group.

Reeves, B., & Nass, C. (2003). The media equation: How people treat computers, television, and new media like real people and places. Chicago, IL: University of Chicago Press.

Yellin, E. (2009). Your call is (not that) important to us: Customer service and what it reveals about our world and our lives. New York, NY: Free Press.