Some background on habitability
Habitability refers to the ease, naturalness, and effectiveness with which callers can use spoken-language systems (Watt, 1968). According to habitability theory, there are four domains in which users of spoken-language systems must stay: conceptual, functional, syntactic, and lexical.
“A natural language system must be made habitable in all four domains because it will be difficult for users to learn which domain is violated when the system rejects an expression” (Ogden & Bernick, 1997, p. 139).
The elements of habitability are largely controlled by the capability of the active grammar(s).
Start with the smallest grammar that has a chance of being habitable
All other things being equal, the more habitable an application is, the lower its recognition accuracy, because a larger grammar increases the potential for acoustic confusion. Indeed, one common cause of misrecognition is acoustic confusability among the currently active phrases in the grammar (Fosler-Lussier et al., 2005).
Keeping in mind that callers tend to mimic what they hear in prompts (see Mimicry of Prompt), create choices that are as acoustically distinct as possible. Don't go overboard, though. If “A” and “B” are what everybody calls two things and there are no reasonable synonyms for either, don't force one option into an ill-fitting synonym. Once you've created the choices, restrict the grammar to those choices. Adding the acoustically confusable synonyms back in defeats the purpose of making the choices in the prompt distinct. Also keep in mind that systems that strive for a more conversational tone will necessarily have more complex grammars, but it is often possible to achieve both a conversational tone and choices that have high acoustic distinctiveness. In addition, somewhat counterintuitively, longer menu options such as “I need assistance” can be easier for the recognizer to understand than shorter menu options such as “help.” This has to be balanced, however, against the caller's need to hear, understand, and repeat back the menu option for it to be usable at all.
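One rough way to screen a candidate set of choices for confusable pairs is sketched below in Python. It is only an illustration: the function name `confusable_pairs` and the 0.8 threshold are assumptions, and the comparison uses spelling similarity as a crude stand-in for acoustic similarity. It would miss sound-alike pairs that are spelled differently (for example, “no” and “new”), so a pronunciation-based comparison or real recognition data is a better signal when available.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(phrases, threshold=0.8):
    """Return (phrase_a, phrase_b, score) for every pair whose spelling
    similarity meets the threshold, highest score first.

    Spelling is only a proxy for sound; the threshold is an arbitrary
    starting point, not a validated cutoff.
    """
    flagged = []
    for a, b in combinations(phrases, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical set of active grammar items.
airports = ["Addison", "Madison", "Boston"]
for a, b, score in confusable_pairs(airports):
    print(f"Possible confusion: {a!r} vs {b!r} (similarity {score})")
```

Pairs flagged this way are candidates for review, not automatic removals; as noted above, some items cannot be renamed or dropped.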
Build grammars that contain what callers are most likely going to say for each of the menu options
The grammar for each option needs to include more than just the verbatim wording given in the prompt. Every reasonable utterance based on the prompt needs to be included, and don't forget to look at reprompts in case there are any variations. However, you don't want to overgenerate your grammars from the start. Put in what's reasonable, not what's possible.
Tuning is when you expand the list of synonyms based on actual utterances.
Consider this prompt:
What synonyms should be included for “set up payment arrangements”? You would definitely want to include “payment arrangements” by itself, without the “set up” part. Beyond that, it's probably best to wait for tuning data before adding anything else.
Consider ALL prompts at a dialog state
When determining which grammar items to include, make sure to review all prompts that may be spoken at a given dialog state. While an initial prompt may offer an option one way, the reprompts may use alternate wording that should also be included in the grammar. For example, at a given input state the prompts may be:
Initial Prompt: Which would you like to do? “hear my balances”, “make a payment”, or “set up payment arrangements”.
First Reprompt: Sorry? You can say “balances”, “make a payment”, or “payment arrangements”.
The grammar for this dialog state would need to include both “hear (my) balances” and “balances”, as well as “set up payment arrangements” and “payment arrangements”, based on the prompted options.
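The idea of collecting grammar items from every prompt at a dialog state can be sketched as follows. This is a minimal Python illustration, not a real grammar format: the helper names `expand_optional` and `build_state_grammar` and the option labels are hypothetical, and it assumes that parentheses mark an optional word, as in “hear (my) balances”.

```python
import re
from itertools import product


def expand_optional(pattern):
    """Expand a pattern such as 'hear (my) balances' into every phrase it
    covers, treating each parenthesized group as optional."""
    parts = re.split(r"(\([^)]*\))", pattern)
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")"):
            choices.append(["", part[1:-1]])  # omit or include the optional words
        else:
            choices.append([part])
    phrases = set()
    for combo in product(*choices):
        phrases.add(" ".join("".join(combo).split()))  # normalize whitespace
    return sorted(phrases)


def build_state_grammar(options):
    """Map each accepted phrase to the option it selects. `options` maps an
    option label to patterns drawn from ALL prompts at the dialog state."""
    grammar = {}
    for option, patterns in options.items():
        for pattern in patterns:
            for phrase in expand_optional(pattern):
                grammar[phrase] = option
    return grammar


# Patterns taken from the initial prompt and the first reprompt.
state_options = {
    "balances": ["hear (my) balances", "balances"],
    "make_payment": ["make a payment"],
    "payment_arrangements": ["set up payment arrangements", "payment arrangements"],
}
grammar = build_state_grammar(state_options)
for phrase, option in sorted(grammar.items()):
    print(f"{phrase} -> {option}")
```

Keeping the patterns grouped by option makes it easy to see which prompt wording each grammar entry came from, and to add synonyms later when tuning data justifies them.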
Test and tune the grammar
Some phrases will work better than others. During habitability testing, usability testing, and grammar tuning, identify the phrases that have relatively high failure rates and modify them as indicated by the specific patterns of failure. These modifications can include:
For example, consider the US airports “Addison” and “Madison”. Suppose one of the prompts in the application is “Where are you flying from?” Also suppose callers mimic the prompt and say “Flying from Addison” or “Flying from Madison”. Due to coarticulation, “from Addison” and “from Madison” will sound very much alike. You can't arbitrarily decide to not recognize an airport in an airport grammar, but there are some things you could do:
Pick synonyms that fix the problem
When a synonym is available, consider changing both the prompt and the grammar to use it. For example, suppose the system is confusing “no” and “new”. Depending on the context, you might be able to replace “new” with “recent” in the prompt and drop “new” from the grammar.
If it isn't obvious what synonym to use, there are a number of strategies for finding them:
The ultimate check on any change is whether the outcome variables of recognition accuracy and task completion rates improve.
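As a rough illustration of this kind of check, the Python sketch below computes per-phrase failure rates from tuning records and compares outcome measures before and after a change. The record layout, the function names, and all of the numbers are made up for the example; real tuning data would come from transcribed calls and the application's own reporting.

```python
from collections import defaultdict


def phrase_failure_rates(records):
    """records: iterable of (spoken_phrase, recognized_phrase) pairs.
    Returns {spoken_phrase: fraction of attempts recognized incorrectly}."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for spoken, recognized in records:
        attempts[spoken] += 1
        if recognized != spoken:
            failures[spoken] += 1
    return {phrase: failures[phrase] / attempts[phrase] for phrase in attempts}


def improved(before, after):
    """True only if both outcome measures went up after the change."""
    return (after["recognition_accuracy"] > before["recognition_accuracy"]
            and after["task_completion"] > before["task_completion"])


# Made-up tuning data for illustration only.
records = [
    ("no", "no"), ("no", "new"), ("no", "no"), ("no", "new"),
    ("new", "new"), ("new", "no"),
    ("recent", "recent"), ("recent", "recent"),
]
rates = phrase_failure_rates(records)
for phrase, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase}: {rate:.0%} failed")

# Made-up outcome measures before and after a prompt and grammar change.
before = {"recognition_accuracy": 0.82, "task_completion": 0.71}
after = {"recognition_accuracy": 0.88, "task_completion": 0.75}
print("Keep the change" if improved(before, after) else "Revisit the change")
```

Requiring both measures to improve is a deliberately strict choice for the sketch; in practice a team may accept a change that raises one measure while leaving the other flat.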
Fosler-Lussier, E., Amdal, I., & Kuo, H.-K. J. (2005). A framework for predicting speech recognition errors. Speech Communication, 46, 153–170.
Ogden, W. C., & Bernick, P. (1997). Using natural language interfaces. In M. Helander, T. K. Landauer, & P. Prabhu (Eds.), Handbook of human-computer interaction (pp. 137–161). Amsterdam, Netherlands: Elsevier.
Watt, W. C. (1968). Habitability. American Documentation, 19(3), 338–351.