Grammar Optimization is the process of using data and analytics to improve the performance of your grammars. There are a few terms to be aware of:
(1) In-grammar (IG): This identifies an utterance that should be covered by your grammar
(2) Out-of-grammar (OOG): This identifies an utterance that is not covered by your grammar. This can be deliberate (e.g., if you are a bank, you don’t recognize “large pepperoni pizza”) or an omission (e.g., if you’re recognizing dollar amounts and you forgot to include “no cents”).
(3) Correct accept (CA): This is an utterance that was correctly recognized by your grammar.
(4) False accept (FA): This is an utterance that was recognized but should not have been.
(5) Correct reject (CR): This is an utterance that was correctly rejected by your grammar.
(6) False reject (FR): This is an utterance that was rejected by your grammar but should have been accepted.
(7) Confidence threshold: The recognizer returns a value for how confident it is that it recognized something (usually on a 100- or 1000-point scale). If the value is above the confidence threshold, the utterance is recognized; if it is below, it is rejected.
These different measurements all interact: for example, raising the confidence threshold can reduce false accepts, but may also increase false rejects. Optimization is best performed on data from the actual production calling environment, despite the temptation to use data collected during, say, the QA cycle. The goal of grammar optimization is to get the best possible results for the recognition task at hand.
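To make these interactions concrete, here is a minimal sketch in Python that buckets transcribed utterances into the four categories and shows the FA/FR trade-off as the threshold rises. The `Utterance` fields and the 0–1000 confidence scale are assumptions for illustration, not any particular recognizer’s API; counting a misrecognized in-grammar utterance as a false accept is a common convention, but also an assumption here.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    transcript: str    # what the caller actually said (from human transcription)
    in_grammar: bool   # whether the transcript is covered by the grammar
    hypothesis: str    # what the recognizer returned
    confidence: int    # recognizer confidence, on an assumed 0-1000 scale

def classify(utt: Utterance, threshold: int) -> str:
    """Bucket one utterance into CA/FA/CR/FR at a given confidence threshold."""
    if utt.confidence >= threshold:
        # Accepted: correct only if the utterance was in-grammar and the
        # hypothesis matches what was actually said; a misrecognition of an
        # in-grammar utterance is counted as a false accept here.
        if utt.in_grammar and utt.hypothesis == utt.transcript:
            return "CA"
        return "FA"
    # Rejected: correct only if the utterance was out-of-grammar.
    return "CR" if not utt.in_grammar else "FR"

def sweep(utterances: list, thresholds: list) -> None:
    """Print the FA/FR trade-off as the confidence threshold rises."""
    for t in thresholds:
        counts = {"CA": 0, "FA": 0, "CR": 0, "FR": 0}
        for u in utterances:
            counts[classify(u, t)] += 1
        print(f"threshold={t}: {counts}")
```

Running `sweep()` over the same data at several thresholds makes the trade-off visible: as the threshold rises, FA counts fall while FR counts climb.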
The general process for optimizing grammars is:
(1) Collect utterances and associated data.
(2) Transcribe the utterances.
(3) Run analysis of the existing grammar against the transcribed utterances.
(4) Identify improvements to the grammar, prompts, and recognizer parameters.
(5) Test the new grammar.
(6) Deploy!
Data Collection
Data collected to optimize the grammars should mirror the caller demographics, usage patterns (time of day, day of week, week of month, etc.), and environments anticipated in the production calling environment. This data should never come from internal testers. The amount of data required depends on the usage patterns and size of the application, but the goal is to obtain at least 100 samples for each dialog state in order to obtain statistically significant findings. The data collected generally consists of logged data (the grammars loaded, parameter settings, and recognition results) and the corresponding caller audio files captured by the recognizer.
Optimization requires a large corpus of data (audio files collected by the recognition engine, logs, and transcriptions), and the efficacy of the effort is highly dependent upon the quality and quantity of that data. As mentioned above, the data should be collected from production systems over at least a week, with representative samples across all demographics, usage patterns (time of day, day of week, week of month, etc.), and call flows.
The quantity of data required depends on the complexity of the recognition task, though more data is always better. For relatively simple contexts (Yes/No), a minimum of 200 utterances is needed; more complex contexts require larger sets of utterances, and Natural Language contexts require a minimum of 10,000 utterances. Natural Language optimization additionally requires that each utterance be “tagged,” or assigned a semantic value.
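To make “logged data” concrete, here is a minimal sketch of what one record per utterance might contain. The field names are illustrative assumptions, not any particular platform’s log format.

```python
from dataclasses import dataclass, field

@dataclass
class LoggedUtterance:
    """One recognition event joined with its captured audio.

    All field names are illustrative; real platforms log equivalents
    under their own names.
    """
    call_id: str           # identifies the call, for demographic slicing
    dialog_state: str      # which prompt/state the caller was in
    timestamp: str         # supports time-of-day / day-of-week analysis
    grammars: list         # grammars active at this state
    parameters: dict       # recognizer settings (e.g., confidence threshold)
    hypothesis: str        # what the recognizer returned (empty if rejected)
    confidence: int        # recognizer confidence score
    audio_path: str        # caller audio captured by the recognizer
    transcript: str = ""   # filled in later by human transcription
    semantic_tag: str = "" # for Natural Language tuning: assigned meaning
```

A corpus of such records, grouped by `dialog_state`, is what the per-state sample-size guidance above refers to.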
Transcription
Once the data set has been collected, it should be transcribed by human listeners. This process involves listening to each caller utterance captured by the system and annotating what was said, along with any background events that may have occurred during the utterance (for example: side speech, background noise, or coughing). It is best to have documented transcription guidelines to ensure consistency, both with the tools used by the speech scientist and among transcribers.
The transcriptions should be cleaned up after being completed (e.g., fixing typos, formatting for input into the analysis/tuning process).
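A sketch of that cleanup step, assuming a simple bracketed-tag convention for background events; the tag set, the typo list, and the normalizations are all illustrative, not a standard:

```python
import re

# Assumed conventions: lowercase text with bracketed event tags such as
# [noise], [cough], [side-speech]. Both sets below are illustrative.
EVENT_TAGS = {"[noise]", "[cough]", "[side-speech]"}
TYPO_FIXES = {"recieve": "receive"}  # example entry from a review pass

def clean_transcription(raw: str) -> str:
    """Normalize one transcription for input to the analysis tools."""
    text = re.sub(r"\s+", " ", raw.strip().lower())  # collapse stray whitespace
    return " ".join(TYPO_FIXES.get(w, w) for w in text.split())

def split_events(clean: str) -> tuple:
    """Separate the spoken words from annotated background events."""
    tokens = clean.split()
    words = [t for t in tokens if t not in EVENT_TAGS]
    events = [t for t in tokens if t in EVENT_TAGS]
    return " ".join(words), events
```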
If appropriate, divide your transcriptions into two sets: one on which you base your changes, and a second, “experimental” set. The experimental set is held out so that it serves as a clean copy of production data for validating your changes.
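A minimal sketch of that split; the 20% holdout fraction and the fixed seed are illustrative choices, not prescribed by the text:

```python
import random

def split_for_tuning(utterances: list, holdout_fraction: float = 0.2,
                     seed: int = 42) -> tuple:
    """Split transcribed data into a tuning set (which drives grammar
    changes) and a held-out experimental set."""
    shuffled = list(utterances)          # avoid mutating the caller's data
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]
```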
In-Grammar vs. Out-of-Grammar
Not every out-of-grammar utterance should be added to a grammar. Many utterances in production data include side conversations, background noise, or responses that suggest the caller did not understand the question. Such utterances should not be added to a grammar, though significant numbers of them may indicate a problem with the call flow or prompting. When reporting how the grammar is performing in a given context, it is often useful to consider the in-grammar accuracy (how often the recognition result was correct when the response was in-grammar) alongside the overall accuracy (how often the recognition result was correct, regardless of whether the utterance was in- or out-of-grammar).
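Both views can be computed from the same classified results; a sketch, reusing `classify()` from the earlier example:

```python
def accuracy_report(utterances: list, threshold: int) -> dict:
    """In-grammar vs. overall accuracy, reusing classify() from the
    earlier sketch. A result counts as correct if it is a CA or a CR."""
    results = [(u, classify(u, threshold)) for u in utterances]
    in_grammar = [r for (u, r) in results if u.in_grammar]
    return {
        "in_grammar_accuracy":
            in_grammar.count("CA") / len(in_grammar) if in_grammar else 0.0,
        "overall_accuracy":
            sum(r in ("CA", "CR") for (_, r) in results) / len(results)
            if results else 0.0,
    }
```

Reporting the two numbers side by side distinguishes a grammar that recognizes in-grammar speech poorly from one that is merely being handed a lot of out-of-grammar input.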
In examining the results of the analysis, you should see some opportunities for improvement. Are there phrases that aren’t being recognized but should be (they’re out-of-grammar)? Are there utterances being recognized as something they shouldn’t be (a false accept, e.g., “Go to Malaysia” recognized as “Go to an agent”)?
Don’t limit tuning just to the grammars! Occasionally a prompt may need to be changed if it’s leading to a lot of out-of-grammar utterances. In addition to adding and removing items from the grammar, your recognizer may allow parameters such as speech endpointing to be adjusted. If a particular word isn’t being recognized, either because the default pronunciation is wrong or due to regionalisms, consider adding the pronunciation to a lexicon or dictionary; check your platform documentation for details. Use your experimental set to validate your changes and compare against the baseline. Take into account the needs of this particular state and the caller experience: is it better to get the answer wrong and proceed, or to reprompt?
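Finally, a sketch of validating a change against the baseline on the experimental set. The `rerecognize` hook is hypothetical: it stands in for however your platform re-runs recognition over the stored audio with the updated grammar, lexicon, or parameters.

```python
def compare_to_baseline(experimental_set: list, rerecognize,
                        baseline_threshold: int, new_threshold: int) -> None:
    """Validate a tuning change against the held-out experimental set.

    `rerecognize` is a hypothetical hook: a function that re-runs the
    recognizer with the updated grammar/lexicon/parameters over the
    stored audio and returns updated Utterance objects. How you batch
    re-recognition depends entirely on your platform's tooling.
    """
    baseline = accuracy_report(experimental_set, baseline_threshold)
    retuned = accuracy_report(rerecognize(experimental_set), new_threshold)
    for metric in ("in_grammar_accuracy", "overall_accuracy"):
        print(f"{metric}: {baseline[metric]:.3f} -> {retuned[metric]:.3f}")
```

Because the experimental set was never used to drive the changes, an improvement here is evidence the tuning will hold up in production rather than merely fitting the tuning data.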