Usability Testing

Usability Testing (UT) is the process of watching and listening to “real” people use an application in likely or realistic scenarios. In contrast to Usability Assessments (inspection methods such as heuristic or expert evaluation), UT reduces the influence of subjective opinion (whether from experts or from client management) by focusing on how people actually behave when interacting with an application. Usability Testing also offers methodological controls that allow us to compare different groups of users or test competing design alternatives in a rigorous but feasible way.

Usability Testing of speech applications follows the same general philosophy and methods as UT for other applications, but some differences exist. One way in which UT of speech applications differs is the inability to use the Think-Aloud (TA) method (also see Lewis, 2012): participants cannot say what they are thinking as they use a speech recognition system, because speaking their thoughts aloud would interfere with the spoken interaction itself. Instead, the test facilitator can interview the participant immediately following each interaction to obtain their reactions.

The success of UT depends primarily on three factors: 1) how well the test participants represent the background knowledge, attitudes, and situations of the people who will use the live system; 2) how well the test scenarios simulate realistic situations and provide participants with believable reasons for making calls; and 3) the degree to which the system being tested replicates the behavior of the production application.

Usability tests of a speech application may be conducted at many different stages throughout its design and development. Testing with a fully functional application generally must occur after development is complete, and provides the most grounded, realistic data, but it may come too late in the project to be maximally useful. Testing with a less functional, less realistic application can often happen earlier, but because users are interacting with a system that is not identical to the production application, the data are not as robust. One specific early UT method used for speech applications is known as “Wizard of Oz” (WOZ) testing, which can be conducted before the real system is completed. WOZ testing is particularly valuable when there are questions about how the target audience will interact (e.g., Sadowski & Lewis, 2001), but has some limitations relative to testing with a working prototype or deployed system, notably weakness in detecting problems with recognition, audio quality, and turn-taking or other timing issues (Sadowski, 2001).
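To make the WOZ idea concrete, here is a minimal sketch of how a wizard-driven session might be run: a human “wizard” listens to the caller, decides what the caller meant, and selects the next prompt from the dialog script, so the caller experiences a complete interaction even though no recognizer exists yet. The dialog states and prompts below are hypothetical illustrations, not taken from any particular application.

```python
# Minimal Wizard-of-Oz console sketch (hypothetical dialog, for illustration only).
# A human "wizard" listens to the caller, judges the caller's intent, and selects
# the next dialog state; the script shows the prompt the wizard should play back.

DIALOG = {
    "greeting": ("Welcome to the demo banking line. How can I help you?",
                 {"balance", "transfer", "agent"}),
    "balance":  ("Your checking balance is one hundred dollars. Anything else?",
                 {"greeting", "goodbye"}),
    "transfer": ("How much would you like to transfer?", {"greeting", "goodbye"}),
    "agent":    ("Transferring you to an agent now.", set()),
    "goodbye":  ("Thank you for calling. Goodbye.", set()),
}

def run_session():
    state = "greeting"
    while True:
        prompt, next_states = DIALOG[state]
        print(f"\n[PLAY PROMPT] {prompt}")
        if not next_states:
            break  # terminal state: the call ends
        choice = input(f"Wizard, caller's intent {sorted(next_states)}: ").strip()
        if choice in next_states:
            state = choice
        # Otherwise stay in the same state, which replays the prompt
        # (roughly simulating a reprompt after a misunderstanding).

if __name__ == "__main__":
    run_session()
```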

A typical usability test for a single user population requires two days of testing, with six participants per day and sessions lasting up to an hour each. This is not a hard-and-fast rule – sessions may be longer or shorter as required, and testing may be distributed over more days, especially if there are multiple distinct user groups who must be included in the test. There are statistical methods for estimating and validating sample sizes for these types of formative usability studies – for a review, see Chapter 7 of Sauro and Lewis (2012) or Lewis (2012, pp. 1292–1297).
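As a rough illustration of the estimates those references develop in detail, the sketch below uses the standard binomial problem-discovery model: if a usability problem affects a proportion p of users, the probability of observing it at least once with n participants is 1 − (1 − p)^n, which can be inverted to find the smallest n that reaches a target likelihood of discovery. The particular values of p and the target likelihood are arbitrary examples, not recommendations.

```python
import math

def min_sample_size(p: float, target: float) -> int:
    """Smallest n such that 1 - (1 - p)**n >= target, i.e., the chance of
    seeing a problem that affects proportion p of users at least once."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

def discovery_likelihood(p: float, n: int) -> float:
    """Probability of observing, at least once, a problem that affects
    proportion p of users when n participants are tested."""
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    # Example: to have a 90% chance of seeing each problem that affects
    # at least 30% of users, you need ceil(ln(0.10)/ln(0.70)) = 7 participants.
    print(min_sample_size(p=0.30, target=0.90))            # 7
    # The 12 participants of a typical two-day test give about a 98.6% chance.
    print(round(discovery_likelihood(p=0.30, n=12), 3))    # 0.986
```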

There are several costs involved in running these sessions, such as recruiting and compensating participants, facilitator and observer time, and facilities for conducting and recording the sessions.

It is also possible to conduct remote usability test sessions via conference call for WOZ testing. You can record these sessions for later analysis and review by tapping the phone line, by using built-in recording facilities for IP telephony (if available), or by using a video camera to capture the audio from a speakerphone.

The basic deliverable from UT is a written list of specific recommendations based upon observations made during testing. Typically there will be recommendations for changes to the design of the application, and for tuning of recognition grammars. There may also be broader recommendations for changes to client procedures for serving customers so that the total customer experience of the client company is a positive and profitable one.

Secondary deliverables can include quantitative usability metrics such as task completion times, task completion rates, and satisfaction ratings (Sauro & Lewis, 2012). Chapter 7 of Sauro and Lewis (2012) provides comprehensive guidance on determining how many participants to evaluate in this type of formative usability test (also see Lewis, 2012). For a published example of a usability evaluation of a speech recognition IVR, see Lewis (2008).
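For reporting small-sample completion rates, one widely used approach (recommended in Sauro & Lewis, 2012) is the adjusted-Wald binomial confidence interval. The sketch below computes it for an invented result; the sample data are illustrative only.

```python
import math

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96):
    """Adjusted-Wald confidence interval for a task completion rate:
    add z**2/2 successes and z**2 trials before applying the Wald formula."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

if __name__ == "__main__":
    # Hypothetical result: 10 of 12 participants completed the task.
    low, high = adjusted_wald_ci(successes=10, n=12)
    print(f"Completion rate: {10 / 12:.0%}, 95% CI roughly {low:.0%} to {high:.0%}")
```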

References

Lewis, J. R. (2008). Usability evaluation of a speech recognition IVR. In T. Tullis & B. Albert (Eds.), Measuring the user experience, Chapter 10: Case studies (pp. 244–252). Amsterdam, Netherlands: Morgan Kaufmann.

Lewis, J. R. (2012). Usability testing. In G. Salvendy (Ed.), Handbook of human factors and ergonomics (4th ed., pp. 1267–1312). New York, NY: John Wiley.

Sadowski, W. J. (2001). Capabilities and limitations of Wizard of Oz evaluations of speech user interfaces. In Proceedings of HCI International 2001: Usability evaluation and interface design (pp. 139–142). Mahwah, NJ: Lawrence Erlbaum.

Sadowski, W. J., & Lewis, J. R. (2001). Usability evaluation of the IBM WebSphere “WebVoice” demo (Tech. Rep. 29.3387, available at drjim.0catch.com/vxmllive1-ral.pdf). West Palm Beach, FL: IBM Corp.

Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience: Practical statistics for user research. Burlington, MA: Morgan Kaufmann.