Improving the Reliability of Usability Evaluations
It is well documented that different evaluators conducting usability evaluations of the same product often come up with disparate findings. Does the suspect reliability of usability evaluations mean we should stop conducting them altogether? The answer is a clear and resounding no.
The reality is that usability evaluations can yield different results, depending on the way in which they are conducted. More importantly, however, usability evaluations identify important use-related challenges and hazards, and provide insight into how a design can be improved.
In this post, we’ll explore the challenges of conducting reliable usability evaluations and offer insights into how to overcome them. We’ll also discuss how to improve usability testing practices to ensure we are identifying the most important issues.
Usability Evaluation Methods
There are two primary methods of evaluating the usability of a product: (1) usability tests and (2) expert reviews. The key difference between the two is that usability tests are conducted with representative users, while expert reviews (e.g., heuristic analyses, cognitive walkthroughs) are typically performed by usability professionals and/or domain experts. Both methods use a set of tasks that help evaluators identify usability issues, and in both methods, usability professionals analyze the data in order to categorize problems, rate them according to a defined severity scale, and offer recommendations for design improvement. Whichever method is used, one thing is clear: different evaluators can produce different results.
Overlap of Usability Issues
One might reasonably assume that expert usability professionals conducting different evaluations of the same product would uncover the same usability problems. Unfortunately, the reality is quite different. Numerous studies have explored this issue, the most prominent being the Comparative Usability Evaluation (CUE) series. The lack of overlap is striking when you look at the percentage of issues that were reported by only a single team in each of the first four studies of the series:
- CUE-1 – 91%
- CUE-2 – 75%
- CUE-3 – 61%
- CUE-4 – 60%
The CUE-4 study represents the most comprehensive comparison of usability studies to date, involving 17 usability teams in total, nine of which performed expert reviews and eight of which conducted usability tests.1 As seen above, 60% of all problems reported were identified by only one team. Many others have found strikingly similar overlap in their own research. See, for example, Jeff Sauro’s blog “How Effective are Heuristic Evaluations?”
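To make the "reported by only one team" figure concrete, here is a minimal sketch of how such a percentage could be computed from a set of team reports. The team names and issue labels are invented for illustration and are not data from any CUE study. The sketch also assumes that matching issues have already been given the same label, which in practice is a manual and somewhat subjective step.

```python
from collections import Counter


def single_team_share(findings):
    """Return the fraction of distinct issues reported by exactly one team.

    `findings` maps a team name to the collection of issue labels it reported.
    """
    # Count how many teams reported each distinct issue.
    counts = Counter(
        issue for issues in findings.values() for issue in set(issues)
    )
    if not counts:
        return 0.0
    reported_once = sum(1 for n in counts.values() if n == 1)
    return reported_once / len(counts)


# Hypothetical example data (not taken from the CUE studies).
findings = {
    "team_a": {"login-label", "slow-search", "hidden-menu"},
    "team_b": {"slow-search", "tiny-font"},
    "team_c": {"hidden-menu", "slow-search", "no-undo"},
}

share = single_team_share(findings)
print(f"{share:.0%} of distinct issues were reported by a single team")
```

The same calculation can be run on your own studies whenever two or more evaluators review the same product, as a rough check on how much their findings overlap.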
Factors Affecting the Reliability of Usability Evaluations
A large part of the problem can be attributed to the variables affecting both types of usability evaluations. We’ll discuss several of the most important variables affecting usability evaluations, and offer some practical insights as to how to reduce their impact on reliability. This list is far from comprehensive, and we invite readers to add additional variables in the comments section at the end.
Task selection is an important aspect of both usability tests and expert reviews because the tasks performed greatly affect the interaction that test participants and/or evaluators experience. As Molich and Dumas point out, “A usability problem—even a critical one—normally will only be discovered if there are tasks that focus on the interface where the problem occurs” (p.275). Development teams should define the primary operating functions and frequently used functions, and then create tasks that will allow participants to interact with these areas. In medical device development, tasks should also be created to address potential use-related hazards defined during risk analysis.
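One lightweight way to apply this advice is to check that every primary function and every use-related hazard is exercised by at least one task. The sketch below is only an illustration under assumed names: the function list, hazard labels, and task titles are hypothetical, and a real project would draw them from its own requirements and risk analysis.

```python
def uncovered_items(required, tasks):
    """Return the required functions or hazards that no task exercises.

    `required` is a list of function/hazard labels.
    `tasks` maps a task title to the labels that task touches.
    """
    covered = set()
    for targets in tasks.values():
        covered.update(targets)
    return sorted(set(required) - covered)


# Hypothetical labels for an imaginary medical device.
required = [
    "power on",
    "set dose",
    "silence alarm",
    "hazard: wrong dose entry",
]
tasks = {
    "Task 1: start the device": ["power on"],
    "Task 2: program a dose": ["set dose", "hazard: wrong dose entry"],
}

missing = uncovered_items(required, tasks)
print("Not covered by any task:", missing)
```

Anything left in the returned list is an area where a usability problem could go undetected simply because no task leads participants there.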
The test protocol is another source of variability. Evaluation teams may use different instructions and ways of interacting with participants during a usability test, and these subtle differences can bias the results. At Farm, each moderator closely follows the same protocol. We also conduct pilot sessions to ensure participants fully understand the questions and are not biased by the way questions are framed. During formative testing, the think-aloud protocol may also uncover instances where the task instructions are misleading the user and/or causing confusion. For a more comprehensive discussion of creating a successful protocol, see Beth Loring and Joe Dumas’ “Moderating Usability Tests.”
Categorization of Problems
The categorization of usability problems is also important. In the CUE-4 study, participants were asked to use predefined categories. The authors found that identical issues were sometimes classified as positive findings and sometimes classified as usability issues. In other studies, including previous CUE studies, evaluators were asked to define their own categories and scales. The language used to define problems will undoubtedly affect the way people, including clients, understand test results. It is important that categories be easily defined and understood by the development team.
The evaluator’s domain knowledge also matters. In a study of heuristic evaluations conducted by Nielsen2, “double experts,” or usability experts with extensive knowledge of the specific domain being studied, performed better than usability professionals without domain expertise. It is sometimes suggested that usability professionals who become too knowledgeable about a device can lead test participants during the evaluation, but we have found that evaluators who take the time to understand a product will produce a better and more relevant list of issues than those who do not.
There is no hiding the subjective nature of providing recommendations. Nevertheless, this is a critical juncture in the process, one that represents a shift from research to solving problems. Some common problems include overly vague recommendations, recommendations that are in direct conflict with business goals, recommendations that reflect personal opinion alone, and implicit recommendations. To avoid some of these pitfalls, it is critical that evaluators provide solid evidence for how a recommendation addresses the issue that was uncovered. Similarly, we have found that recommendations are most useful when the usability team has been closely involved in the development process.
It is important to know that in a medical device summative report, third-party evaluators such as Farm are not supposed to suggest how an issue will be mitigated. We simply report the issue and provide the root cause from the user’s perspective. It is up to the device manufacturer to report to the FDA how they fixed the problem and re-tested the issue.
The Value of Usability Testing
According to available research, the results of usability testing and expert reviews can be inconsistent across evaluators. Fortunately, they can be made more reliable by applying rigor to various aspects of usability evaluations, including the test protocol and task selection. A usability evaluation, while based on the fundamental principles of behavioral science, is a tool used to provide better and safer products, and it should be judged on its ability to inform design change, to improve the user experience, and to improve the safety of medical devices. The science of evaluating products is not perfect, but if we keep the end goal in mind, we will have a better appreciation for the positive impact that usability evaluations have on product development.
1 Molich, R., & Dumas, J.S. (2008). Comparative usability evaluation (CUE-4). Behaviour & Information Technology, 27(3), 263-281.
2 Nielsen, J., & Molich, R. (1990, April). Heuristic evaluation of user interfaces. CHI 1990 Proceedings, 249-256. Seattle, Washington.