Current Issues in the Determination of Usability Test Sample Size: How Many Users is Enough?
| Content Provider | Semantic Scholar |
|---|---|
| Author | Turner, Carl W.; Nielsen, Jakob |
| Copyright Year | 2002 |
| Abstract | The topic of "how many users" is of great interest to usability specialists who need to balance project concerns over ROI and timelines with their own concerns about designing usable interfaces. The panel will review the current controversies over usability test sample size, test validity, and reliability.

Introduction

Virzi (1992), Nielsen and Landauer (1993), and Lewis (1994) published influential articles on the topic of sample size in usability testing. In these articles, the authors presented a mathematical model for determining the sample size for usability tests. The authors presented empirical evidence for the models and made several important claims:
• Most usability problems are detected with three to five subjects
• Running additional subjects during the same test is unlikely to reveal new information
• Most severe usability problems are detected by the first few subjects (a claim supported by Virzi's data but not by Lewis' data)

Virzi's stated goal in determining an appropriate number of test subjects was to improve return on investment (ROI) in product development by reducing the time and cost involved in product design. Nielsen and Landauer (1993), building on earlier work by Nielsen (1988, 1989) and Nielsen et al. (1990), replicated and extended Virzi's (1992) original findings and reported case studies that supported their claims for needing only small samples for usability tests. The "small sample" claims and their impact on usability methodology have been popularized in Nielsen's (2000) widely read "useit.com" online column. Since that time, a number of authors have challenged Virzi's and Nielsen's "small sample" findings on methodological and empirical grounds (Bailey, 2001; Caulton, 2001; Spool & Schroeder, 2001; Woolrych & Cockton, 2001). Additionally, two large-scale experiments on usability test methods have been conducted that bear directly on Virzi's and Nielsen's claims (Molich et al., 1998; Molich et al., 1999).

The topic of "how many users" is of great interest to usability specialists who need to balance project concerns over ROI and timelines with their own concerns about designing usable interfaces. The goals of this panel discussion are to review the current controversies over usability test sample size and to lead a discussion of the topic with the audience:
• Examine the original "small sample" claims, including Nielsen's (1990), Nielsen and Landauer's (1993), Virzi's (1992), and Lewis' (1992, 1994) articles
• Review the responses, including studies that deal with the reliability of usability testing
• Make suggestions and recommendations for follow-up studies
• Each panelist will present their perspective on the topic of usability sample size (15 minutes). The panelists will then lead discussion of the topic with attendees.

The Original Claims

The earliest use of the formula 1 - (1 - p)^n for the purpose of justifying a sample size in usability studies was in Lewis (1982). In this paper, the formula was derived from the cumulative binomial probability formula. The claims from this paper were that "the recommended minimum number of subjects depends on the number of times a problem must be observed before it is regarded as a problem and the magnitude of the proportion of the user population for which one wishes to detect problems" (p. 719).
Wright and Monk (1991) also offered the formula as a means for estimating minimum sample sizes for usability studies, concluding that "even a problem that has only a probability of 0.3 of being detected in one attempt has a very good chance of being detected in four attempts" (p. 903). Neither Lewis (1982) nor Wright and Monk (1991) provided any empirical data to assess how well the formula modeled problem discovery in practice.

Virzi (1990, 1992) was one of the first researchers to provide empirical data supporting the use of the formula. He reported three experiments in which he measured the rate at which trained usability experts identified problems as a function of the number of naive participants they observed. He used Monte Carlo simulations to permute participant orders 500 times and obtain the average problem discovery curves for his data. Across the three sets of data, the average likelihoods of problem detection (p in the formula above) were 0.32, 0.36, and 0.42. He also had the observers (Experiment 2) and an independent group of usability experts (Experiment 3) rate the severity of each problem. Based on the outcomes of these experiments, Virzi (1992) made three claims regarding sample size for usability studies: (1) observing four or five participants allows practitioners to discover 80% of a product's usability problems, (2) observing additional participants reveals fewer and fewer new usability problems, and (3) observers detect the more severe usability problems with the first few participants.

Nielsen and Molich (1990) also used a Monte Carlo procedure to investigate patterns of problem discovery in heuristic evaluation as a function of the number of evaluators. The major claims from this paper were that individual evaluators typically discovered about 20 to 50% of the problems available for discovery, but that aggregates combining the findings of individual evaluators did much better, even when the aggregates consisted of only three to five evaluators. Nielsen (1992) replicated the findings of Nielsen and Molich and obtained results supporting the additional claim that evaluators with expertise in either the product domain or usability had higher problem discovery rates than novice evaluators. The data also supported the claim that evaluators who were experts in both usability and the product domain had the highest problem discovery rates.

Seeking to quantify the patterns of problem detection observed in several fairly large-sample studies of problem discovery (using either heuristic evaluation or user testing), Nielsen and Landauer (1993) derived the same formula from a Poisson process model (constant probability, path independent). They found that it provided a good fit to their problem-discovery data and that it provided a basis for predicting the number of problems existing in an interface and for performing cost-benefit analyses to determine appropriate sample sizes. Across 11 studies (five user tests and six heuristic evaluations), they found the average value of p to be .33 (ranging from .16 to .60, with associated estimates of lambda ranging from .12 to .58). (Note that Nielsen and Landauer used lambda rather than p, but the two concepts are essentially equivalent. In the literature, lambda, L, and p are commonly used to represent the average likelihood of problem discovery.)

Lewis (1992, 1994) applied the techniques used by Virzi to data from an independent usability study (Lewis, Henry, & Mack, 1990).
The results of this investigation clearly supported Virzi's second claim (additional participants reveal fewer and fewer problems), partially supported the first (observing four or five participants reveals about 80% of a product's usability problems as long as the value of p for a study is in the approximate range of .30 to .40), and failed to support the third (there was no correlation between problem severity and likelihood of discovery). Lewis noted that it is most reasonable to use small-sample problem discovery studies "if the expected p is high, if the study will be iterative, and if undiscovered problems will not have dangerous or expensive outcomes" (1994, p. 377). He also pointed out that estimating the required number of participants is only one element among several that usability practitioners must consider. Another key element is the selection and construction of the tasks and scenarios that participants will encounter in a study – concerns that are similar to the problem of assuring content validity in psychometrics.

Research following these lines of investigation led to other, related claims. In two thinking-aloud experiments, Nielsen (1994a) found that the value of p for experimenters who were not usability experts was about .30, and that after running five test participants the experimenters had discovered 77-85% of the usability problems (replicating the results of Nielsen & Molich, 1990). Dumas, Sorce, and Virzi (1995) investigated the effect of time per evaluator and number of evaluators, and concluded that both additional time and additional evaluators increased problem discovery. They suggested that it was more effective to increase the number of evaluators than to increase the time per evaluator.

Critical Responses to the Original Claims

Molich et al. (1998, 1999) conducted two studies in which they had several different usability labs evaluate the same system and prepare reports of the usability problems they discovered. There was significant variance among the labs in the number of problems reported, and there was very little overlap among the labs with regard to the specific problems reported. Kessner, Wood, Dillon, and West (2001) have also reported data that question the reliability of usability testing. They had six professional usability teams test an early prototype of a dialog box. The total number of usability problems was determined to be 36. None of the problems were identified by every team, and only two were reported by five teams. Twenty of the problems were reported by at least two teams. After comparing their results with those of Molich et al. (1999), Kessner et al. suggested that more specific and focused requests by a client should lead to more overlap in problem discovery. Hertzum and Jacobsen (1999, 2001) have described an "evaluator effect" – that "multiple evaluators evaluating the same interface with the same usability evaluation method detect markedly different sets of problems" (Hertzum & Jacobsen, 2001, p. 421). Across a review of 11 studies, they found that the average agreement between any two evaluators of the same system ranged from 5 to 65% … |
| File Format | PDF, HTM / HTML |
| Alternate Webpage(s) | https://static.aminer.org/pdf/PDF/000/240/410/problem_discovery_in_usability_studies_a_models_based_on_the.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
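
The claims reviewed in the abstract all rest on the cumulative-binomial model 1 - (1 - p)^n, where p is the average per-participant likelihood of detecting a problem and n is the number of participants. A minimal worked sketch in Python, assuming a single detection probability shared by all problems (the simplification the formula itself makes); the function names are illustrative, and the p values are the averages Virzi reported:

```python
import math

def discovery_proportion(p: float, n: int) -> float:
    """Expected proportion of problems found at least once by n participants,
    assuming every problem has the same per-participant detection probability p."""
    return 1 - (1 - p) ** n

def sample_size_for(target: float, p: float) -> int:
    """Smallest n such that 1 - (1 - p)^n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Virzi's three data sets gave average detection likelihoods of 0.32, 0.36, and 0.42.
for p in (0.32, 0.36, 0.42):
    print(f"p={p}: 5 participants find {discovery_proportion(p, 5):.0%}, "
          f"n needed for 80% = {sample_size_for(0.80, p)}")
# With p = 0.32, five participants are expected to uncover about 85% of the
# problems, and the 80% target is reached at n = 5.
```

Consistent with Lewis's caveat quoted above, the "four or five participants find 80%" result holds only when p is roughly in the .30 to .40 range; smaller values of p push the required n up quickly.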
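Virzi's average problem-discovery curves were obtained by permuting participant orders 500 times and averaging the cumulative proportion of distinct problems discovered after each participant. A small Monte Carlo sketch of that idea, assuming a boolean participant-by-problem detection matrix; the matrix shown is invented purely for illustration:

```python
import random

def mean_discovery_curve(detections: list[list[bool]],
                         permutations: int = 500,
                         seed: int = 1) -> list[float]:
    """Average cumulative proportion of distinct problems discovered after each
    participant, averaged over random orderings of the participants.
    detections[i][j] is True if participant i exposed problem j."""
    rng = random.Random(seed)
    n_participants = len(detections)
    n_problems = len(detections[0])
    totals = [0.0] * n_participants

    for _ in range(permutations):
        order = list(range(n_participants))
        rng.shuffle(order)
        seen: set[int] = set()
        for position, i in enumerate(order):
            seen.update(j for j in range(n_problems) if detections[i][j])
            totals[position] += len(seen) / n_problems

    return [total / permutations for total in totals]

# Hypothetical matrix: 4 participants x 5 problems.
matrix = [
    [True,  False, True,  False, False],
    [True,  True,  False, False, False],
    [False, False, True,  True,  False],
    [True,  False, False, False, True],
]
print(mean_discovery_curve(matrix, permutations=1000))
```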
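Hertzum and Jacobsen's "any-two evaluators" agreement is often operationalized as the overlap |A ∩ B| / |A ∪ B| between two evaluators' reported problem sets, averaged over all pairs; the excerpt above does not spell out the exact computation, so the sketch below assumes that measure and uses hypothetical team reports:

```python
from itertools import combinations

def any_two_agreement(problem_sets: list[set[str]]) -> float:
    """Mean pairwise overlap between evaluators' problem sets,
    where overlap for a pair is |A intersect B| / |A union B|."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical problem sets reported by three teams evaluating the same dialog box.
teams = [
    {"labels", "tab-order", "default-button"},
    {"labels", "default-button", "error-text"},
    {"tab-order", "help-link"},
]
print(f"average any-two agreement: {any_two_agreement(teams):.0%}")  # 25% here
```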