Many UX experts and gurus have argued that feedback from 5 participants is sufficient for evaluating usability. I think this may have been true in the past, but nowadays, given the changing demographics of tech users, at least 10 participants are usually required for meaningful results. Many factors influence the number of participants required for a study, and I will try to present my view of these considerations.
The basic starting point is to have at least 3 participants from each unique user group for each task. I think 3 participants is a bit on the low side; my preference is to have at least 4, and ideally 5. Since there may be unforeseen events, not all 5 participants may show up. There are two options for making sure at least 5 participants show up for your study. The first is to schedule a standby participant: a person who shows up to the testing and is only asked to participate if another participant doesn’t show up. The problem is that this person often has to arrive and wait for several hours. Once you know that you have sufficient participants they are dismissed, but it’s expensive to compensate them for this extended time.
The other option is to over-recruit. This means that you schedule 6-7 participants and anticipate that some won’t show up. This still tacks on an extra cost, but it is cheaper than having standby participants. The recruiting vendors I have been working with at the User Experience Center have a show-up rate (the percentage of scheduled participants who actually come to the session) of over 90%. This means that if we schedule 6 participants, the probability of at least 5 showing up is roughly 88% (and the probability of at least 4 showing up is approximately 98%). Therefore, scheduling 6 participants usually gives the best balance between cost and coverage.
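If you want to sanity-check these percentages yourself, attendance is a simple binomial calculation. Here is a minimal sketch in Python (the 90% show-up rate is the figure quoted above; the model assumes each no-show is independent, which real recruiting only approximates):

```python
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n scheduled participants show up,
    treating attendance as a binomial with per-participant show-up rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With 6 scheduled participants and a 90% show-up rate:
print(f"P(at least 5 show up): {p_at_least(5, 6, 0.9):.1%}")  # ~88.6%
print(f"P(at least 4 show up): {p_at_least(4, 6, 0.9):.1%}")  # ~98.4%
```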
Now, I also said “at least 3 participants from each unique user group for each task,” so let’s talk about user groups. If a significant difference is expected between user groups, it is necessary to recruit 6 participants for each of those groups. If the system is used by both power users and less frequent users, for example, they will likely interact with it differently. Power users tend to rely on recognition and learned sequences, so they can usually handle a lot of information on the same screen, as well as having all of the menu options in one large mega menu. Less frequent users, on the other hand, need more guidance, so if all the information is crowded onto the same screen they may have a harder time locating what they need. This means that to capture the unique experiences of these two groups, you’d need to double the number of participants for the test (6 per group, or 12 in total).
Another thing to consider when deciding participant count for user groups is age. I have observed over and over again that there is a significant difference between older (roughly 55 and above) and younger (roughly under 30) study participants. One example: almost all young participants understand the ‘hamburger menu’ in a tablet application, while approximately 25% of the older population don’t even find the navigation and often think the three lines are part of the logo. In the early days of the Internet, when the common approach was to schedule 5 participants (to have at least 3 show up), all tech-savvy people were essentially in the same age bracket. Therefore, it wasn’t necessary (or possible!) to have a cross section of ages represented in the study. Besides familiarity with technology, we also need to consider another factor: vision declines with age (The importance of accessibility).
Furthermore, the number of participants depends on how much of the system will be covered in testing. The more areas that are tested, the more tasks need to be performed. If there are many tasks, the first thing to do is to increase session length, but I advise against sessions longer than 90 minutes, since participants will suffer from fatigue. If all the tasks can’t fit into a 90-minute session, one solution is to rotate tasks. If the tasks are rotated (each participant performs a randomized set of 10 out of 15 tasks, for example), it is necessary to increase the number of participants, since it is desirable to have at least 4-5 participants go through each of the tasks.
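To make the rotation arithmetic concrete: if each participant performs 10 of 15 tasks, each task is covered by two-thirds of the participants on average, so getting at least 5 runs per task takes roughly 5 × 15 / 10 = 7.5, i.e. 8 participants, and a purely random rotation needs a margin on top of that. Here is a rough sketch (the task counts are the example figures above; the purely random draw is an illustrative assumption, and a counterbalanced design would spread coverage more evenly):

```python
import random
from collections import Counter

def simulate_rotation(n_participants: int, n_tasks: int,
                      tasks_per_session: int, seed: int = 1) -> Counter:
    """Give each participant a random subset of tasks and count how many
    participants end up covering each task."""
    rng = random.Random(seed)
    coverage = Counter({task: 0 for task in range(n_tasks)})
    for _ in range(n_participants):
        for task in rng.sample(range(n_tasks), tasks_per_session):
            coverage[task] += 1
    return coverage

# 9 participants doing 10 of 15 tasks averages 6 runs per task,
# but random draws can leave individual tasks thin:
coverage = simulate_rotation(n_participants=9, n_tasks=15, tasks_per_session=10)
print(min(coverage.values()), max(coverage.values()))
```

In practice a counterbalanced rotation, rather than a random draw, is the safer way to keep every task at or above the 4-5 participant floor.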
Lastly, keep in mind how many platforms are being tested. If there are only a few tasks on each platform, it may be possible to have the same participant test all the platforms; otherwise the number of participants will need to be increased.
So far I have primarily discussed this from a “milestone test” point of view (final testing before moving on to the next phase of the design). But there is another type of testing that I call the “sanity check”. The sanity check is a quick test that designers should be doing as they design the interface. These can be conducted with 3 participants and can be done very cheaply through online testing (as long as there is no confidential information). These sanity checks let the designer make sure there aren’t any major issues while also working in a ‘lean’ way.
There are other factors to consider as well (budget, combinations of user groups, etc.), but if these basics are kept in mind the researcher should be in pretty good shape. In summary, most “milestone tests” should schedule at least 12 participants (6 per user group, unless the product only targets one user group), while a “sanity check” can be conducted with as few as 3 participants.
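Putting the scheduling rules together, here is a small planner sketch that reuses the binomial helper from earlier (the 85% confidence threshold is my own illustrative choice, not a figure from the discussion above):

```python
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n scheduled participants show up."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def to_schedule_per_group(needed: int, show_rate: float,
                          confidence: float = 0.85) -> int:
    """Smallest number to schedule per user group so the chance of at
    least `needed` participants showing up meets the confidence target."""
    n = needed
    while p_at_least(needed, n, show_rate) < confidence:
        n += 1
    return n

# 5 show-ups needed per group, a 90% show-up rate, and two user groups:
per_group = to_schedule_per_group(needed=5, show_rate=0.9)
print(per_group, per_group * 2)  # 6 per group -> 12 scheduled in total
```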
© David Juhlin and www.davidjuhlin.com, 2015