Hey all! New user here, great to see a forum for these kinds of questions!
I’ve got a question about sample sizes for remote user testing.
I’m running some minor UI changes through remote user testing, and I’m curious about your experiences with tests that use sample sizes above 20 or 30.
I measured completion time for a task that depends on the UI I’ve tweaked, and that metric is working well: mean completion time has dropped noticeably. I realise the mean isn’t the ideal statistic for timing data, but the improvement is stark enough that it does the job.
I’m also trying to get a measure of satisfaction by using a Customer Effort Score (CES) question, which is a 7-point Likert rating question.
In my tests of n=50 so far, I’ve had good stability of results for the mean completion time. I’ve managed to achieve around a 30% increase in task completion speed, which is great.
But the CES results are quite up and down. For a single design variation, I’ve seen mean CES values of 4.74, 5.0, 5.3 and 5.7 across 4 identical n=50 tests. CES is a 7-point scale, so this is a pretty big variation.
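A rough way to check whether this spread looks like ordinary sampling noise is to compare the standard deviation of the four means against the standard error you would expect at n=50. The sketch below uses only the four means quoted above; the per-response standard deviation of 1.5 is an assumption for illustration (I haven't reported the real one), so swap in the value from the raw responses. With only four means, the estimate is very rough.

```python
import math
import statistics

# Mean CES from the four identical n=50 tests quoted in the post.
ces_means = [4.74, 5.0, 5.3, 5.7]
n = 50

# How much the test-level means actually vary from test to test.
observed_sd_of_means = statistics.stdev(ces_means)

# If sampling noise were the only cause, the means would vary by about
# response_sd / sqrt(n). Rearranged, this gives the per-response SD
# that the observed spread would imply.
implied_response_sd = observed_sd_of_means * math.sqrt(n)


def standard_error(response_sd: float, sample_size: int) -> float:
    """Expected SD of a sample mean for a given per-response SD."""
    return response_sd / math.sqrt(sample_size)


# ASSUMPTION: a per-response SD of 1.5 on the 7-point scale.
# Replace with the SD computed from the raw CES responses.
assumed_response_sd = 1.5
expected_se = standard_error(assumed_response_sd, n)

print(f"SD of the four means:       {observed_sd_of_means:.2f}")
print(f"Implied per-response SD:    {implied_response_sd:.2f}")
print(f"Expected SE at assumed SD:  {expected_se:.2f}")
```

For context, the largest possible SD on a 1 to 7 scale is 3 (half the answers at 1, half at 7). If the implied per-response SD comes out close to that ceiling, sample size alone may not explain the swings, and it could be worth checking whether the four runs really drew from comparable participants.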
When researching, I used:
Sample sizes recommended by NNG: https://www.nngroup.com/articles/quantitative-studies-how-many-users/ - but it’s a very old resource (2006).
https://www.nngroup.com/articles/card-sorting-how-many-users-to-test/ - this covers card sorting, which is different, but they recommend n=30 for a ‘lavish’ study.
https://www.nngroup.com/articles/how-many-test-users/ recommends n=39 for eye-tracking and other quant surveying, and that’s the maximum they recommend.
I thought that n=50 would be ample, but it seems that either it’s not enough or I’m barking up the wrong tree.
I feel like CES is a good measure to use for a ‘softer’ metric of customer satisfaction with the UI, and all the literature about it says it’s good for transactional tasks, which is exactly how I’m using it.
I have a limited understanding of the relevant statistics, hence my reliance on external sources. But it isn’t working well, and I’m casting around for either a good resource that details how to change sample sizes based on the complexity of your changes, or some magic formula that I can use to work out how many people I need to make the CES result more stable.
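The closest thing to a formula I’ve come across is the textbook one for estimating a mean to within a chosen margin of error: n = (z * sd / margin)^2, rounded up. Here is a minimal sketch of it, assuming CES responses can be treated as roughly interval data and that the per-response SD is known from earlier tests. The function name and the example SD of 1.5 are hypothetical, not values from my data.

```python
import math
from statistics import NormalDist


def required_sample_size(response_sd: float, margin: float,
                         confidence: float = 0.95) -> int:
    """Participants needed so the mean's confidence interval
    is about +/- `margin` wide, using the normal approximation."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * response_sd / margin) ** 2)


# ASSUMPTION: per-response SD of 1.5 on the 7-point scale.
for margin in (0.5, 0.25):
    n = required_sample_size(1.5, margin)
    print(f"margin +/-{margin}: n = {n}")
```

The thing to notice is that the required n grows with the square of the precision: halving the margin of error roughly quadruples the number of participants.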
Does anyone have any experience with this? Or science they can drop on me?
Thanks in advance!