Sample sizes for remote user testing


#1

Hey all! New user here, great to see a forum for these kinds of questions!

I’ve got a question about sample sizes for remote user testing.

I’m running some minor UI changes through remote user testing, and I’m curious about your experiences with tests where n is greater than 20 or 30.

I measured completion time for a task based on the UI I’ve tweaked, and it’s working well, with a nice drop in mean completion time. I realise the mean isn’t a perfect measure for this, but the improvement is stark enough that it’s doing the job.

I’m also trying to get a measure of satisfaction by using a Customer Effort Score (CES) question, which is a 7-point Likert rating question.

In my n=50 tests so far, the mean completion time has been nicely stable, and I’ve achieved around a 30% increase in task completion speed, which is great.

But the CES question is quite up and down. For a single design variation, across four identical n=50 tests, I’ve seen average CES scores of 4.74, 5.0, 5.3 and 5.7 (on a 7-point scale, that’s a pretty big swing).

When researching, I used:

  1. Sample sizes recommended by NNG: https://www.nngroup.com/articles/quantitative-studies-how-many-users/ - but it’s a very old resource (2006).

  2. https://www.nngroup.com/articles/card-sorting-how-many-users-to-test/ - this is for card sorting, which is different, but they recommend n=30 for a ‘lavish’ study.

  3. https://www.nngroup.com/articles/how-many-test-users/ - recommends n=39 for eye-tracking and other quantitative studies, but that’s the upper end of what’s recommended.

I thought that n=50 would be ample, but it seems that either it’s not enough or I’m barking up the wrong tree.

I feel like CES is a good ‘softer’ metric of customer satisfaction with the UI, and the literature about it says it’s good for transactional tasks, which is exactly how I’m using it.

I have a limited understanding of the relevant stats maths, hence my reliance on external resources. But that isn’t working well, and I’m casting around for either a good resource that explains how to adjust sample sizes based on the complexity of your changes, or some magic formula I can use to work out how many people I need to make this CES response more stable.

Does anyone have any experience with this? Or science they can drop on me?

Thanks in advance!
H


#2

I don’t think your sample size is the issue here, but you might be seeing either some of the shortcomings of the CES survey instrument or deficiencies in your screener. If you are recruiting GenPop respondents then individual differences might be the reason for the discrepancies - and unless you’re asking them why they gave a particular response, you’ll never know.
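
If it helps to put a number on the sampling-noise part, here’s a rough simulation of how much the mean of a 7-point item can bounce around at n=50. The response distribution is completely made up (something with a mean near 5 and a reasonable spread), so treat it as an illustration rather than anything about your panel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up distribution of 7-point CES responses (mean ~4.9, sd ~1.5)
values = np.arange(1, 8)
probs = np.array([0.03, 0.05, 0.10, 0.17, 0.27, 0.23, 0.15])

# Draw 1,000 independent "tests" of n=50 and see how much the mean moves
means = np.array([rng.choice(values, size=50, p=probs).mean() for _ in range(1000)])

print(f"population mean: {values @ probs:.2f}")
print(f"sd of the n=50 means: {means.std():.2f}")
print(f"middle 95% of the n=50 means: {np.percentile(means, [2.5, 97.5]).round(2)}")
```

With that made-up spread, the individual n=50 means land within roughly ±0.4 of the true value from sampling noise alone, so part of the swing you’re seeing could just be noise, and the rest is likely down to who happens to be in the panel that day.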

I’ve recently been digging into these single-ease questions, and for my situation I’m leaning towards either just asking a CSAT satisfaction question and keeping it simple, OR doing something more in line with NetEasy (CES plus a “Why did you give that response?” question).


#3

Thanks for the info, ryan! Yeah, perhaps the CES tool just isn’t very high resolution when used with general-population candidates. I’m using a remote panel, so I don’t know a lot about the respondents.

I was hoping to stick to simple questions that produce direct metrics for these tests, but adding a ‘why’ question would probably shed a little more light on the situation. Part of me wonders, though: if the CES numbers aren’t stable, would free-text analysis of a ‘why’ question be any more stable?

Obviously it’s useful for other reasons (finding insights, direct quotes and so on), but I’m doing a lot of these tests and want to spin through them really quickly. Although reading that back, perhaps I’m cutting off my nose to spite my face :slight_smile:


#4

There are some useful thoughts in https://measuringu.com/small-n/ about what you can glean from small sample sizes. tl;dr: you need to add error bars to your measurements, and they will be big error bars.
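
In practical terms, ‘error bars’ here can just be a 95% confidence interval around the mean, and you can run the same arithmetic backwards to get a feel for how many respondents it takes to narrow it. A rough sketch with placeholder numbers (not anyone’s real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder 7-point CES responses from one n=50 test - not real data
ces = rng.integers(2, 8, size=50)

mean = ces.mean()
sem = stats.sem(ces)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(ces) - 1, loc=mean, scale=sem)
print(f"mean CES = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")

# Back-of-envelope sample size for a target margin of error on the mean:
# n ≈ (1.96 * sd / margin)^2
sd = ces.std(ddof=1)
for margin in (0.5, 0.25, 0.1):
    n_needed = int(np.ceil((1.96 * sd / margin) ** 2))
    print(f"margin of error ±{margin}: roughly n = {n_needed}")
```

The part that usually surprises people: halving the margin of error takes roughly four times the sample size.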

From my perspective, I’d question the value of trying to quantify this stuff. I try to avoid pseudo-quant in qual research. The cost always seems too high to me, considering I’m collecting measures that may or may not correlate with real customer and business outcomes.

On the other hand, I do understand the pressure to deliver such things sometimes. It can feel reassuring to leadership.


#5

Thanks for your response, Tom! The good news is that I’m not running small-n tests, and I’m not too focused on the mean. I did some earlier work that looked empirically at the positive skew in completion-time data and compared the geometric mean with the arithmetic mean.

I think that work has led me not to put too much emphasis on either the arithmetic or the geometric mean, because in my tests the geometric mean moved around just as much as the arithmetic mean. It might have been a nicer, more correct number, but it wasn’t useful for analysis at n=25. You’re right about those error bars!
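
For anyone curious, the comparison was basically along these lines - synthetic, positively skewed completion times here rather than my real data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic n=25 samples of positively skewed completion times (seconds)
times_a = rng.lognormal(mean=np.log(30), sigma=0.5, size=25)
times_b = rng.lognormal(mean=np.log(30), sigma=0.5, size=25)

for label, t in (("A", times_a), ("B", times_b)):
    arith = t.mean()
    geo = np.exp(np.log(t).mean())  # geometric mean
    print(f"sample {label}: arithmetic mean = {arith:.1f}s, geometric mean = {geo:.1f}s")
```

Both samples come from the same distribution, and both kinds of mean still drift between them, which is roughly what I was seeing at n=25.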

I found a lot more value in looking at visual representations of the spread of data, specifically histogram charts, at this sample size.
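
The kind of picture I mean is just overlaid histograms of completion times for a baseline and a tweaked variant, something like this (again with synthetic numbers, and matplotlib as the charting library):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Synthetic completion times for a baseline and a tweaked UI (n=50 each)
baseline = rng.lognormal(mean=np.log(40), sigma=0.5, size=50)
tweaked = rng.lognormal(mean=np.log(28), sigma=0.5, size=50)

bins = np.linspace(0, 120, 25)
plt.hist(baseline, bins=bins, alpha=0.5, label="baseline")
plt.hist(tweaked, bins=bins, alpha=0.5, label="tweaked UI")
plt.xlabel("task completion time (s)")
plt.ylabel("participants")
plt.legend()
plt.show()
```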

Now that I’m doing remote tests, I can run tests with n=50 or n=100 in a matter of hours, which has led me to look at this kind of analysis again, but with much more data. I’m pretty sold now on the fact that CES moves around too much to be useful for measuring subtle changes, but looking at histograms and the spread of data has been very insightful.

I’m not sure this counts as ‘pseudo-quant’ when I can put 500 people in front of 5 tests and compare the data. What I am struggling with is finding the right balance of metrics that tell a good story about small changes at this sample size.

I’m publishing an article on it very soon, so I’m hoping that a) I explain myself a bit better :sweat_smile: and b) we can get some discussion going about it, because as a tool for interaction design it’s absolutely amazing to get this much data so quickly.


#6

Please share it here once you’re done with it! @dos4gw


#7

Started a new thread to keep things clean - check it out here: Case study: designing your UI for feature discovery with user research data