Once upon a time I ran a lot of A/B tests. Some of them were well run but many of them weren’t, particularly in the early days.
Through trial and error I was able to use A/B testing to iterate to increased conversions on the checkout flow of the website I worked on. In most other areas of the site I ran tests and saw little change, but I still thought it had to be worth doing. After all, books like The Lean Startup practically make a mantra of telling you to split test everything.
Since becoming a freelance consultant to small companies and startups I have stopped using it as a tool, and as I’ve read more about the subject I’ve realised this is probably wise. It boils down to three main reasons, which I’ll expand on: time taken to get a result; not having the expertise; better testing options.
If you work in marketing, product, or design at a small-medium sized company then you should think about swerving A/B testing. The promising theory of split testing everything to learn which is better just doesn’t match the reality.
Note: I still believe measuring quantitative data is important, and the likes of Google Analytics are very useful for finding metrics on how users really behave. It’s the problems presented by measuring one thing against another and making big decisions about which is better that aren’t to be taken lightly.
Statistical significance is often the thing that most testing software and people running tests focus on, but there’s an equally important thing that is overlooked: statistical power. This is the probability that your test will detect a real difference if there is one, and getting enough of it means running the test with enough traffic.
Say you wanted to run a test to see page conversion change from 5% to 5.5% (a relative increase of 10%). That could mean thousands of dollars more to the business, but to be reasonably certain about that result around 62,000 users would need to go through the A/B test.
If this test is on a page deep in your site like the checkout flow, you’d likely need monthly unique user traffic to your website of 10x that to get the test finished in one month. Even the decent-sized startups I work with don’t have that kind of traffic. And that’s to detect a fairly big change of 10%. If you want to detect a smaller change, the number of users required rises steeply: halving the size of the change you want to detect roughly quadruples the users you need.
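To see where a figure like 62,000 comes from, here is a minimal sketch of the standard sample size approximation for comparing two conversion rates. It assumes a two-sided test at 5% significance with 80% power, which are common defaults but are my assumption rather than something stated above. The function name `users_per_variant` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant of a two-proportion z-test.

    alpha and power are assumed defaults (5% significance, 80% power).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # extra margin needed for the power
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_power) ** 2 * variance / (target - baseline) ** 2)


# 5% -> 5.5% conversion, the example from the article
per_variant = users_per_variant(0.05, 0.055)
total = 2 * per_variant
print(f"5% -> 5.5%:  {per_variant:,} per variant, {total:,} in total")

# Half the change (5% -> 5.25%) needs roughly four times the users
smaller = users_per_variant(0.05, 0.0525)
print(f"5% -> 5.25%: {smaller:,} per variant, {2 * smaller:,} in total")
```

Under those assumptions the total for the 5% to 5.5% case lands just above 62,000 users, in line with the figure above. Different significance or power settings will move the number, so treat it as a ballpark rather than a target.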
Leaving a test running for a month in a start-up environment is tough. That’s ages! So much can happen in that time. Can your team afford to wait for the results of that before designing something? Can you manage to avoid watching the results in that time and being spooked when they change? Can you resist the pressure from management to stop the test early and move onto something else? How will you spin it to stakeholders when after that month your test shows no difference (as many do)?
Of course you could take a punt and stop your tests a bit early but this often causes false results that would not be present were the test to run longer. This is a common problem that confuses people when their new feature launch doesn’t show the same success as the test.
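A rough simulation shows why stopping early is risky. The sketch below runs A/A tests, where both variants convert at exactly the same rate, so every “winner” is a false result. It compares declaring a winner at any of ten peeks against checking only once at the planned end. All the numbers (400 simulated tests, 10 peeks, 200 users per variant between peeks, a 5% conversion rate) are assumptions picked to keep it quick to run, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n users per variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # identical all-or-nothing results: no evidence of a difference
    z = ((conv_a - conv_b) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(runs=400, peeks=10, users_per_peek=200, rate=0.05, alpha=0.05, seed=1):
    """Return (share of tests significant at ANY peek, share significant at the end)."""
    rng = random.Random(seed)
    any_peek = at_end = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = last_significant = False
        for _peek in range(peeks):
            # Both variants share the same true rate, so there is nothing to find
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            last_significant = p_value(conv_a, conv_b, n) < alpha
            ever_significant = ever_significant or last_significant
        any_peek += ever_significant
        at_end += last_significant
    return any_peek / runs, at_end / runs


peeking_rate, final_rate = simulate()
print(f"False 'winner' if you stop at the first significant peek: {peeking_rate:.0%}")
print(f"False 'winner' if you only look at the planned end:       {final_rate:.0%}")
```

Looking only at the end should give a false result in roughly 5% of tests, which is what a 95% significance level promises. Stopping at the first significant peek gives a noticeably higher rate, because every extra look is another chance for noise to cross the line.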
The other thing you should do in a well-run test is target it at specific traffic that behaves consistently. If you’re showing a test to all of your users and it’s running for a month then your results are likely to be skewed when the company buys in a load of paid traffic halfway through the test, who act differently to your normal traffic. Of course narrowing your traffic down means you’re now going to need to run your test even longer!
Statistical significance and statistical power are just the start of the things you should be caring about when running A/B tests. Are you comfortable with concepts like false positives, p-values, binomial distributions, null hypotheses, A/A tests, two-tailed tests, etc?
If you’re from a design or marketing background then the answer is probably no. I know I wasn’t when I started running A/B tests. Do you understand what page 12 of this paper says? I certainly don't. Though I recommend reading the rest of that paper as it explains more on why so many A/B tests are badly done.
There’s no two ways around it. Running A/B tests may have been made easy to do with software, but in reality it means doing proper maths, and unless you understand statistics you’re leaving yourself wide open to running bad tests. Even if your test reaches the magical statistical significance of 95% or even 99%, the results are no guarantee of anything.
On its own, statistical significance just doesn’t mean much. In fact, some scientific journals no longer accept it as a meaningful indication of anything.
This is work that should be left to experts. Just like you wouldn’t get your copywriter to write your backend code, if you’re serious about running a proper programme of A/B testing then you should be investing in a data scientist to do it.
I’m now a firm believer that time in small/medium companies is better spent running user tests than A/B tests. Even when I was running A/B tests, most of my ideas for what to test were coming from user testing.
User tests are such rich sources of data on what users actually do and think when using your product, data that you can’t get any other way. At best, even a well-run A/B test tells you something about a single page or a single element. Meanwhile a well-run user test can tell you about the whole flow of a site and what people think of lots of different elements.
Even if you A/B test an old design of a page against a completely new one and you see the new one improves conversion, what have you learned to take forward to future designs? What were the parts that users loved? Because if you’re not careful you can end up interpreting the wrong things.
Qualitative evidence like user testing obviously doesn’t have the pressures of reaching the high levels of statistical rigour of quantitative testing. However if five people are saying they can’t find your search box then you don’t need statistical significance to know something needs to change. The power of watching users struggle or give up tells you much more than any numbers can. And you only need to see a user discover a bug once to know that it needs fixing.
If you’re in the early days of a business or have a small amount of traffic then a programme of user testing will help you shape your offering and give your customers what they want, quicker and easier than A/B tests can. With remote user testing anyone can feasibly be running user tests for a day every two weeks, potentially after each major release. You’ll have lots of rich ideas for what to change and knowledge as to why you should be changing them.
I’m not saying A/B testing can’t work. A considered and targeted programme run by a data scientist who understands statistics on a site with lots of traffic is likely to be an important part of any major business. If your business doesn’t fall into that category then you should put your efforts elsewhere.
Next time you see one of those start-up blog posts claiming they just increased sign-ups by 50% by testing one simple change, be very suspicious. If they don’t publish their results and tell you how much traffic they had, then you can probably ignore it. Many A/B test result claims are too good to be true.
You can learn how I redesign websites without the need for A/B testing in my course, The Evidence-Based Redesign.