I used to run a lot of A/B tests. Some of them were well run but many of them weren’t, particularly in the beginning. Through trial and error I was able to use them to iterate towards increased conversions on the checkout flow of the website I worked on. In most other areas of the site I ran tests and saw little change, but I still thought A/B testing was a handy tool in my data-driven UX design toolkit. After all, books like The Lean Startup practically make a mantra of telling you to split test everything.
However, since becoming a freelance consultant to smaller companies and start-ups I stopped bothering with it, and as I’ve read more about the subject I’ve realised this is probably wise. It boils down to three main reasons, which I’ll expand on: time taken to get a result; not having the expertise; better testing options. If you work in marketing, product, or design at a small-medium sized company then you should think about doing the same. The promising theory of split testing everything to learn which is better doesn’t match the reality.
Note: I still believe measuring quantitative data is important, and tools like Google Analytics are very useful for metrics on how users really behave. It’s the problems presented by measuring one thing against another, and then making big decisions about which is better, that aren’t to be taken lightly.
Statistical significance is often the thing that most testing software and most people running tests focus on, but there’s an equally important thing that is overlooked: statistical power. In practice that means running the test with enough traffic that a real change is reliably detected and the result isn’t just chance. A page conversion change from 5% to 5.5% (a relative increase of 10%) could mean thousands of dollars more to the business, but to be reasonably confident about that result around 62,000 users would need to go through the A/B test.
If the test is on a page deep in your site like the checkout flow, you’d likely need a monthly unique user traffic of 10x that to get the test finished in one month. Even the decent-sized start-ups I work with don’t have that. And that’s to detect a fairly big change of 10%. If you want to detect a smaller change, the number of users required rises steeply: it grows roughly with the inverse square of the change, so halving the change you want to detect needs about four times the users.
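For anyone who wants to check these numbers, here is a rough sketch of the standard sample-size approximation for comparing two conversion rates. The 5% significance level, 80% power and two-sided test are my assumptions (they’re common defaults and they reproduce the 62,000 figure), and `sample_size_per_variant` is just a name I’ve made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, variant, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant of a two-sided
    two-proportion z-test (normal approximation).

    alpha=0.05 and power=0.80 are assumed defaults, not values
    taken from any particular testing tool.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    variance = baseline * (1 - baseline) + variant * (1 - variant)
    effect = variant - baseline
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# 5% -> 5.5% is the 10% relative lift discussed above.
n_10 = sample_size_per_variant(0.05, 0.055)
# 5% -> 5.25% is a 5% relative lift, half the size.
n_5 = sample_size_per_variant(0.05, 0.0525)

print("10% lift, total users:", 2 * n_10)
print(" 5% lift, total users:", 2 * n_5)
print("ratio:", round(n_5 / n_10, 2))
```

Under those assumptions the 10% lift comes out at roughly 31,000 users per variant, so about 62,000 in total, and halving the lift roughly quadruples the traffic you need.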
Leaving a test running for a month in a start-up environment is tough. That’s ages! So much can happen in that time. Can your team afford to wait for the results of that before designing something? Can you manage to avoid watching the results in that time and being spooked when they change? Can you resist the pressure from management to stop the test early and move onto something else? How will you spin it to stakeholders when after that month your test shows no difference (as many do)?
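To see why watching the results and stopping early is risky, here is a small simulation sketch of an A/A test, where both variants are identical so any “significant” result is a false positive. All the numbers (500 runs, 2,000 users per variant, a peek every 200 users, a 5% conversion rate) are assumptions picked to keep it quick to run, and `is_significant` and `simulate` are hypothetical helper names.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% significance threshold


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n users in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:  # no conversions yet, nothing to compare
        return False
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate(runs=500, users=2000, peek_every=200, rate=0.05, seed=1):
    """Return (false positive rate when peeking, rate when checking once)."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged_at_peek = False
        for i in range(1, users + 1):
            # Both variants convert at the same true rate (an A/A test).
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % peek_every == 0 and is_significant(conv_a, conv_b, i):
                flagged_at_peek = True  # someone would have stopped here
        peek_hits += flagged_at_peek
        fixed_hits += is_significant(conv_a, conv_b, users)
    return peek_hits / runs, fixed_hits / runs


peek_rate, fixed_rate = simulate()
print("false positives when peeking:      ", peek_rate)
print("false positives when checking once:", fixed_rate)
```

Checking once at the end should give a false positive rate somewhere near the 5% you signed up for, while declaring a winner at the first significant peek should push it well above that, even though nothing changed between the two variants.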
The other thing you should be doing in a well-run test is targeting it at specific traffic that is likely to behave consistently. If you’re showing a test to all your users and it’s running for a month then your results are likely to be skewed when the company buys in a load of paid traffic halfway through the test, who act differently to your normal traffic. Of course narrowing your traffic down means you’re now going to need to run your test even longer!
Statistical significance and statistical power are just the start of the things you should be caring about when running A/B tests. Are you comfortable with concepts like false positives, p-values, binomial distributions, null hypotheses, A/A tests, two-tailed tests, etc? If you’re from a design or marketing background then the answer is probably no. I know I wasn’t when I started running A/B tests. Do you understand what page 12 of this doc says? I certainly don’t, though I recommend reading the rest of that paper as it explains more on why so many A/B tests are badly done.
There’s no two ways around it: running A/B tests may have been made easy to do, but in reality it means doing proper maths, and unless you understand statistics you’re leaving yourself wide open to running bad tests. Even if your test reaches the magical statistical significance of 95% or even 99%, the results are no guarantee of anything. That’s mainly because of not running the test with a big enough sample size, as I touched on above, but also because significance on its own just doesn’t mean much. In fact a lot of scientific journals aren’t accepting it as a meaningful indication of anything.
This is work that should be left to experts. Just like you wouldn’t get your copywriter to write your backend code, if you’re serious about running a proper programme of A/B testing then you should be investing in a data scientist to do it. Don’t give it to the marketing intern.
I’m now a firm believer that time in small/medium companies is better spent running user tests than A/B tests. Even when I was running A/B tests, most of my ideas for what to test were coming from user testing. User tests are such rich sources of data on what users actually do and think when using your product, data that you can’t get any other way. At best, even a well-run A/B test tells you something about a single page or a single element. A well-run user test can tell you about the whole flow of a site and what people think of lots of different elements.
Even if you A/B test an old design of a page against a completely new one and you see the new one improves conversion, what have you learned to take forward to future designs? What were the parts that users loved? Because if you’re not careful you can end up interpreting the wrong things.
Qualitative data like user testing obviously doesn’t have the pressures of reaching the high levels of statistical rigour of quant testing. However, if five people are saying they can’t find your search box then you don’t need that rigour to know there’s a problem. The power of watching users struggle or give up tells you much more than any numbers can. And you only need to see a user discover a bug once to know that it needs fixing.
If you’re in the early days of a business or have a small amount of traffic then a programme of user testing will help you shape your offering and give your customers what they want, quicker and easier than A/B tests can. With remote testing anyone can feasibly be running user tests for a day every two weeks, potentially after each major release. You’ll have lots of rich ideas for what to change and knowledge as to why you should be changing them.
I’m not saying A/B testing can’t work. A considered and targeted programme run by a data scientist who understands statistics on a site with lots of traffic is likely to be an important part of any major business. If you don’t fall into that category then you should put your efforts elsewhere. Next time you see one of those start-up blog posts claiming they just increased sign-ups by 50% by testing one simple change, then be very suspicious. If they don’t publish their results and tell you how much traffic they had, then you can probably ignore it. Many A/B test result claims are too good to be true.