There's a 5% chance that these results are total bullshit « Metric Sparrow | Jesse Avshalomov

Warning! This post has fallen victim to the base rate fallacy and needs to be amended. While the overall message of ‘push for greater statistical significance’ still rings true, the statistical conclusions depicted below are dubious at best. My apologies for spreading a popular misconception, and thank you to Peep Laja and Ryan Thompson for helping to shed light in the darkness. Updated article coming soon.

In the world of A/B testing, determining statistical significance can be a precarious business. We're taking techniques that were designed for static sample sizes and applying them to continuous datasets, a context they were never intended for. Of course there are going to be issues.

Compounding that complexity, I'm sure many of you have had encounters wherein teammates wanted to call the result of an A/B test early, before it was really done baking.

This is why writers like Evan Miller have penned posts on the perils of A/B testing, and companies have gone so far as to rebuild traditional statistics to produce more bulletproof testing readouts.

It's why people argue for A/A testing to illustrate these dangers, stirring others to rebel against A/A testing as wasted testing time.

Yet even with these lessons and safeguards in place, so long as testing software continues to visualize anecdotal results, there will come a day when you're asked to call a test prematurely.

And it'll be tempting too! The acclaim that comes with a successful test, the unconscious desire to see your hypothesis confirmed, the pressure to show results: it's enough to bias the very best.

But is that test you're running really a winner?

As guardians of the sanctity of our test results, we have to be strong when no one else will. So, from one guardian to another, allow me to unveil my favorite new tool in this struggle: a simple rhetorical device to reframe the current state of test results.

Stop saying: We've reached 95% statistical significance.

And start saying: There's a 5% chance that these results are total bullshit.

Seems harsh at first, right?

But both statements are equally accurate; the latter just reminds us of the truth: that 95% statistical significance means there's still a 1-in-20 chance that our seemingly victorious experiment is actually the result of random variation.[1]

Let me put that another way: if you're running squeaky-clean A/B tests at 95% statistical significance and you run 20 tests this year, odds are one of the results you report (and act on) is going to be straight-up wrong. You are not being a jerk by reminding people of this; you are doing them a favor, even if it doesn't feel like it.
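That 1-in-20 figure is just expected-value arithmetic, and it's easy to check for yourself. A minimal sketch, assuming a hypothetical year of 20 independent tests, each run at a 5% false-positive rate:

```python
# Back-of-the-envelope check of the 1-in-20 claim, assuming 20
# independent A/B tests, each run at 95% significance (alpha = 0.05).
alpha = 0.05    # per-test chance of a false positive
n_tests = 20

# Expected number of bogus "winners" reported over the year
expected_false_positives = alpha * n_tests

# Chance that at least one of the 20 results is a false positive
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")  # 1.0
print(f"P(at least one): {p_at_least_one:.0%}")  # 64%
```

So on average you'll report one false winner a year, and there's roughly a 64% chance you report at least one.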

When your colleagues and co-founders are making decisions based on these test results, they're depending on you to serve as a lens of truth. They deserve to understand the test results they see, and they deserve to work from numbers that they can truly trust.

Yes, it will take longer. Yes, we all want to move fast. Yes, 95% is the industry standard.

I'm here to tell you that you can and should do better than the industry standard. You're not in this line of work to provide the industry standard. Being wrong 1 time in 100 is a radically better outcome for your company than being wrong 1 time in 20, especially when you've got million-dollar decisions on the line.[2]
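To put hypothetical numbers on that trade-off, here's the same year of 20 independent tests, comparing the chance of reporting at least one bogus winner at the two thresholds:

```python
# Hypothetical comparison: 20 independent tests at 95% vs. 99% significance.
n_tests = 20
for alpha in (0.05, 0.01):
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    print(f"alpha={alpha}: P(at least one bogus result) = {p_any_false_positive:.0%}")
```

At 95% significance that works out to roughly a 64% chance of at least one false positive over the year; at 99% it drops to roughly 18%.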

There are thousands of choices you'll have to make with insufficient data, but this isn't one of them. You owe it to your team to achieve superior clarity; all you're asking for is a little patience to get the job done right.

Put your own personal spin on this rhetorical technique if you need to, but give it a try — it's a quick way to remind stakeholders of the fallibility of our A/B tests and push for a higher degree of statistical confidence (hello, 99%).

That's a result we can be proud of, that's a result we can get behind, that's a result we can build a culture of continuous testing on.

So say it with me, one more time: “There's a 5% chance that these results are total bullshit.”

Good luck out there, and be strong.

[1] That's assuming you're running at high statistical power with a single variant! Never mind if you raise your p-value threshold or increase the number of experimental variations; the likelihood of Type I and Type II errors mounts very, very quickly, and you'll be getting yourself into all sorts of statistical nastiness.

[2] I will of course acknowledge that many companies do not have the leisure of reaching the highest levels of statistical significance due to traffic constraints. If this is the case for your company, it's still incredibly important to emphasize the accuracy or inaccuracy of whatever test results you achieve.