You ran a test and it won. Here is the uncomfortable math: in an underpowered test, real effects are usually too small for the noise to let them reach significance. The only way a modest true effect passes a noisy test is by getting a lucky bounce that makes it look bigger than it is. So the tests that "win" are precisely the ones whose measured lift is inflated. Statisticians call it the winner's curse, and it punishes small samples hardest.
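The size of that inflation can be worked out directly rather than taken on faith. The sketch below assumes the observed difference is approximately normal around the true difference and that a "win" means clearing the two-sided 5% threshold in the positive direction. The store figures and the function name are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def expected_winning_lift(true_diff, se, alpha=0.05):
    """Return (power, mean observed difference among significant wins).

    Assumes the observed difference d ~ Normal(true_diff, se^2) and that a
    "win" means d / se exceeds the two-sided critical value on the positive side.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    a = z_crit - true_diff / se           # win threshold in standardised units
    power = 1 - std.cdf(a)                # chance the test declares a win
    # Mean of a normal truncated from below (inverse Mills ratio).
    mean_if_win = true_diff + se * std.pdf(a) / power
    return power, mean_if_win

# Hypothetical store: 2.0% baseline, true rate of 2.2% for B, 5,000 visitors per arm.
n = 5_000
p_a, p_b = 0.020, 0.022
se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
power, mean_if_win = expected_winning_lift(p_b - p_a, se)
exaggeration = mean_if_win / (p_b - p_a)
print(f"power = {power:.2f}, winners overstate the lift {exaggeration:.1f}x on average")
```

With inputs like these the test rarely wins, and when it does, the measured lift is on average several times the true one. Raising `n` pushes the exaggeration factor back toward 1.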
Science measured this on itself. In 2015, the Open Science Collaboration reran 100 psychology studies that had already passed peer review. On the second run, the average effect was roughly half the published size. The originals weren't fabricated. They were selected: journals printed the lucky bounces. Every time you announce the observed lift of a barely-significant test, you are running the same selection process on your own company.
Reality Check deflates your observed lift with Bayesian shrinkage, shows the probability your variant is better at all, and turns the honest estimate into an honest revenue projection. The first number you announce becomes the anchor. Make sure it's one reality can live up to.
"Extraordinary claims require extraordinary evidence."
CARL SAGAN — astronomer and science communicator
From your four inputs (conversions c_A and c_B, visitors n_A and n_B): p_A = c_A/n_A, p_B = c_B/n_B, and the observed absolute difference d = p_B − p_A. Its standard error:
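Assuming the usual unpooled form for a difference of two independent proportions (a standard choice, though the tool's exact variant is not confirmed by the text), that is:

```latex
\mathrm{SE} = \sqrt{\frac{p_A\,(1 - p_A)}{n_A} + \frac{p_B\,(1 - p_B)}{n_B}}
```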
SE is the size of a typical random fluctuation in d. When SE is large relative to d, the data alone can't pin the effect down — that's where the prior starts to matter.
The prior encodes "before seeing this test, what lifts are plausible?" as a normal distribution centred on zero effect:
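In symbols, writing δ for the true absolute difference and τ for the prior's standard deviation (both names are notation assumed here, not taken from the tool):

```latex
\delta \sim \mathcal{N}\!\left(0,\ \tau^{2}\right)
```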
Centred on zero because most ecom tests move nothing; the width says how surprised you'd be by a big true effect. Choose the width before looking at results: picking the prior that flatters your result is p-hacking with extra steps.
With a normal prior and (approximately) normal data, the posterior has a closed form: a precision-weighted average of prior and data. Precision = 1/variance.
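Written out for a zero-centred prior with standard deviation τ (the symbol τ is notation assumed here; d, SE and μ_post are as used in the text), the standard normal-normal update is:

```latex
\mu_{\text{post}}
  = \frac{d / \mathrm{SE}^{2}}{1/\tau^{2} + 1/\mathrm{SE}^{2}}
  = d \cdot \frac{\tau^{2}}{\tau^{2} + \mathrm{SE}^{2}},
\qquad
\sigma_{\text{post}}^{2} = \frac{1}{1/\tau^{2} + 1/\mathrm{SE}^{2}}
```

The prior contributes no term to the numerator because its mean is zero; it only adds precision to the denominator, which is what pulls the estimate toward zero.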
Read it as a tug-of-war: when your sample is large, SE is tiny, the data term dominates, and μ_post ≈ d (barely any deflation). When the sample is small, the prior pulls the estimate toward zero. The "inflation shaved off" figure is 1 − μ_post/d. This is the same mechanism used in empirical-Bayes estimation at large experimentation platforms.
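The tug-of-war is easy to watch numerically. This is a minimal sketch, assuming an unpooled standard error and a prior expressed as a standard deviation in absolute conversion-rate units; the function name, the `prior_sd` parameter and the input counts are all hypothetical.

```python
from math import sqrt

def shrink(c_a, n_a, c_b, n_b, prior_sd):
    """Normal-normal shrinkage of an observed conversion-rate difference.

    prior_sd is the standard deviation of a zero-centred normal prior on the
    true absolute difference (an assumed parameterisation, in rate units).
    Returns (d, se, posterior mean, posterior sd, inflation shaved off).
    """
    p_a, p_b = c_a / n_a, c_b / n_b
    d = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # unpooled SE
    prec_prior, prec_data = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (prec_prior + prec_data)
    post_mean = post_var * prec_data * d   # prior mean is 0, so its term vanishes
    shaved = 1 - post_mean / d             # undefined when d == 0
    return d, se, post_mean, sqrt(post_var), shaved

# Same observed rates (2.0% vs 2.6%) at two sample sizes, prior sd of 0.2pp.
small = shrink(100, 5_000, 130, 5_000, prior_sd=0.002)
large = shrink(10_000, 500_000, 13_000, 500_000, prior_sd=0.002)
print(f"small sample: d={small[0]:.4f}, posterior mean={small[2]:.4f}, shaved={small[4]:.0%}")
print(f"large sample: d={large[0]:.4f}, posterior mean={large[2]:.4f}, shaved={large[4]:.0%}")
```

With identical observed rates, the small sample loses most of its lift to shrinkage while the large one keeps nearly all of it.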
Unlike a confidence interval, the credible interval means what people think it means: given the data and prior, there's a 95% probability the true lift is inside it.
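For a normal posterior the interval is just a pair of quantiles. A minimal sketch, assuming an equal-tailed interval and using a hypothetical posterior mean and standard deviation as inputs:

```python
from statistics import NormalDist

def credible_interval(post_mean, post_sd, level=0.95):
    """Equal-tailed credible interval for a normal posterior."""
    posterior = NormalDist(post_mean, post_sd)
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)

# Hypothetical posterior: mean +0.18pp, sd 0.17pp (absolute conversion-rate units).
low, high = credible_interval(0.0018, 0.0017)
print(f"95% credible interval: [{low:+.4f}, {high:+.4f}]")
```

An interval that straddles zero, as this one does, says the data and prior together have not ruled out "no effect".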
The classical p-value is shown so you can watch it disagree with the Bayesian read: an underpowered test can be "significant" while P(B>A) stays unconvincing under a sceptical prior. That disagreement is the winner's curse being caught in real time.
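The disagreement is easy to reproduce. The sketch below assumes an unpooled two-sided z-test on the frequentist side and a zero-centred normal prior on the Bayesian side; the counts, the prior width and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def compare_reads(c_a, n_a, c_b, n_b, prior_sd):
    """Return (two-sided p-value, posterior P(B > A)) under a normal approximation."""
    std = NormalDist()
    p_a, p_b = c_a / n_a, c_b / n_b
    d = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    p_value = 2 * (1 - std.cdf(abs(d) / se))      # unpooled z-test
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)  # precisions add
    post_mean = post_var * d / se**2              # prior mean is 0
    prob_b_better = 1 - NormalDist(post_mean, sqrt(post_var)).cdf(0.0)
    return p_value, prob_b_better

# Hypothetical underpowered test read through a sceptical prior (sd = 0.1pp).
p_value, prob = compare_reads(100, 5_000, 130, 5_000, prior_sd=0.001)
print(f"p-value = {p_value:.3f}, P(B>A) = {prob:.0%}")
```

Here the p-value slips under 0.05 while the posterior probability that B is better stays well short of 95%.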
Same arithmetic, three different effect estimates. The gap between naive and honest is what your slide deck would have over-promised.
There's a recurring argument (on Reddit, in CRO communities, everywhere) that A/B testing is "statistically not viable" for most ecommerce stores. The math behind the complaint is real: with classical (frequentist) statistics at a two-sided 5% significance level, a store with 30,000 monthly sessions and a 2% conversion rate needs roughly 8–12 weeks per test to have even a 50% chance of detecting a 10% relative lift, and closer to five months for the conventional 80% power. Most stores don't have that patience, so they either stop testing or, worse, run underpowered tests and trust the results anyway.
But the conclusion "testing isn't viable" is wrong. What's not viable is using the wrong statistical framework for the traffic you have.
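The traffic arithmetic behind that complaint can be reproduced with the textbook two-proportion sample-size formula. This is a sketch under assumptions the text does not state: a two-sided α of 0.05, an even two-arm split, and every session entering the test. The function names are illustrative. Because the required duration depends heavily on the power you demand, the loop prints two settings.

```python
from math import ceil
from statistics import NormalDist

def sessions_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    std = NormalDist()
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_total = std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_total**2 * variance / (p2 - p1) ** 2)

def weeks_needed(per_arm, monthly_sessions, arms=2):
    """Weeks of traffic required, assuming every session enters the test."""
    weekly = monthly_sessions * 12 / 52
    return per_arm * arms / weekly

for power in (0.50, 0.80):
    n = sessions_per_arm(0.02, 0.10, power=power)
    print(f"power {power:.0%}: {n:,} sessions per arm, "
          f"{weeks_needed(n, 30_000):.0f} weeks at 30,000 sessions/month")
```

Halving the detectable lift roughly quadruples the requirement, which is why small stores run out of calendar long before they run out of ideas.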
Classical testing answers a strange question: "If there were truly no difference between A and B, how surprising is my data?" That's what a p-value is. It never tells you the probability that B is better — it tells you how weird your data would look in a hypothetical world where B does nothing.
This framework was designed for agricultural experiments in the 1920s, where you plant a field once and analyse once. It demands a fixed sample size decided upfront, forbids looking at results early (peeking inflates false positives dramatically), and gives a binary significant/not-significant answer that's routinely misread. For a low-traffic store, this is brutal: you commit to 10 weeks blind, you can't stop early even when the signal is obvious, and at the end you get a yes/no instead of a decision-ready number.
Bayesian analysis answers the question you actually have: "Given the data I've seen, what's the probability that B is better than A — and by how much?"
An answer in that form is directly usable in a business decision. You can weigh it against implementation cost, risk appetite, and opportunity cost, the way you'd weigh any other business decision under uncertainty.
1. Traffic reality. Bayesian inference doesn't collapse below a magic sample size. With 2,000 visitors per arm you get a wider, more honest credible interval instead of a meaningless "not significant". The evidence you have is quantified, not discarded.
2. Business decisions are already Bayesian. No merchant thinks in terms of rejecting null hypotheses. They think "how confident am I this works, and what does it cost me if I'm wrong?" — which is literally the Bayesian decision framework (expected loss). The statistics should match the decision.
3. Priors are honesty, not cheating. A decade of published CRO data shows most ecom tests move conversion by very little — big lifts of +20% are rare and usually don't replicate. Encoding this as a sceptical prior automatically deflates too-good-to-be-true results. That's the winner's curse protection built directly into the math, which is exactly what the tool above does.
4. You can stop when you know enough. With a pre-agreed decision rule (e.g. "ship when P(B>A) ≥ 95% and the expected loss of shipping is below 0.1pp"), monitoring continuously is legitimate. For a store that can't afford 12-week tests, this typically cuts test duration by 30–50% versus fixed-horizon testing.
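The decision rule quoted in point 4 can be sketched in a few lines. This assumes a normal posterior on the true absolute difference (B minus A) and defines expected loss as the average conversion rate given up if B is shipped and turns out worse. The function names and example posteriors are hypothetical; the thresholds are the ones from the example rule, with 0.1pp written as 0.001.

```python
from statistics import NormalDist

def decision_metrics(post_mean, post_sd):
    """P(B > A) and expected loss of shipping B, for a normal posterior on the
    true absolute difference (B minus A)."""
    std = NormalDist()
    z = post_mean / post_sd
    prob_b_better = std.cdf(z)
    # E[max(-delta, 0)]: average conversion rate given up if B is actually worse.
    expected_loss = post_sd * std.pdf(z) - post_mean * std.cdf(-z)
    return prob_b_better, expected_loss

def should_ship(post_mean, post_sd, min_prob=0.95, max_loss=0.001):
    """Example rule from the text: P(B>A) >= 95% and expected loss below 0.1pp."""
    prob, loss = decision_metrics(post_mean, post_sd)
    return prob >= min_prob and loss < max_loss

# Two hypothetical posteriors, in absolute conversion-rate units.
print(should_ship(0.0040, 0.0020))    # strong evidence
print(should_ship(0.0006, 0.00095))   # shrunk, still uncertain
```

The point of fixing `min_prob` and `max_loss` in advance is that the rule, not the analyst's mood on a given morning, decides when the test ends.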
The bottom line: if you have Amazon-scale traffic, frequentist testing works fine when you run it by the book. If you're a normal ecom store, Bayesian methods are simply the correct way to make decisions under the uncertainty you actually have. What's not viable is pretending your underpowered frequentist test gave you certainty. That's what this tool is here to check.