LOCKBOX

Scientific discipline for A/B testing. Design your experiment upfront, enter results when ready, and get a rigorous verdict with every guardrail built in.

// why lock anything in a box?

The first rule of research is that the questions must be fixed before the answers arrive. If you choose your success metric, your sample size, and your stopping point after seeing the data, you can make almost any test look like a winner. Check enough metrics and one will be significant by luck. Peek at results daily and stop on a good day. Slice by segment until something shines. None of that is fraud in the moment. It just quietly guarantees that what you ship is noise.

Science had this exact disease and found the cure: pre-registration. Since 2005, leading medical journals have refused to publish clinical trials that did not register their hypothesis and primary outcome before collecting data. The effect was brutal. In one famous analysis of large heart-disease trials, 57% reported positive results before registration became mandatory. After it, 8%. The drugs didn't get worse. The researchers just lost the ability to move the goalposts.

The Lockbox does for your A/B tests what trial registration did for medicine. You write down the hypothesis, the primary metric, the sample size, and the end date before the test runs. When the results come in, the tool checks you against your own commitments: did you run the full duration, did you keep the metric, are you powered enough to believe the answer. A win that survives the Lockbox is a win you can defend in any room.

"The first principle is that you must not fool yourself, and you are the easiest person to fool."

RICHARD FEYNMAN — physicist, Nobel laureate

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of."

RONALD FISHER — founder of modern experimental design


// the math — every calculation this tool runs, in full

Required sample size — two-proportion normal approximation

Given baseline rate p₁, target rate p₂ = p₁ + MDE (with the MDE expressed in absolute percentage points), their average p̄ = (p₁ + p₂)/2, significance level α, and power 1−β:

n per variant = ( z₁₋α/₂ · √(2·p̄(1−p̄)) + z₁₋β · √(p₁(1−p₁) + p₂(1−p₂)) )² / (p₂ − p₁)²

z values come from the inverse normal CDF (Beasley-Springer-Moro / Abramowitz & Stegun approximation). Power is the probability of detecting the effect if it truly exists; α is the probability of a false alarm if it doesn't. If you enter MDE as a relative lift, it's converted first: absolute pp = baseline × relative% / 100.
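As a rough illustration, the sample size formula above can be sketched in Python. The function name and its parameters are hypothetical, and the z values here come from the standard library's `NormalDist.inv_cdf` rather than the approximation the tool uses, so results may differ slightly in the last digit.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_per_variant(baseline, mde, alpha=0.05, power=0.80, relative=False):
    """Users needed in each arm to detect a lift of `mde` over `baseline`.

    `baseline` and `mde` are percentages (5.0 means 5%).
    With relative=True, `mde` is a relative lift in percent of the baseline.
    """
    if relative:
        # absolute pp = baseline x relative% / 100
        mde = baseline * mde / 100.0
    p1 = baseline / 100.0
    p2 = p1 + mde / 100.0
    p_bar = (p1 + p2) / 2.0

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta

    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    # Round up: a fraction of a user still means one more user.
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # 5% baseline, 1 percentage point absolute lift, default alpha and power
    print(required_n_per_variant(5.0, 1.0))
    # The same lift expressed as a 20% relative lift
    print(required_n_per_variant(5.0, 20.0, relative=True))
```

Both calls describe the same target rate (6%), so they return the same n. Halving the MDE roughly quadruples the requirement, because the lift is squared in the denominator.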

Duration estimate

The smaller arm is the bottleneck, so with weekly traffic W and split share s of the smaller arm:

weeks = n / (W × s)

Here n is the required sample size per variant. For example, a 50/50 split gives s = 0.5.

The "you can detect ≥X% in k weeks" table inverts the sample size formula by binary search: it finds the smallest MDE whose required n fits in the users you'd collect in k weeks.
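The duration estimate and the binary-search inversion can be sketched as follows. This is a minimal sketch, not the tool's own code: the function names, the tolerance, and the search bounds are assumptions, and rates are passed as fractions (0.05 for 5%).

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size from the two-proportion normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p2 - p1) ** 2)


def weeks_needed(n, weekly_traffic, smaller_share=0.5):
    """Weeks until the smaller arm has collected n users."""
    return n / (weekly_traffic * smaller_share)


def smallest_detectable_mde(p1, weekly_traffic, weeks, smaller_share=0.5, tol=1e-6):
    """Smallest absolute lift whose required n fits in `weeks` of traffic."""
    budget = weekly_traffic * smaller_share * weeks  # users in the smaller arm
    lo, hi = tol, 1.0 - p1 - tol                     # keep p2 strictly below 1
    if required_n(p1, p1 + hi) > budget:
        return None  # not even the largest possible lift is detectable
    # Invariant: `hi` always fits in the budget, `lo` is not known to fit.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if required_n(p1, p1 + mid) <= budget:
            hi = mid  # fits: try a smaller lift
        else:
            lo = mid  # does not fit: a larger lift is needed
    return hi


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        mde = smallest_detectable_mde(0.05, weekly_traffic=10_000, weeks=k)
        print(k, "weeks ->", None if mde is None else round(100 * mde, 2), "pp")
```

The search relies on the required n shrinking as the lift grows, which holds for this formula over ordinary conversion rates.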

Results — two-proportion z-test

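With control (n₁ users, x₁ conversions) and variant (n₂ users, x₂ conversions), the observed rates are p₁ = x₁/n₁ and p₂ = x₂/n₂, and the pooled rate is p̂ = (x₁+x₂)/(n₁+n₂). With Φ the standard normal CDF:

z = (p₂ − p₁) / √( p̂(1−p̂)(1/n₁ + 1/n₂) )

p-value = 2·(1 − Φ(|z|))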

The 95% confidence interval on the difference uses the unpooled standard error: diff ± 1.96·√(p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂). The test is two-sided: it doesn't assume in advance which way the effect goes.
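The test and the interval above fit in a few lines of Python. This is a sketch under assumptions: the function name and the returned dictionary keys are hypothetical, Φ comes from the standard library's `NormalDist`, and there is no guard for degenerate inputs such as zero conversions in both arms.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(n1, x1, n2, x2, confidence=0.95):
    """Two-sided z-test on the difference in conversion rate (variant - control)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1

    # Test statistic: pooled standard error, valid under "no difference".
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error, no assumption of equality.
    se_unpooled = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


if __name__ == "__main__":
    # Control: 500 of 10,000 convert. Variant: 600 of 10,000 convert.
    result = two_proportion_ztest(10_000, 500, 10_000, 600)
    print(result)
```

The pooled error is used for the test because the null hypothesis says both arms share one rate. The interval describes the observed difference, so it keeps each arm's own variance.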

Why peeking invalidates α

α = 0.05 means: if there is no real effect, a single look at the final data has a 5% chance of a false positive. Each additional look is another chance for noise to cross the threshold, and it only has to cross once for you to stop and declare a winner. Checking daily over a multi-week test pushes the real false positive rate toward 20–30% while your dashboard still says "95% confidence". That's why the tool flags any result entered before the pre-committed end date: the stated α no longer describes the actual risk taken.
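A small simulation makes the inflation concrete. The sketch below is illustrative only and rests on simplifying assumptions: there is no true effect (an A/A test), every day contributes the same amount of data, and each day's result is modeled as one unit of standardized normal noise. The function name, the 28 daily looks, and the seed are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(looks=28, simulations=2000, alpha=0.05, seed=1):
    """Compare 'stop at any significant look' with 'look once at the end'.

    Returns (rate_with_peeking, rate_with_single_final_look).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for day in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one day of pure noise
            z = running_sum / sqrt(day)         # z-score of the data so far
            if abs(z) > z_crit:
                crossed = True                  # a peeker would stop here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit           # only the last look counts
    return peeking_hits / simulations, final_hits / simulations


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"false positives with daily peeking: {peeking:.1%}")
    print(f"false positives with one final look: {final_only:.1%}")
```

The single final look stays near the nominal 5%, while stopping at the first significant day lands several times higher, even though no variant is actually better.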

Why underpowered wins are inflated

If power is low, the noise is large relative to the true effect. The only samples that cross the significance threshold are the ones where noise pushed the measurement up. So conditional on winning, the estimate is biased upward — the winner's curse. This tool flags underpowered results; Reality Check quantifies the correction.
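The winner's curse is also easy to see in simulation. In this sketch the measured lift is modeled as the true effect plus normal noise, and a "win" is a result that is significant in the positive direction. The function name, the default effect and standard error (chosen to give low power), and the seed are assumptions made for illustration. It does not reproduce whatever correction Reality Check applies.

```python
import random
from statistics import NormalDist, mean


def winners_curse(true_effect=1.0, std_error=1.0, simulations=20_000,
                  alpha=0.05, seed=1):
    """Average measured effect overall vs. among significant positive results."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Each experiment measures the true effect plus noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(simulations)]
    # Keep only the experiments that would have been declared winners.
    winners = [e for e in estimates if e / std_error > z_crit]
    return {
        "power": len(winners) / simulations,
        "mean_all": mean(estimates),
        "mean_winners": mean(winners),
    }


if __name__ == "__main__":
    result = winners_curse()
    print(f"share of experiments that win: {result['power']:.1%}")
    print(f"average estimate, all runs:    {result['mean_all']:.2f}")
    print(f"average estimate, winners:     {result['mean_winners']:.2f}")
```

Across all runs the estimate is unbiased. Among winners it cannot be: a result has to exceed z_crit × std_error to win at all, which with these defaults is already about twice the true effect.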