Ecommerce Experimentation Infrastructure

A/B testing with experimental integrity.

Check enough metrics and one wins by luck. Peek daily and stop on a good day. Slice by segment until something shines. With any of these habits, you can make almost any test look like a winner. These tools protect the internal validity of your experiments, so that when you call something a win, you can defend it in any room.

Two-proportion z-test
SRM detection
Bonferroni correction
Peeking detection
Survival analysis
Platform calibration
Bayesian shrinkage
// 07 tools · all client-side · no data leaves your browser
The workflow
01 — platform calibration
Illustration: assignment split, control 53.1% vs variant 46.9% (flagged)
PVL
Platform Validator
Verify your A/B platform is assigning and tracking users correctly before you trust any results. Surfaces assignment SRM, asymmetric tracking loss, and consent-driven coverage gaps.
Assignment SRM via chi-squared test
Asymmetric tracking loss detection
Suspicious coverage flag
Metric balance check (Bonferroni-adjusted)
Platform trust score 0–100
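As a rough illustration of the assignment SRM check, here is a minimal sketch of a chi-squared goodness-of-fit test on arm counts. The function name `srm_check`, the visitor counts, and the 0.001 threshold are assumptions made for this example, not the Validator's actual code or settings.

```python
import math


def srm_check(control_n, variant_n, expected_control_share=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test for sample ratio mismatch (1 degree of freedom).

    alpha=0.001 is a hypothetical strict threshold chosen for this sketch.
    """
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (variant_n - expected_variant) ** 2 / expected_variant
    )
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


# Hypothetical counts matching a 53.1% / 46.9% split on 10,000 visitors.
chi2, p_value, srm_detected = srm_check(5310, 4690)
print(f"chi2={chi2:.2f} p={p_value:.2e} SRM detected: {srm_detected}")
```

A split this lopsided on an intended 50/50 allocation is very unlikely to be chance, so results from such a test should not be trusted until the cause is found.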
02 — experiment governance
Illustration: significance test, p = .031 against α
LBX
Lockbox
Pre-register your experiment before it runs. Lock in your hypothesis, sample size, and success metrics. Prevents p-hacking, segment fishing, and post-hoc goalpost moving.
Sample size calculator with MDE & power
Two-proportion z-test results engine
Peeking & underpowered detection
Causal integrity sidebar
Experiment history in localStorage
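The sample size calculator and the results engine can be sketched with standard textbook formulas. This is an assumption-laden example, not Lockbox's own implementation: the function names, the relative MDE convention, and the defaults (two-sided alpha of 0.05, 80% power) are all chosen for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Visitors needed per arm to detect a relative lift of `relative_mde`."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = _STD_NORMAL.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p_value


# Hypothetical plan: 3% baseline conversion rate, 10% relative MDE.
needed = sample_size_per_arm(0.03, 0.10)
z, p = two_proportion_z_test(300, 10_000, 360, 10_000)
print(f"need {needed} per arm; z={z:.2f}, p={p:.3f}")
```

Comparing the planned sample size against the visitors actually collected is one simple way to flag an underpowered or prematurely stopped test.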
03 — time-to-conversion
Illustration: time-to-conversion curves, day 0 to day 14
SRV
Survival Curves
Conversion rate tells you how many visitors convert. This shows how fast, and whether the speed difference between variants is statistically real. A variant that converts faster is worth money even when the final rates look identical.
Kaplan-Meier survival curves
Log-rank significance test
Median time-to-conversion per variant
Hazard ratio with confidence interval
CSV cohort data import
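To make the Kaplan-Meier idea concrete, here is a minimal sketch for one variant. It treats a conversion as the event and a user who had not converted by the end of observation as censored. The function names and the cohort data are hypothetical, and the log-rank test and hazard ratio are left out to keep the example short.

```python
from collections import Counter


def kaplan_meier(durations, converted):
    """Return [(day, share_not_yet_converted)] for each day with a conversion.

    durations[i] is the day user i converted, or the last day observed if not.
    converted[i] is True for a conversion, False for a censored user.
    """
    events = Counter(d for d, c in zip(durations, converted) if c)
    exits = Counter(durations)  # everyone leaving the risk set on that day
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for day in sorted(exits):
        if events[day]:
            survival *= 1 - events[day] / at_risk
            curve.append((day, survival))
        at_risk -= exits[day]
    return curve


def median_time_to_conversion(curve):
    """First day on which at most half the cohort is still unconverted."""
    for day, survival in curve:
        if survival <= 0.5:
            return day
    return None  # fewer than half converted within the observation window


# Hypothetical cohort of 8 users observed for up to 14 days.
durations = [1, 2, 2, 3, 5, 7, 14, 14]
converted = [True, True, True, False, True, True, False, False]
curve = kaplan_meier(durations, converted)
median_day = median_time_to_conversion(curve)
print(curve, median_day)
```

Running the same function on each arm gives two curves to compare; whether their gap is statistically real is the job of the log-rank test.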
04 — winner's curse & bayesian analysis
Illustration: observed vs honest lift, from a 0% baseline: observed +18%?, honest +6%
RCK
Reality Check
Your test came back a winner, but by how much, really? Underpowered tests that reach significance systematically overstate the effect. Deflate your result before you announce it, and see the probability your variant is actually better.
Winner's curse detection & shrinkage estimation
Bayesian P(B beats A) with sceptical priors
95% credible interval on true lift
Naive vs honest revenue projection
Frequentist vs Bayesian explainer for ecom
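One simple way to deflate a result is a normal-normal shrinkage model with a sceptical prior centred on zero lift. The sketch below assumes that model; the function name `shrink_lift`, the prior width, and the input numbers are hypothetical, and Reality Check itself may use a different model.

```python
import math
from statistics import NormalDist


def shrink_lift(observed_lift, standard_error, prior_sd):
    """Posterior for the true lift under a Normal(0, prior_sd^2) sceptical prior."""
    prior_var = prior_sd ** 2
    noise_var = standard_error ** 2
    weight = prior_var / (prior_var + noise_var)  # share of the estimate we keep
    post_mean = weight * observed_lift
    post_sd = math.sqrt(weight * noise_var)
    posterior = NormalDist(post_mean, post_sd)
    return {
        "honest_lift": post_mean,
        "credible_interval": (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)),
        "p_b_beats_a": 1 - posterior.cdf(0.0),
    }


# Hypothetical noisy winner: +18% observed lift with an 8-point standard error,
# judged against a prior that says most true lifts fall within about 5 points of zero.
result = shrink_lift(observed_lift=0.18, standard_error=0.08, prior_sd=0.05)
print(result)
```

With these inputs the +18% headline shrinks to roughly +5%, because the noisier the test is relative to the prior, the less of the observed lift survives.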
05 — experiment integrity report
Illustration: stamped readout, integrity grade A
RCP
Test Receipt
Anyone can screenshot a dashboard and call it a win. A receipt proves the win was earned: a stamped, printable integrity report covering registration, calibration, attestations, and the honest effect estimate.
Imports registered experiments from Lockbox
7 integrity attestations, weighted A–F grade
Honest (shrunk) estimate + P(B beats A)
SHA-256 fingerprint — tamper-evident
Print-ready — attach to any test readout
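A tamper-evident fingerprint can be sketched as a SHA-256 hash over a canonical serialisation of the report. The field names below are hypothetical, and the receipt's real schema and canonicalisation rules are not described here, so treat this as an illustration of the idea only.

```python
import hashlib
import json


def fingerprint(report):
    """SHA-256 hex digest of a report, independent of key order."""
    # Sorted keys and fixed separators give one canonical byte string per report.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical receipt contents.
report = {
    "experiment": "checkout-button-copy",
    "registered_before_launch": True,
    "grade": "A",
    "honest_lift": 0.06,
}
stamp = fingerprint(report)
print(stamp)

# Any edit to the report, however small, produces a different fingerprint.
tampered = dict(report, honest_lift=0.18)
assert fingerprint(tampered) != stamp
```

A hash only proves integrity if the original fingerprint is recorded somewhere the editor of the report cannot also change, such as the printed copy or a separate log.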
06 — claimed vs realized
Illustration: claimed vs actual lift, claimed +40%, actual +6%
LGR
Program Ledger
"You announced +40% cumulative lift this year. Why is revenue flat?" Log every shipped winner, enter your actual monthly numbers, and see whether the wins are showing up in reality. No platform builds this view, because it audits them too.
Ledger of shipped winners — claimed & honest lift
Monthly actuals from your analytics
Claimed vs honest vs actual CVR trajectory
Program realization rate
JSON export / import for backup
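The claimed-versus-actual comparison comes down to simple arithmetic. The sketch below assumes that claimed lifts compound multiplicatively and defines the realization rate as actual lift divided by claimed cumulative lift. Both the definition and the numbers are assumptions for illustration; the Ledger may compute it differently.

```python
def compound_lift(lifts):
    """Cumulative lift from a sequence of relative lifts applied one after another."""
    total = 1.0
    for lift in lifts:
        total *= 1 + lift
    return total - 1


def realization_rate(claimed_lifts, baseline_cvr, actual_cvr):
    """Share of the claimed cumulative lift that shows up in actual conversion rate."""
    claimed = compound_lift(claimed_lifts)
    if claimed == 0:
        raise ValueError("no claimed lift to compare against")
    actual = actual_cvr / baseline_cvr - 1
    return actual / claimed


# Hypothetical year: four shipped winners, CVR moved from 2.50% to 2.65%.
claimed_lifts = [0.10, 0.12, 0.08, 0.05]
claimed_total = compound_lift(claimed_lifts)
rate = realization_rate(claimed_lifts, baseline_cvr=0.0250, actual_cvr=0.0265)
print(f"claimed {claimed_total:+.1%}, realized {rate:.0%} of it")
```

Here four modest-looking wins compound to a claim of nearly +40%, while the measured conversion rate rose by 6%, so only about 15% of the claim was realized.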
07 — subscription ltv & test valuation
Illustration: subscriber retention → value, break-even at 4.2 mo, starting from 100% active
SUB
Subscriber Value
When the test goal is subscriptions, your analytics counts a signup as one order and quietly buries your best variant. Model subscriber LTV in scenarios, get the exchange rate against one-off orders, and value the trade-off — even with zero subscription data.
LTV in three churn scenarios, not one guess
1 subscriber = X one-off orders exchange rate
Value-per-visitor comparison across arms
Break-even lifetime — no churn data needed
Industry churn benchmarks built in
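A minimal sketch of the subscription arithmetic, assuming a constant monthly churn rate so that expected lifetime is 1 / churn months. The scenario churn rates, order values, and function names are hypothetical placeholders, not the built-in benchmarks.

```python
def subscriber_ltv(monthly_value, monthly_churn):
    """Expected lifetime value under constant monthly churn (lifetime = 1 / churn)."""
    return monthly_value / monthly_churn


def break_even_months(one_off_order_value, monthly_value):
    """Months a subscriber must stay to be worth one one-off order. Needs no churn data."""
    return one_off_order_value / monthly_value


def value_per_visitor(order_rate, order_value, signup_rate, ltv):
    """Expected value of one visitor to an arm, counting both outcomes."""
    return order_rate * order_value + signup_rate * ltv


# Hypothetical inputs: a 20/month subscription versus an 84 one-off order.
MONTHLY_VALUE, ONE_OFF_VALUE = 20.0, 84.0
scenarios = {"pessimistic": 0.15, "base": 0.10, "optimistic": 0.06}  # monthly churn

ltv = {name: subscriber_ltv(MONTHLY_VALUE, churn) for name, churn in scenarios.items()}
exchange_rate = {name: value / ONE_OFF_VALUE for name, value in ltv.items()}
break_even = break_even_months(ONE_OFF_VALUE, MONTHLY_VALUE)

# Arm A wins on orders, arm B wins on signups; compare under the base scenario.
arm_a = value_per_visitor(0.030, ONE_OFF_VALUE, 0.002, ltv["base"])
arm_b = value_per_visitor(0.026, ONE_OFF_VALUE, 0.006, ltv["base"])
print(ltv, exchange_rate, break_even, arm_a, arm_b)
```

In this example arm B loses on order count but wins on value per visitor, which is the kind of result an order-only readout would hide.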
// why this exists

Most ecommerce A/B testing is governance theatre. Teams pick their metric after seeing the results, peek at significance daily, and ship winners produced by platforms nobody ever bothered to calibrate. The math was never the problem.

Each tool here guards a different failure point. The Validator checks whether your testing platform is telling you the truth. Lockbox locks your hypothesis in before any data exists. Reality Check deflates inflated winners before you announce them, and the Ledger asks the uncomfortable year-end question: did any of it actually show up in revenue?

Everything runs in your browser. There are no accounts and nothing gets sent to a server, which also means you can use these on client data without asking anyone's permission.