Every conclusion you draw from an A/B test rests on one assumption: that the platform split your visitors randomly and counted them correctly. Nobody checks. An uncalibrated platform doesn't fail loudly. It hands you clean-looking numbers that are quietly wrong, and you ship decisions on them for years.
This is not a hypothetical. Microsoft audited its own experimentation platform and found that roughly 6% of experiments showed a sample ratio mismatch: the assignment split didn't match the configuration, which corrupts the test regardless of what the results say. That's roughly one broken test in every 17, at a company with a dedicated experimentation team. Any lab scientist calibrates the instrument before trusting the measurement. Your testing platform is the instrument.
One note on expectations: A/B platforms were never built to mirror GA4 or Adobe Analytics, so some gap between their numbers is normal. This tool doesn't check whether the numbers match perfectly. It checks whether the gaps that exist could compromise the validity of your test. Is the randomisation sound? Is tracking loss symmetric across groups? And is your coverage still a representative slice of your audience, or are you effectively testing on a self-selected minority?
"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
ARTHUR CONAN DOYLE — Sherlock Holmes, A Scandal in Bohemia
This checks coverage, not validity. A gap between your platform and analytics totals is normal — consent rejection, ad-blockers, and tag timing all cause it. What matters for test validity is whether that gap is symmetric across Control and Variant, which is checked separately below. Use these numbers to understand reporting alignment, not to judge whether the test is trustworthy.
Enter your experiment numbers below. You want to see no significant difference between control and variant, which means your platform is assigning users fairly. A significant result here is a red flag: it means your platform is biased, not that your variant is winning.
Paste day-by-day tracked session counts for each group, comma-separated. This checks whether assignment was stable over time. A common sign of platform issues is one group getting disproportionately more traffic on specific days.
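Given your configured split (e.g. 50/50) and the observed counts, we compute the expected count per arm and test whether the deviation is larger than randomness allows, using the chi-square statistic χ² = Σ (observed − expected)² / expected, summed over arms (df = 1 for two arms).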
The p-value is the probability of seeing a deviation at least this large if the split were truly correct. Severity: p < 0.001 critical (−40 trust), p < 0.01 serious (−25), p < 0.05 moderate (−12). The same test runs twice: on assigned counts (Assignment SRM, a randomization failure) and on tracked counts (Tracking SRM, a pipeline failure). The distinction matters because the fixes are different.
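The check described above can be sketched in a few lines of Python. This is a minimal illustration for two arms, not the tool's own code: the function name `srm_check` and its return shape are hypothetical, and the p-value uses the standard library's `math.erfc`, which equals 2·(1−Φ(√χ²)) for df=1.

```python
import math

def srm_check(observed, split):
    """Chi-square goodness-of-fit test of two arm counts against the configured split.

    observed: (control_count, variant_count)
    split:    configured weights, e.g. (50, 50) or (0.9, 0.1)
    """
    if len(observed) != 2 or len(split) != 2:
        raise ValueError("this sketch only handles two arms")
    total = sum(observed)
    weight_sum = sum(split)
    # Expected count per arm under the configured split.
    expected = [total * w / weight_sum for w in split]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # df = 1: p = 2 * (1 - Phi(sqrt(chi2))) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    # Severity bands and trust penalties as listed in the text.
    if p < 0.001:
        severity, penalty = "critical", 40
    elif p < 0.01:
        severity, penalty = "serious", 25
    elif p < 0.05:
        severity, penalty = "moderate", 12
    else:
        severity, penalty = "none", 0
    return {"chi2": chi2, "p": p, "severity": severity, "penalty": penalty}
```

For example, `srm_check((10000, 10400), (50, 50))` gives χ² ≈ 7.84 and p ≈ 0.005, which lands in the serious band. Running it once on assigned counts and once on tracked counts mirrors the Assignment SRM and Tracking SRM checks.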
Per arm, the tracking rate is tracked / assigned. The check compares the two rates: gap = |rate(control) − rate(variant)|, expressed in percentage points (pp).
Total loss doesn't matter for validity; the difference between arms does. Thresholds: >5pp high (−20), >2pp medium (−10), >1pp low (−5).
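As a rough sketch of the asymmetry check, assuming the gap is the absolute difference between the two tracking rates in percentage points (the function name `tracking_asymmetry` and the returned keys are hypothetical):

```python
def tracking_asymmetry(assigned_control, tracked_control, assigned_variant, tracked_variant):
    """Compare per-arm tracking rates and grade the gap between them."""
    rate_c = tracked_control / assigned_control
    rate_v = tracked_variant / assigned_variant
    # Gap in percentage points; total loss is ignored, only the difference counts.
    gap_pp = abs(rate_c - rate_v) * 100
    if gap_pp > 5:
        level, penalty = "high", 20
    elif gap_pp > 2:
        level, penalty = "medium", 10
    elif gap_pp > 1:
        level, penalty = "low", 5
    else:
        level, penalty = "none", 0
    return {"rate_control": rate_c, "rate_variant": rate_v,
            "gap_pp": gap_pp, "level": level, "penalty": penalty}
```

With 10,000 users assigned per arm and 8,000 versus 7,700 tracked, both arms lose at least 20% of their traffic, but only the 3pp difference between them is penalised (medium, −10).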
If you declare a consent rate c, the expected tracking rate per arm is ≈ c. If both arms track within 5pp of c, the loss is explained and no penalty applies (+10 restored). A residual gap beyond consent points to a deeper pipeline issue.
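The consent rule can be expressed as a single comparison. The sketch below only decides whether the loss is explained and leaves scoring out. It assumes "within 5pp" is inclusive, and the name `loss_explained_by_consent` is hypothetical.

```python
def loss_explained_by_consent(rate_control, rate_variant, consent_rate, tolerance_pp=5.0):
    """True when both arms' tracking rates sit within tolerance_pp of the declared consent rate.

    All rates are fractions between 0 and 1; the tolerance is in percentage points.
    """
    return all(
        abs(rate - consent_rate) * 100 <= tolerance_pp
        for rate in (rate_control, rate_variant)
    )
```

With a declared consent rate of 0.80, tracking rates of 0.78 and 0.77 count as explained, while 0.78 and 0.70 do not, because the second arm sits 10pp below what consent alone predicts.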
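For rate metrics, each arm's conversions/sessions are compared with a pooled z-test: z = (p₁ − p₂) / √(p̂·(1 − p̂)·(1/n₁ + 1/n₂)), where pᵢ is conversions / sessions in arm i, nᵢ is that arm's session count, and p̂ is the conversion rate of both arms combined.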
For average-value metrics (AOV, RPV) an approximate test is used with an estimated SD of 1.5× the pooled mean, since raw variances aren't available from summary data. Because you may check several metrics, the significance threshold is Bonferroni-adjusted: α = 0.05 / number of metrics. Without this, checking 5 metrics gives you a ~23% chance of a false alarm somewhere.
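A minimal sketch of the rate-metric test and the Bonferroni adjustment follows. The function names are hypothetical, the p-value is two-sided and computed with the standard library's `math.erfc`, and the family-wise figure assumes the metrics are independent.

```python
import math

def pooled_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test with a pooled standard error. Returns (z, two-sided p)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms converted 0% or 100%: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def bonferroni_alpha(num_metrics, alpha=0.05):
    """Per-metric significance threshold after Bonferroni adjustment."""
    return alpha / num_metrics

def familywise_false_alarm(num_metrics, alpha=0.05):
    """Chance of at least one false alarm across independent, unadjusted tests."""
    return 1 - (1 - alpha) ** num_metrics
```

For 500 conversions in 10,000 sessions against 560 in 10,000, this gives z ≈ −1.89 and p ≈ 0.058. `familywise_false_alarm(5)` returns about 0.226, the ~23% quoted above, and `bonferroni_alpha(5)` lowers the per-metric threshold to 0.01.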
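Daily counts per arm form a 2×k table (k = days). The test asks whether the split ratio is the same every day: χ² = Σ (observed − expected)² / expected over all 2k cells, where expected = day total × arm total / grand total, with df = k − 1.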
Instability (p < 0.05, −15) usually means a cache flush, deployment, or campaign hit one arm harder on specific days.
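The daily stability test can be sketched as a chi-square test of homogeneity on the 2×k table. This is an illustration under assumptions: the names `parse_counts`, `chi2_p_value` and `daily_stability` are hypothetical, every day is assumed to have some traffic, and the p-value for df > 1 uses the Wilson-Hilferty cube-root approximation with `math.erfc` for the normal tail.

```python
import math

def parse_counts(text):
    """Turn a comma-separated string such as '1000, 980, 1020' into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]

def chi2_p_value(chi2, df):
    """Upper-tail p-value: exact normal identity for df=1, Wilson-Hilferty otherwise."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    # Wilson-Hilferty: (chi2/df)^(1/3) is roughly normal with
    # mean 1 - 2/(9*df) and variance 2/(9*df).
    z = ((chi2 / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))

def daily_stability(control_daily, variant_daily):
    """Chi-square test that the control/variant ratio is the same on every day."""
    if len(control_daily) != len(variant_daily) or len(control_daily) < 2:
        raise ValueError("need the same number of days (at least 2) for both arms")
    total_c, total_v = sum(control_daily), sum(variant_daily)
    grand = total_c + total_v
    chi2 = 0.0
    for c, v in zip(control_daily, variant_daily):
        day_total = c + v
        # Expected cell counts if the overall split held on this day.
        exp_c = day_total * total_c / grand
        exp_v = day_total * total_v / grand
        chi2 += (c - exp_c) ** 2 / exp_c + (v - exp_v) ** 2 / exp_v
    df = len(control_daily) - 1
    return chi2, df, chi2_p_value(chi2, df)
```

For instance, `daily_stability(parse_counts("1000,1000,1300"), parse_counts("1000,1000,700"))` flags the third day: χ² ≈ 121 on 2 degrees of freedom, far below the 0.05 threshold.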
Starts at 100. Deductions above are cumulative; the coverage gap (platform vs analytics totals) is deliberately excluded from the score because coverage is a reporting concern, not a validity concern. Bands: ≥85 trustworthy, ≥65 caution, ≥40 unreliable, below 40 do not trust.
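Put together, the scoring described above might look like the sketch below. The function names are hypothetical, and clamping the score to the 0-100 range is an assumption that the text does not state.

```python
def trust_score(penalties):
    """Start at 100 and subtract every penalty that fired (given as positive numbers)."""
    score = 100 - sum(penalties)
    # Assumption: the score is clamped so it never leaves the 0-100 range.
    return max(0, min(100, score))

def trust_band(score):
    """Map a score to the bands listed in the text."""
    if score >= 85:
        return "trustworthy"
    if score >= 65:
        return "caution"
    if score >= 40:
        return "unreliable"
    return "do not trust"
```

A serious Tracking SRM (−25) combined with a medium tracking asymmetry (−10) leaves a score of 65, which sits on the lower edge of the caution band.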
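P-values are computed from the normal distribution for df=1 (p = 2·(1−Φ(√χ²)), an exact identity because a χ² variable with one degree of freedom is a squared standard normal) and via a Wilson-Hilferty transformation for higher df. Φ is the standard normal CDF, approximated with the Abramowitz & Stegun polynomial (max error ~1.5×10⁻⁷).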
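For reference, here is a sketch of the Φ approximation, assuming the polynomial meant is Abramowitz & Stegun formula 7.1.26 for the error function, whose stated maximum error of 1.5×10⁻⁷ matches the figure above. The function names are hypothetical.

```python
import math

def erf_as(x):
    """Error function via Abramowitz & Stegun 7.1.26 (max absolute error about 1.5e-7)."""
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    # Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
    poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
             - 0.284496736) * t + 0.254829592) * t
    return sign * (1.0 - poly * math.exp(-x * x))

def phi(x):
    """Standard normal CDF built on the approximated error function."""
    return 0.5 * (1.0 + erf_as(x / math.sqrt(2)))

def p_value_df1(chi2):
    """p = 2 * (1 - Phi(sqrt(chi2))) for a chi-square statistic with df = 1."""
    return 2.0 * (1.0 - phi(math.sqrt(chi2)))
```

`p_value_df1(3.841)` returns approximately 0.05, the familiar critical value for df=1. Because the error is absolute, p-values much smaller than about 10⁻⁷ are not resolved accurately, which does not affect the 0.001, 0.01 and 0.05 thresholds used here.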