01 — Platform Calibration

Platform Validator

// why validate the platform?

Every conclusion you draw from an A/B test rests on one assumption: that the platform split your visitors randomly and counted them correctly. Nobody checks. An uncalibrated platform doesn't fail loudly. It hands you clean-looking numbers that are quietly wrong, and you ship decisions on them for years.

This is not a hypothetical. Microsoft audited its own experimentation platform and found that roughly 6% of experiments showed a sample ratio mismatch: the assignment split didn't match the configuration, which corrupts the test regardless of what the results say. That's about one broken test in every seventeen, at a company with a dedicated experimentation team. Any lab scientist calibrates the instrument before trusting the measurement. Your testing platform is the instrument.

One note on expectations: A/B platforms were never built to mirror GA4 or Adobe Analytics, so some gap between their numbers is normal. This tool doesn't check whether the numbers match perfectly. It checks whether the gaps that exist could compromise the validity of your test. Is the randomisation sound? Is tracking loss symmetric across groups? And is your coverage still a representative slice of your audience, or are you effectively testing on a self-selected minority?

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

ARTHUR CONAN DOYLE — Sherlock Holmes, A Scandal in Bohemia

01 — Platform & Configuration
Platform: Adobe Target (AT), Optimizely (OPT), VWO, AB Tasty (ABT), Dynamic Yield (DY), Convert (CVT), Kameleoon (KAM), Other / Custom
Configured split: Control %, Variant %
02 — Traffic Assignment Check / Sample Ratio Mismatch
Coverage Check — Analytics vs Platform

This checks coverage, not validity. A gap between your platform and analytics totals is normal — consent rejection, ad-blockers, and tag timing all cause it. What matters for test validity is whether that gap is symmetric across Control and Variant, which is checked separately below. Use these numbers to understand reporting alignment, not to judge whether the test is trustworthy.

Enter the analytics totals above + platform numbers below to see the gap analysis.
▲ Control (A)
▲ Variant (B)
Visitors
Sessions
analytics (opt.)
Orders
Revenue
03 — Metric Balance Check

Enter your experiment numbers below. You want to see no significant difference between control and variant: that means your platform is assigning users fairly. A significant result here is a red flag. It points to a biased platform, not a winning variant.

04 — Temporal Distribution / optional

Paste day-by-day tracked session counts for each group, comma-separated. Checks whether assignment was stable over time — a common sign of platform issues is one group getting disproportionately more traffic on specific days.

Platform Trust Score
/ 100
Enter traffic data to generate a trust score.
Diagnostic Checklist
Assignment SRM
Tracking discrepancy
Consent impact
Metric balance
Temporal stability
SRM Severity
Assignment imbalance
Tracking loss
// the math — every calculation this tool runs, in full

Sample Ratio Mismatch — chi-squared goodness-of-fit test

Given your configured split (e.g. 50/50) and the observed counts, we compute the expected count per arm and test whether the deviation is larger than randomness allows:

χ² = (obs_A − exp_A)² / exp_A + (obs_B − exp_B)² / exp_B  ·  df = 1

The p-value is the probability of seeing a deviation at least this large if the platform were really splitting traffic at the configured ratio. Severity: p < 0.001 critical (−40 trust), p < 0.01 serious (−25), p < 0.05 moderate (−12). The same test runs twice: on assigned counts (Assignment SRM, a randomization failure) and on tracked counts (Tracking SRM, a pipeline failure). The distinction matters because the fixes are different.
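The test above can be sketched in a few lines of Python using only the standard library. The function name `srm_check` and its return shape are illustrative assumptions, not the tool's actual code; the severity thresholds and penalties are the ones listed above.

```python
import math

def srm_check(obs_a, obs_b, split_a=0.5):
    """Chi-squared goodness-of-fit test (df = 1) for a two-arm split.

    split_a is the configured share of traffic for arm A (0.5 for 50/50).
    Returns (chi2, p_value, severity, penalty). Hypothetical helper.
    """
    total = obs_a + obs_b
    exp_a = total * split_a
    exp_b = total * (1.0 - split_a)
    chi2 = (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    # With df = 1, chi2 is the square of a standard normal variable, so
    # p = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    if p_value < 0.001:
        severity, penalty = "critical", 40
    elif p_value < 0.01:
        severity, penalty = "serious", 25
    elif p_value < 0.05:
        severity, penalty = "moderate", 12
    else:
        severity, penalty = "none", 0
    return chi2, p_value, severity, penalty

# Example with made-up counts: 10,000 vs 10,400 visitors on a 50/50 split.
chi2, p_value, severity, penalty = srm_check(10_000, 10_400)
print(f"chi2={chi2:.3f} p={p_value:.4f} severity={severity} penalty={penalty}")
```

A 2% imbalance looks harmless on a dashboard, yet with 20,400 visitors it already lands in the "serious" band. That is why the check is run on counts, not on percentages.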

Asymmetric tracking loss

Per arm, the tracking rate is tracked / assigned. The check compares the two rates:

gap = |tracked_A/assigned_A − tracked_B/assigned_B|  ·  in percentage points

Total loss doesn't matter for validity; the difference between arms does. Thresholds: >5pp high (−20), >2pp medium (−10), >1pp low (−5).
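As a rough Python sketch of this check (the name `tracking_gap` and the returned tuple are illustrative assumptions; the thresholds are the ones quoted above):

```python
def tracking_gap(assigned_a, tracked_a, assigned_b, tracked_b):
    """Compare per-arm tracking rates and grade the gap in percentage points.

    Returns (gap_pp, level, penalty). Hypothetical helper.
    """
    rate_a = tracked_a / assigned_a
    rate_b = tracked_b / assigned_b
    gap_pp = abs(rate_a - rate_b) * 100.0
    if gap_pp > 5:
        level, penalty = "high", 20
    elif gap_pp > 2:
        level, penalty = "medium", 10
    elif gap_pp > 1:
        level, penalty = "low", 5
    else:
        level, penalty = "none", 0
    return gap_pp, level, penalty

# Made-up example: both arms lose traffic, but B loses 3pp more than A.
print(tracking_gap(10_000, 9_000, 10_000, 8_700))
# Made-up example: both arms lose about 20%, almost symmetrically.
print(tracking_gap(10_000, 8_000, 10_000, 8_050))
```

The second call illustrates the point of the paragraph above: a 20% total loss draws no penalty as long as both arms lose traffic at nearly the same rate.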

Consent model

If you declare a consent rate c, the expected tracking rate per arm is ≈ c. If both arms track within 5pp of c, the loss is explained and no penalty applies (+10 restored). A residual gap beyond consent points to a deeper pipeline issue.
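One way to express the consent rule in Python is shown below. It is a sketch under assumptions: the function name `consent_explains_loss` is hypothetical, "within 5pp" is read as an inclusive bound, and the way the tool adjusts the score afterwards is not modelled.

```python
def consent_explains_loss(consent_rate, assigned_a, tracked_a,
                          assigned_b, tracked_b, tolerance_pp=5.0):
    """Return True when both arms track within tolerance_pp of the consent rate.

    consent_rate is a fraction (0.70 for 70%). Hypothetical helper; the
    inclusive comparison is an assumption, not taken from the tool.
    """
    rate_a = tracked_a / assigned_a
    rate_b = tracked_b / assigned_b
    within_a = abs(rate_a - consent_rate) * 100.0 <= tolerance_pp
    within_b = abs(rate_b - consent_rate) * 100.0 <= tolerance_pp
    return within_a and within_b

# Made-up example: 70% consent, arms tracking at 71% and 68%.
print(consent_explains_loss(0.70, 10_000, 7_100, 10_000, 6_800))
# Made-up example: arm B tracks at 60%, ten points below the consent rate.
print(consent_explains_loss(0.70, 10_000, 7_100, 10_000, 6_000))
```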

Metric balance — two-proportion z-test with Bonferroni correction

For rate metrics, each arm's conversions/sessions are compared with a pooled z-test:

z = (p_B − p_A) / √( p̂(1−p̂)(1/n_A + 1/n_B) )  ·  p̂ = pooled rate

For average-value metrics (AOV, RPV) an approximate test is used with an estimated SD of 1.5× the pooled mean, since raw variances aren't available from summary data. Because you may check several metrics, the significance threshold is Bonferroni-adjusted: α = 0.05 / number of metrics. Without this, checking 5 independent metrics at α = 0.05 each gives you a ~23% chance of a false alarm somewhere (1 − 0.95⁵ ≈ 0.226).
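The rate-metric test and the Bonferroni threshold can be sketched in Python as follows. The names `two_proportion_z` and `balance_flags` are illustrative assumptions, the counts are made up, and the approximate test for average-value metrics is not covered here.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value).

    Hypothetical helper; assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

def balance_flags(metrics, alpha=0.05):
    """Flag each metric whose p-value falls below alpha / number of metrics."""
    threshold = alpha / len(metrics)
    return {name: two_proportion_z(*counts)[1] < threshold
            for name, counts in metrics.items()}

# Each entry: (conversions_A, sessions_A, conversions_B, sessions_B), made up.
metrics = {
    "conversion": (500, 10_000, 600, 10_000),
    "add_to_cart": (1_200, 10_000, 1_210, 10_000),
}
print(balance_flags(metrics))

# Family-wise false-alarm rate for 5 independent metrics at alpha = 0.05 each.
print(round(1 - 0.95 ** 5, 3))
```

With two metrics the per-metric threshold is 0.025, so a metric has to clear a stricter bar than the usual 0.05 before it counts as imbalanced.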

Temporal stability — chi-squared homogeneity test

Daily counts per arm form a 2×k table (k = days). The test asks whether the split ratio is the same every day:

χ² = Σ (obs − exp)² / exp  over all cells  ·  df = k − 1

Instability (p < 0.05, −15) usually means a cache flush, deployment, or campaign hit one arm harder on specific days.
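A possible Python sketch of the homogeneity test is given below. The function names are hypothetical, the daily counts are made up, and the p-value uses the Wilson-Hilferty approximation with the standard library's `math.erfc` for the normal tail, which may differ from the tool's own numerics.

```python
import math

def chi2_homogeneity(daily_a, daily_b):
    """Chi-squared statistic for a 2 x k table of daily counts.

    Returns (chi2, df) with df = k - 1. Hypothetical helper.
    """
    if len(daily_a) != len(daily_b) or len(daily_a) < 2:
        raise ValueError("need the same number of days (at least 2) per arm")
    total_a, total_b = sum(daily_a), sum(daily_b)
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in zip(daily_a, daily_b):
        day_total = a + b
        for obs, arm_total in ((a, total_a), (b, total_b)):
            # Expected cell count if the split ratio were identical every day.
            exp = day_total * arm_total / grand
            chi2 += (obs - exp) ** 2 / exp
    return chi2, len(daily_a) - 1

def chi2_upper_tail(chi2, df):
    """Approximate P(X > chi2) via the Wilson-Hilferty cube-root transform."""
    v = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - v)) / math.sqrt(v)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

stable_a = [1000, 1020, 980, 1010, 990]
stable_b = [1005, 1015, 985, 1000, 995]
spiky_b = [1005, 1015, 985, 1300, 995]  # arm B gets a burst on day 4

for label, b in (("stable", stable_b), ("spiky", spiky_b)):
    chi2, df = chi2_homogeneity(stable_a, b)
    print(label, round(chi2, 2), df, chi2_upper_tail(chi2, df) < 0.05)
```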

Trust score

Starts at 100. Deductions above are cumulative; the coverage gap (platform vs analytics totals) is deliberately excluded from the score because coverage is a reporting concern, not a validity concern. Bands: ≥85 trustworthy, ≥65 caution, ≥40 unreliable, below 40 do not trust.
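The scoring rule can be sketched in Python as below. The name `trust_score` is hypothetical, and clamping the result at 0 is an assumption: the text above does not say what happens when the deductions add up to more than 100.

```python
def trust_score(deductions):
    """Subtract cumulative deductions from 100 and map the result to a band.

    Returns (score, band). Hypothetical helper; the floor at 0 is an assumption.
    """
    score = max(0, 100 - sum(deductions))
    if score >= 85:
        band = "trustworthy"
    elif score >= 65:
        band = "caution"
    elif score >= 40:
        band = "unreliable"
    else:
        band = "do not trust"
    return score, band

# Made-up example: a serious SRM (-25) plus a medium tracking gap (-10).
print(trust_score([25, 10]))
```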

Chi-squared p-values

For df = 1 the p-value comes from the exact relation between the chi-squared and normal distributions, p = 2·(1−Φ(√χ²)). For higher df it uses the Wilson-Hilferty transformation, which is an approximation. Φ is the standard normal CDF, itself approximated with the Abramowitz & Stegun polynomial (max error ~1.5×10⁻⁷).
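A minimal Python sketch of this pipeline follows. It assumes the polynomial in question is Abramowitz & Stegun formula 7.1.26 for erf, whose stated maximum error matches the figure quoted above. The function names `phi` and `chi2_p_value` are hypothetical.

```python
import math

# Coefficients of Abramowitz & Stegun formula 7.1.26 (assumed to be the
# polynomial the text refers to; its stated max error is 1.5e-7 on erf).
_P = 0.3275911
_A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)

def phi(x):
    """Standard normal CDF via the A&S 7.1.26 approximation of erf."""
    sign = 1.0 if x >= 0 else -1.0
    u = abs(x) / math.sqrt(2.0)
    t = 1.0 / (1.0 + _P * u)
    # Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5.
    poly = t * (_A[0] + t * (_A[1] + t * (_A[2] + t * (_A[3] + t * _A[4]))))
    erf_u = 1.0 - poly * math.exp(-u * u)
    return 0.5 * (1.0 + sign * erf_u)

def chi2_p_value(chi2, df):
    """Upper-tail p-value: exact normal relation for df = 1, Wilson-Hilferty otherwise."""
    if df == 1:
        return 2.0 * (1.0 - phi(math.sqrt(chi2)))
    v = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - v)) / math.sqrt(v)
    return 1.0 - phi(z)

# 3.841 and 9.488 are the textbook 5% critical values for df = 1 and df = 4.
print(chi2_p_value(3.841459, 1))
print(chi2_p_value(9.487729, 4))
```

The df = 1 result sits within the polynomial's error of 0.05, while the df = 4 result is slightly off because Wilson-Hilferty is itself an approximation. Both are accurate enough for the threshold bands used in the score.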