Platform Validator

// why validate the platform?

Every conclusion you draw from an A/B test rests on one assumption: that the platform split your visitors randomly and counted them correctly. Nobody checks. An uncalibrated platform doesn't fail loudly. It hands you clean-looking numbers that are quietly wrong, and you ship decisions on them for years.

This is not a hypothetical. Microsoft audited its own experimentation platform and found that roughly 6% of experiments showed a sample ratio mismatch: the assignment split didn't match the configuration, which corrupts the test regardless of what the results say. That's roughly one experiment in seventeen, at a company with a dedicated experimentation team. Any lab scientist calibrates the instrument before trusting the measurement. Your testing platform is the instrument.

One note on expectations: A/B platforms were never built to mirror GA4 or Adobe Analytics, so some gap between their numbers is normal. This tool doesn't check whether the numbers match perfectly. It checks whether the gaps that exist could compromise the validity of your test. Is the randomisation sound? Is tracking loss symmetric across groups? And is your coverage still a representative slice of your audience, or are you effectively testing on a self-selected minority?

"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."

ARTHUR CONAN DOYLE — Sherlock Holmes, A Scandal in Bohemia

01 — Platform & Configuration

Testing Platform

Adobe Target

Optimizely

VWO

AB Tasty

Dynamic Yield

Convert

Kameleoon

Other / Custom

Expected Traffic Split (Control %)

Control

Variant

Test Type

Cookie Consent Rate (optional)

When does the platform assign visitors?

Test Duration (days)

02 — Traffic Assignment Check / Sample Ratio Mismatch

Coverage Check — Analytics vs Platform

This checks coverage, not validity. A gap between your platform and analytics totals is normal — consent rejection, ad-blockers, and tag timing all cause it. What matters for test validity is whether that gap is symmetric across Control and Variant, which is checked separately below. Use these numbers to understand reporting alignment, not to judge whether the test is trustworthy.

Sessions / visitors from analytics

Orders / transactions from analytics

Revenue from analytics

Date range must match platform

Enter the analytics totals above + platform numbers below to see the gap analysis.

▲ Control (A)

▲ Variant (B)

Visitors

Sessions
analytics (opt.)

Orders

Revenue

03 — Metric Balance Check

Enter your experiment numbers below. You want to see no significant difference between control and variant: that means your platform is assigning users fairly. A significant result here is a red flag. It points to a biased platform, not a winning variant.

04 — Temporal Distribution / optional

Paste day-by-day tracked session counts for each group, comma-separated. Checks whether assignment was stable over time — a common sign of platform issues is one group getting disproportionately more traffic on specific days.

Control — daily sessions (comma-separated)

Variant — daily sessions (comma-separated)

Platform Trust Score

— / 100

Awaiting Data

Enter traffic data to generate a trust score.

Diagnostic Checklist

–

Assignment SRM

–

Tracking discrepancy

–

Consent impact

–

Metric balance

–

Temporal stability

SRM Severity

Assignment imbalance

No data

Tracking loss

No data

// the math — every calculation this tool runs, in full

Sample Ratio Mismatch — chi-squared goodness-of-fit test

Given your configured split (e.g. 50/50) and the observed counts, we compute the expected count per arm and test whether the deviation is larger than randomness allows:

χ² = (obs_A − exp_A)² / exp_A + (obs_B − exp_B)² / exp_B, with df = 1

The p-value is the probability of seeing a deviation this large if the split were truly correct. Severity: p < 0.001 critical (−40 trust), p < 0.01 serious (−25), p < 0.05 moderate (−12). The same test runs twice: on assigned counts (Assignment SRM, a randomization failure) and on tracked counts (Tracking SRM, a pipeline failure). The distinction matters because the fixes are different.
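As a rough illustration of the test above, here is a minimal Python sketch. The function names `srm_check` and `srm_severity` are hypothetical, not the tool's own code, and the p-value is taken from the standard library's `math.erfc` rather than the polynomial approximation the tool describes.

```python
import math


def srm_check(obs_a, obs_b, control_share=0.5):
    """Chi-squared goodness-of-fit test for a two-arm split (df = 1).

    control_share is the configured share of traffic for Control,
    e.g. 0.5 for a 50/50 split. Returns (chi2, p_value).
    """
    total = obs_a + obs_b
    exp_a = total * control_share
    exp_b = total * (1.0 - control_share)
    chi2 = (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    # With df = 1, chi2 is a squared standard normal, so
    # p = 2 * (1 - Phi(sqrt(chi2))) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def srm_severity(p_value):
    """Map a p-value to the severity label and trust deduction listed above."""
    if p_value < 0.001:
        return "critical", -40
    if p_value < 0.01:
        return "serious", -25
    if p_value < 0.05:
        return "moderate", -12
    return "ok", 0
```

For example, 5,200 vs 4,800 visitors on a configured 50/50 split gives χ² = 16, which falls in the critical band. The same function can be called once with assigned counts and once with tracked counts.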

Asymmetric tracking loss

Per arm, the tracking rate is tracked / assigned. The check compares the two rates:

gap = |tracked_A/assigned_A − tracked_B/assigned_B|, expressed in percentage points

Total loss doesn't matter for validity; the difference between arms does. Thresholds: >5pp high (−20), >2pp medium (−10), >1pp low (−5).
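The comparison can be sketched in a few lines of Python. The names `tracking_gap_pp` and `tracking_penalty` are illustrative assumptions, and the sketch assumes both arms have at least one assigned visitor.

```python
def tracking_gap_pp(tracked_a, assigned_a, tracked_b, assigned_b):
    """Absolute difference between the two arms' tracking rates, in percentage points."""
    rate_a = tracked_a / assigned_a
    rate_b = tracked_b / assigned_b
    return abs(rate_a - rate_b) * 100.0


def tracking_penalty(gap_pp):
    """Map the gap to the severity label and trust deduction listed above."""
    if gap_pp > 5:
        return "high", -20
    if gap_pp > 2:
        return "medium", -10
    if gap_pp > 1:
        return "low", -5
    return "ok", 0
```

With 9,000 of 10,000 visitors tracked in Control and 8,700 of 10,000 in Variant, the rates are 90% and 87%, a gap of about 3pp, which lands in the medium band. Both arms losing 10% would give a gap of zero and no deduction.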

Consent model

If you declare a consent rate c, the expected tracking rate per arm is ≈ c. If both arms track within 5pp of c, the loss is explained and no penalty applies (+10 restored). A residual gap beyond consent points to a deeper pipeline issue.
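A minimal sketch of the consent check, under two assumptions that the text does not settle: "within 5pp" is treated as inclusive, and all rates are passed as fractions between 0 and 1. The function name is hypothetical, and the score adjustment itself is not modelled here.

```python
def consent_explains_loss(rate_a, rate_b, consent_rate, tolerance_pp=5.0):
    """Return True if both arms' tracking rates sit within tolerance_pp of the consent rate."""
    return all(
        abs(rate - consent_rate) * 100.0 <= tolerance_pp
        for rate in (rate_a, rate_b)
    )
```

With a declared consent rate of 70%, arms tracking at 72% and 70% are explained by consent. Arms tracking at 72% and 60% are not, because the second arm sits 10pp below the declared rate.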

Metric balance — two-proportion z-test with Bonferroni correction

For rate metrics, each arm's conversions/sessions are compared with a pooled z-test:

z = (p_B − p_A) / √( p̂(1−p̂)(1/n_A + 1/n_B) ), where p̂ is the pooled rate

For average-value metrics (AOV, RPV) an approximate test is used with an estimated SD of 1.5× the pooled mean, since raw variances aren't available from summary data. Because you may check several metrics, the significance threshold is Bonferroni-adjusted: α = 0.05 / number of metrics. Without this, checking 5 metrics gives you a ~23% chance of a false alarm somewhere.
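The rate-metric test and the Bonferroni adjustment can be sketched as follows. The function names are illustrative, the p-value uses `math.erfc` from the standard library, and the false-alarm calculation assumes the metrics are independent, which is what the ~23% figure for 5 metrics implies. The approximate test for average-value metrics is not covered.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value).

    Assumes the pooled rate is strictly between 0 and 1, so the
    standard error is non-zero.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def bonferroni_alpha(num_metrics, alpha=0.05):
    """Per-metric significance threshold after Bonferroni correction."""
    return alpha / num_metrics


def family_wise_error(num_metrics, alpha=0.05):
    """Chance of at least one false alarm across independent, uncorrected tests."""
    return 1.0 - (1.0 - alpha) ** num_metrics
```

A metric is flagged only when its p-value falls below `bonferroni_alpha(num_metrics)`. With 5 metrics that threshold is 0.01, and `family_wise_error(5)` evaluates to about 0.226, the uncorrected false-alarm risk quoted above.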

Temporal stability — chi-squared homogeneity test

Daily counts per arm form a 2×k table (k = days). The test asks whether the split ratio is the same every day:

χ² = Σ (obs − exp)² / exp over all cells, with df = k − 1

Instability (p < 0.05, −15) usually means a cache flush, deployment, or campaign hit one arm harder on specific days.
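A minimal Python sketch of the homogeneity statistic, starting from the comma-separated input format. The names `parse_daily` and `homogeneity_chi2` are hypothetical. The sketch assumes every day has traffic and both arms have a non-zero total, and it returns only the statistic and degrees of freedom, leaving the p-value conversion aside.

```python
def parse_daily(text):
    """Parse a comma-separated string of daily counts into a list of floats."""
    return [float(part) for part in text.split(",") if part.strip()]


def homogeneity_chi2(control_daily, variant_daily):
    """Chi-squared homogeneity test on a 2 x k table of daily counts.

    Returns (chi2, df) with df = k - 1.
    """
    if len(control_daily) != len(variant_daily):
        raise ValueError("both arms need the same number of days")
    total_a = sum(control_daily)
    total_b = sum(variant_daily)
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in zip(control_daily, variant_daily):
        day_total = a + b
        # Expected count = day total x the arm's overall share of traffic.
        exp_a = day_total * total_a / grand
        exp_b = day_total * total_b / grand
        chi2 += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return chi2, len(control_daily) - 1
```

If the split is identical every day, every observed count equals its expected count and the statistic is zero. A single day where one arm receives a burst of extra traffic pushes the statistic up sharply.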

Trust score

Starts at 100. Deductions above are cumulative; the coverage gap (platform vs analytics totals) is deliberately excluded from the score because coverage is a reporting concern, not a validity concern. Bands: ≥85 trustworthy, ≥65 caution, ≥40 unreliable, below 40 do not trust.
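The scoring rule can be sketched as below. The function names are illustrative, and clamping the result to the 0 to 100 range is an assumption, since the text does not say what happens when cumulative deductions exceed 100.

```python
def trust_score(deductions):
    """Start at 100 and apply cumulative adjustments (deductions are negative numbers)."""
    score = 100 + sum(deductions)
    # Assumption: keep the score inside the displayed 0-100 range.
    return max(0, min(100, score))


def trust_band(score):
    """Map a score to the bands listed above."""
    if score >= 85:
        return "trustworthy"
    if score >= 65:
        return "caution"
    if score >= 40:
        return "unreliable"
    return "do not trust"
```

A moderate SRM (−12) combined with a medium tracking gap (−10) gives `trust_score([-12, -10])`, which is 78 and falls in the caution band.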

Chi-squared p-values

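For df = 1 the p-value follows from the normal distribution, p = 2·(1−Φ(√χ²)), which is an exact identity because a chi-squared variable with one degree of freedom is a squared standard normal. Higher df use the Wilson-Hilferty approximation. Φ is the standard normal CDF, approximated with the Abramowitz & Stegun polynomial (max error ~1.5×10⁻⁷).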
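One way this could look in Python is sketched below. It assumes the polynomial in question is Abramowitz & Stegun formula 7.1.26 for erf, whose stated maximum error of 1.5×10⁻⁷ matches the figure quoted here, and it uses the usual Wilson-Hilferty cube-root form. The function names are hypothetical.

```python
import math


def erf_as(x):
    """erf(x) via the Abramowitz & Stegun 7.1.26 polynomial (assumed to be the one used)."""
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
               + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * math.exp(-x * x))


def phi(x):
    """Standard normal CDF built on the erf approximation."""
    return 0.5 * (1.0 + erf_as(x / math.sqrt(2.0)))


def chi2_p_value(chi2, df):
    """Upper-tail p-value for a chi-squared statistic."""
    if chi2 <= 0:
        return 1.0
    if df == 1:
        # Chi-squared with df = 1 is a squared standard normal.
        return 2.0 * (1.0 - phi(math.sqrt(chi2)))
    # Wilson-Hilferty: (X / k)^(1/3) is roughly normal with
    # mean 1 - 2/(9k) and variance 2/(9k).
    k = float(df)
    z = ((chi2 / k) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * k))) / math.sqrt(2.0 / (9.0 * k))
    return 1.0 - phi(z)
```

As a sanity check, the familiar 5% critical values (3.841 for df = 1, 18.307 for df = 10) should come back with p-values close to 0.05. The Wilson-Hilferty step is an approximation and is least accurate at small df, so values for df = 2 or 3 can be off in the third decimal place.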