In Part 1, we defined the mechanics of an A/B test. Now we answer the hard question: when is a difference convincing?
P-Value in Plain Language
A p-value answers:
If the null hypothesis were true, how likely would we be to see a difference at least as extreme as the one observed?
Small p-values indicate that the observed difference is unlikely under $H_0$.
Common rule, using the conventional threshold $\alpha = 0.05$:
- If $p < 0.05$, reject $H_0$
- If $p \ge 0.05$, do not reject $H_0$
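Important: the p-value is not the probability that $H_0$ is true. It is computed on the assumption that $H_0$ holds, so it cannot also tell you how likely $H_0$ is.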
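To make the definition concrete, here is a minimal sketch of a two-sided p-value for a conversion-rate test. It assumes a pooled two-proportion z-test with a normal approximation, which is one common choice and not necessarily the test used in Part 1. The function name and the counts are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the common rate if H0 is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability, under H0, of a |z| at least as large as the one observed.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts (not real data): 5.0% vs 5.6% conversion.
z, p = two_proportion_p_value(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts the p-value lands slightly above 0.05, so the common rule would not reject $H_0$ even though the observed uplift looks meaningful.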
Confidence Intervals Add Magnitude
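A 95% confidence interval for uplift gives a range of plausible values for the true effect: if you repeated the experiment many times, about 95% of the intervals built this way would contain it.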
Interpretation pattern:
- Interval fully above 0: likely positive effect
- Interval includes 0: inconclusive
- Interval fully below 0: likely negative effect
Confidence intervals are decision-friendly because they show both whether the effect is distinguishable from zero and how large it plausibly is.
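The interpretation pattern above can be sketched in a few lines. This assumes a Wald (normal-approximation) interval for the absolute difference in conversion rates with an unpooled standard error. The function names and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def uplift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute uplift p_b - p_a."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error: each arm keeps its own variance estimate.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z_crit * se, diff + z_crit * se


def interpret(low, high):
    """Map an interval onto the three cases listed above."""
    if low > 0:
        return "likely positive effect"
    if high < 0:
        return "likely negative effect"
    return "inconclusive"


# Illustrative counts (not real data).
low, high = uplift_confidence_interval(500, 10_000, 560, 10_000)
print(f"95% CI for uplift: [{low:.4f}, {high:.4f}] -> {interpret(low, high)}")
```

Reporting the interval rather than only a verdict lets a reader see how close an inconclusive result is to zero on either side.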
Type I and Type II Errors
Every test design is a trade-off.
- Type I error ($\alpha$): false positive, detecting an effect that is not real
- Type II error ($\beta$): false negative, missing a real effect
- Power: $1 - \beta$, probability of detecting a true effect of a given size
At a fixed sample size and effect size, lowering $\alpha$ raises $\beta$, and vice versa. Reducing both at once generally requires a larger sample.
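A rough calculation illustrates the trade-off. The sketch below approximates the power of a two-sided two-proportion z-test with a normal approximation and ignores the negligible chance of rejecting in the wrong direction. The rates, sample sizes, and function name are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) to detect p_b vs p_a with equal group sizes."""
    se = sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance that the observed z exceeds the critical value when the effect is real.
    return NormalDist().cdf(abs(p_b - p_a) / se - z_crit)


# Same true effect (5.0% -> 5.6%), different alpha and sample size.
for n, alpha in [(10_000, 0.05), (10_000, 0.01), (40_000, 0.01)]:
    power = approx_power(0.05, 0.056, n, alpha)
    print(f"n = {n:>6}, alpha = {alpha:.2f} -> power = {power:.2f}, beta = {1 - power:.2f}")
```

Tightening $\alpha$ from 0.05 to 0.01 at the same sample size lowers power, while quadrupling the sample more than restores it.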
One-Tailed vs Two-Tailed
Two-tailed tests are usually safer in product work because they detect both improvement and harm.
Use one-tailed tests only when:
- You pre-register the directional hypothesis
- Harm in the opposite direction is not operationally relevant
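The difference between the two choices is easy to see from a z statistic. This is a minimal sketch assuming a standard normal test statistic, where positive z means treatment outperformed control; the function name is hypothetical.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (two-tailed p, one-tailed p for H1: treatment is better)."""
    std_normal = NormalDist()
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    # The upper-tail test only counts evidence in the pre-registered direction.
    one_tailed_upper = 1 - std_normal.cdf(z)
    return two_tailed, one_tailed_upper


for z in (1.8, -1.8):
    two, one = p_values_from_z(z)
    print(f"z = {z:+.1f}: two-tailed p = {two:.3f}, one-tailed p = {one:.3f}")
```

For a positive z the one-tailed p-value is half the two-tailed one, which is why the directional test is more sensitive. For an equally large negative z the one-tailed p-value is close to 1, so the same test gives no signal that the treatment is harmful.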
Common Misinterpretations
- “p = 0.03 means 97% chance treatment wins” (incorrect)
- “Non-significant means no effect” (incorrect; the test could be underpowered)
- “Significant means large impact” (incorrect; the effect could be tiny)
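The last two misreadings can be illustrated with two made-up experiments. The sketch assumes a pooled two-proportion z-test; the helper name and all counts are hypothetical and chosen only to show how sample size drives significance.

```python
from math import sqrt
from statistics import NormalDist


def uplift_and_p(conv_a, n_a, conv_b, n_b):
    """Return (absolute uplift, two-sided p-value) from a pooled z-test."""
    uplift = conv_b / n_b - conv_a / n_a
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(uplift / se)))
    return uplift, p_value


# Huge sample, tiny effect: 5.00% vs 5.05% with 5 million users per arm.
tiny_uplift, tiny_p = uplift_and_p(250_000, 5_000_000, 252_500, 5_000_000)
# Small sample, large observed effect: 5% vs 6% with 1,000 users per arm.
big_uplift, big_p = uplift_and_p(50, 1_000, 60, 1_000)

print(f"tiny effect, huge n:  uplift = {tiny_uplift:.4f}, p = {tiny_p:.4f}")
print(f"large effect, small n: uplift = {big_uplift:.4f}, p = {big_p:.4f}")
```

The first experiment is statistically significant despite an uplift of 0.05 percentage points. The second is not significant despite an observed uplift of a full percentage point, because the sample is too small to separate it from noise.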
Practical Decision Rule
For most product experiments:
- Require data quality checks to pass
- Use a two-tailed test at a predefined $\alpha$
- Evaluate the 95% confidence interval for practical significance
- Decide based on impact, not only significance
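One way to encode this checklist is sketched below. It is an illustration under stated assumptions, not a definitive policy: `data_quality_ok` stands in for whatever checks your pipeline runs, `min_practical_uplift` is a hypothetical business threshold for the smallest absolute uplift worth shipping, and the interval is a Wald interval at confidence $1 - \alpha$.

```python
from math import sqrt
from statistics import NormalDist


def decide(conv_a, n_a, conv_b, n_b, data_quality_ok, min_practical_uplift, alpha=0.05):
    """Combine data quality, a two-sided interval, and practical impact."""
    if not data_quality_ok:
        return "invalid"  # do not read results until checks pass

    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    low = (p_b - p_a) - z_crit * se
    high = (p_b - p_a) + z_crit * se

    if high < 0:
        return "do not ship"  # interval fully below 0
    if low >= min_practical_uplift:
        return "ship"  # even the pessimistic end clears the threshold
    if low > 0:
        return "significant but possibly too small"
    return "inconclusive"  # interval includes 0


# Illustrative counts and threshold (not real data).
print(decide(500, 10_000, 600, 10_000, data_quality_ok=True, min_practical_uplift=0.002))
```

Comparing the lower bound to the threshold is a deliberately conservative choice; a team could instead compare the point estimate, at the cost of shipping more changes whose true effect is below the threshold.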
Key Takeaways
- P-values quantify surprise under the null
- Confidence intervals quantify plausible effect size
- Sound decisions require both statistical and practical significance
In Part 3, you will learn how to choose sample size and power before launching a test.