Many teams run experiments that cannot detect meaningful effects. This is a planning failure, not an analysis failure.

What Is Statistical Power?

Power is the probability of detecting a real effect of a specific size:

\[\text{Power} = 1 - \beta\]

Typical target: 80% or 90%.

Low power means you can easily miss wins and wrongly conclude “no effect”.

Inputs for Sample Size

To calculate required sample size per variant, specify:

  • Baseline conversion rate
  • Minimum detectable effect (MDE)
  • Significance level $\alpha$
  • Desired power

Smaller MDE requires much larger sample.

MDE: The Product Lens

MDE is the smallest effect worth detecting.

Set MDE based on business value, not wishful thinking.

  • If MDE is too small, tests become impractically long
  • If MDE is too large, you may miss meaningful improvements

A good MDE balances decision value and experiment cost.

Duration Planning

If required sample is 20,000 users total and traffic is 1,000 users/day, run at least 20 days.

Do not stop early just because interim results look promising.

Why Underpowered Tests Waste Time

Underpowered tests increase:

  • False negatives
  • Unstable conclusions
  • Team disagreement and retest cycles

They often produce inconclusive outcomes that still consume engineering and opportunity cost.

Practical Planning Checklist

  • Define primary metric and MDE
  • Compute sample size before launch
  • Estimate duration from traffic realistically
  • Freeze analysis plan (including stopping rule)
  • Share assumptions with stakeholders

Key Takeaways

  • Power planning is mandatory for credible experimentation
  • MDE should reflect real business decisions
  • No sample-size plan usually means unreliable outcomes

In Part 4, we move from theory to common failure modes that invalidate otherwise well-planned tests.