This lesson covers the mistakes that often look harmless in the moment but can quietly destroy decision quality.

Twyman’s Law: Too Good to Be True

When a result looks extraordinary, suspect instrumentation before celebrating.

Signals:

  • Unrealistically large uplift
  • Extremely tiny p-values with weak business rationale
  • Sudden jumps inconsistent with historical variance

Action:

  • Audit event definitions and pipelines
  • Validate segmentation and exclusion logic
  • Replicate before full rollout

Underpowered Experiments

A non-significant result from a low-power test is usually not evidence of no effect.

Action:

  • Quantify achieved power post-hoc
  • Inspect confidence interval width
  • Decide whether to extend, redesign, or stop with explicit uncertainty

Peeking and Early Stopping

Repeatedly checking results and stopping at first significance inflates false positives.

Safer alternatives:

  • Fixed-horizon analysis
  • Sequential methods designed for continuous monitoring
  • Pre-registered decision thresholds

Overdue Experiments

Running far beyond planned horizon can invite p-hacking and context drift.

Risks:

  • Seasonality shifts
  • Product environment changes
  • Stakeholder pressure to force significance

Action:

  • Anchor decisions to the preplanned analysis window
  • Treat extension as a new, documented phase

Discipline Framework

Before launch, define:

  • Sample size and duration
  • Primary endpoint
  • Stopping rule
  • Handling for anomalies and monitoring cadence

Strong process discipline prevents weak inference.

Key Takeaways

  • Extraordinary results require extraordinary validation
  • Non-significance without power context is incomplete
  • Monitoring discipline is a core part of experiment design

In Part 6, we combine everything into a practical decision framework: Trust, Implement, Follow-up.