20 · Rollout Strategies
Deploying a new version of an application is a controlled replacement of running processes. The question is not whether to replace them, but how — how quickly, how safely, and how reversibly.
Rollout strategy is the answer to that question.
The Problem with Hard Cutover
The naive approach is to stop the old version and start the new one. This is simple to reason about but produces a window of unavailability between the two states. For most production systems, that window is unacceptable.
The strategies below all attempt to close that window — each with different trade-offs between safety, speed, resource cost, and operational complexity.
RollingUpdate — The Default
Replace old pods gradually, a few at a time, while the service continues to handle traffic.
Before: [v1] [v1] [v1] [v1]
Step 1: [v1] [v1] [v1] [v2] ← one replaced
Step 2: [v1] [v1] [v2] [v2]
Step 3: [v2] [v2] [v2] [v2] ← done
Two parameters control the pace:
- maxSurge: how many extra pods may exist above the desired count during the rollout. A surge of 1 means the cluster temporarily runs n+1 pods.
- maxUnavailable: how many pods may be absent below the desired count. Setting this to 0 guarantees full capacity is maintained throughout.
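The combination maxSurge: 1, maxUnavailable: 0 is the safest choice: each new pod must start and pass its health checks before an old one is removed, so capacity never drops below the desired count. Kubernetes itself defaults both values to 25%, which trades some capacity for a faster rollout.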
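To make the interaction of the two parameters concrete, here is a minimal simulation sketch. It is not how the Deployment controller is implemented. The function name `rolling_update` is hypothetical, and it assumes a new pod becomes ready the instant it is created, which real pods do not.

```python
from typing import List, Tuple


def rolling_update(desired: int, max_surge: int, max_unavailable: int) -> List[Tuple[int, int]]:
    """Return the (old_pods, new_pods) state after every scale-up or scale-down.

    Simplifying assumption: a new pod is ready as soon as it is created.
    """
    if max_surge == 0 and max_unavailable == 0:
        # With both at zero the rollout can neither add nor remove a pod.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")

    old, new = desired, 0
    states = [(old, new)]
    while old > 0 or new < desired:
        # Scale up: the total may not exceed desired + max_surge.
        create = min(desired + max_surge - (old + new), desired - new)
        if create > 0:
            new += create
            states.append((old, new))
        # Scale down: the total may not drop below desired - max_unavailable.
        remove = min(old, (old + new) - (desired - max_unavailable))
        if remove > 0:
            old -= remove
            states.append((old, new))
    return states


if __name__ == "__main__":
    for old, new in rolling_update(desired=4, max_surge=1, max_unavailable=0):
        print(f"v1={old} v2={new} total={old + new}")
```

With `desired=4, max_surge=1, max_unavailable=0` the total alternates between 5 and 4 pods and never falls below 4. Swapping the values (`max_surge=0, max_unavailable=1`) keeps the total at 4 or below and lets it dip to 3.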
Both versions run simultaneously
During a rolling update, old and new pods serve traffic at the same time. The application must be backwards compatible — database schema changes, API contracts, and message formats must remain valid across both versions.
Recreate — Accept the Downtime
Stop all old pods, then start all new pods.
This is the right strategy when backwards compatibility is impossible — a breaking schema migration, a protocol change, or a stateful initialisation that cannot run alongside the old version.
The downtime is deliberate and bounded. It is operationally honest: the system acknowledges it cannot guarantee continuity and makes the transition window visible.
Blue/Green — Instant Switch
Run two complete environments in parallel. The active one (blue) serves all traffic. The new one (green) is deployed and validated while idle. Traffic switches in a single atomic operation.
Blue (active): [v1] [v1] [v1] [v1] ← all traffic here
Green (idle): [v2] [v2] [v2] [v2] ← fully deployed, no traffic
↓ switch
Blue (idle): [v1] [v1] [v1] [v1] ← kept for rollback
Green (active): [v2] [v2] [v2] [v2] ← all traffic here
The key property is instant rollback: if the new version behaves unexpectedly, traffic reverts to blue in one operation. No re-deployment is needed.
The cost is resource duplication. Running two full environments doubles the infrastructure footprint for the duration of the validation window. This is acceptable when the cost of a bad deployment is high enough to justify it.
Blue/green requires two complete healthy environments before switching
The switch is only safe if green is fully ready. The validation period — observing metrics, running smoke tests, checking error rates — happens before the switch, not during it.
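The switch-and-rollback mechanics can be sketched as a single pointer that names the active environment. This is an illustrative model only: `BlueGreenRouter` and the `is_healthy` callback are hypothetical names, and in a real cluster the pointer would be something like a load balancer target or a service selector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class BlueGreenRouter:
    environments: Dict[str, str]      # environment name -> deployed version
    active: str                       # environment currently receiving all traffic
    previous: Optional[str] = None    # kept warm for rollback

    def route(self) -> str:
        """Every request goes to the version in the active environment."""
        return self.environments[self.active]

    def switch(self, target: str, is_healthy: Callable[[str], bool]) -> None:
        """Move all traffic to `target`, but only after it has been validated."""
        if target not in self.environments:
            raise KeyError(f"unknown environment: {target}")
        if not is_healthy(target):
            raise RuntimeError(f"{target} failed validation; traffic stays on {self.active}")
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """Revert to the previous environment in one step, with no re-deployment."""
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.active, self.previous = self.previous, self.active


if __name__ == "__main__":
    router = BlueGreenRouter({"blue": "v1", "green": "v2"}, active="blue")
    router.switch("green", is_healthy=lambda env: True)  # stand-in for smoke tests
    print(router.route())
    router.rollback()
    print(router.route())
```

Validation runs inside `switch` before the pointer moves, which mirrors the rule that green must be proven healthy before it sees any traffic.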
Canary — Gradual Traffic Shift
Route a small percentage of traffic to the new version first. Observe. Expand if healthy. Roll back if not.
Step 1: 90% → [v1] 10% → [v2] ← small exposure
Step 2: 50% → [v1] 50% → [v2] ← if metrics look good
Step 3: 0% → [v1] 100% → [v2] ← fully promoted
Canary is the most operationally complex strategy but the most information-rich. It exposes the new version to real production load and real user behaviour before full rollout. Issues surface in a controlled blast radius.
Canary answers a question that no pre-production environment can: does this version behave correctly under real traffic?
Traffic shifting can be implemented in two ways:
- Replica-based: the share of canary pods approximates the traffic split. This is imprecise. With 4 replicas, a 25% canary is a single pod, the split cannot go below 25%, and load balancing does not guarantee that pod receives exactly a quarter of the requests.
- Traffic-weighted: a service mesh (for example, Istio) or gateway routes requests by weight, independent of replica count. This is precise: 10% of requests go to the canary regardless of how many pods are running.
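The difference between the two approaches can be illustrated with a small sketch. Both function names are hypothetical. The weighted router here is deterministic so the result is easy to check; a real mesh or gateway typically picks a destination per request by weight, so its split is exact only on average.

```python
def replica_based_share(stable_replicas: int, canary_replicas: int) -> float:
    """Approximate canary traffic share when the split is set by pod count alone."""
    return canary_replicas / (stable_replicas + canary_replicas)


def weighted_route(request_index: int, canary_weight: int) -> str:
    """Route a request by weight (integer percent, 0-100), independent of pod count.

    A request goes to the canary whenever the running quota
    floor(k * weight / 100) increases, which yields exactly `canary_weight`
    canary requests in every block of 100.
    """
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    before = request_index * canary_weight // 100
    after = (request_index + 1) * canary_weight // 100
    return "canary" if after > before else "stable"


if __name__ == "__main__":
    # Replica-based: 3 stable pods + 1 canary pod is the smallest step available.
    print(replica_based_share(stable_replicas=3, canary_replicas=1))

    # Traffic-weighted: a 10% weight holds no matter how many pods exist.
    hits = sum(weighted_route(i, 10) == "canary" for i in range(1000))
    print(hits / 1000)
```

With replica-based splitting, reaching a 10% canary requires at least 10 pods in total, whereas the weighted router reaches it with any replica count.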
Choosing a Strategy
| Strategy | Downtime | Rollback speed | Resource cost | Complexity |
|---|---|---|---|---|
| RollingUpdate | None | Re-deploy | Minimal | Low |
| Recreate | Yes | Re-deploy | Minimal | None |
| Blue/Green | None | Instant | 2× during validation | Medium |
| Canary | None | Immediate | Modest | High |
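The resource-cost column can be made concrete with a rough peak-footprint calculation. This is a sketch under stated assumptions: `peak_pods` is a hypothetical helper, the RollingUpdate figure depends on the chosen maxSurge, and the Canary figure assumes canary pods are added on top of the stable set rather than carved out of it.

```python
def peak_pods(strategy: str, desired: int, max_surge: int = 1, canary_replicas: int = 1) -> int:
    """Estimate the largest number of pods running at once during a rollout."""
    if strategy == "RollingUpdate":
        return desired + max_surge        # old pods plus the allowed surge
    if strategy == "Recreate":
        return desired                    # old pods are gone before new ones start
    if strategy == "Blue/Green":
        return 2 * desired                # two complete environments side by side
    if strategy == "Canary":
        return desired + canary_replicas  # stable set plus the canary pods
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("RollingUpdate", "Recreate", "Blue/Green", "Canary"):
        print(f"{name}: {peak_pods(name, desired=4)} pods at peak")
```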
The strategies are not ranked — each fits a different risk profile:
- RollingUpdate is the right default for most stateless services with backwards-compatible changes.
- Recreate is correct when version coexistence is impossible, and downtime is preferable to data corruption.
- Blue/Green is appropriate when validation must happen before any user sees the change, and rollback speed is critical.
- Canary is appropriate when the cost of a bad deployment is high, downtime is unacceptable, and the team has the observability to make meaningful decisions during the rollout.
Canary without observability is just a slow rollout
The value of canary is the ability to decide based on signal — error rates, latency, business metrics. Without dashboards and alerts tuned to the new version, canary collapses into a delayed RollingUpdate with extra operational burden.
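Deciding on signal can be as simple as comparing the canary's metrics with the stable baseline at each step. The sketch below is illustrative only: the `Metrics` type, the `canary_verdict` function, and every threshold are hypothetical and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Metrics,
    canary: Metrics,
    min_requests: int = 500,                 # assumed minimum sample size
    max_error_rate_increase: float = 0.01,   # assumed absolute tolerance
    max_latency_ratio: float = 1.2,          # assumed relative tolerance
) -> str:
    """Return "wait", "rollback", or "promote" for the current canary step."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    if canary.error_rate > baseline.error_rate + max_error_rate_increase:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    stable = Metrics(requests=10_000, errors=20, p99_latency_ms=180.0)
    candidate = Metrics(requests=1_000, errors=40, p99_latency_ms=185.0)
    print(canary_verdict(stable, candidate))
```

Without the inputs to a function like this, each step of the rollout proceeds on elapsed time alone, which is exactly the delayed RollingUpdate described above.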