DevOps

Kubernetes Upgrade Runbook 2026: Safe Minor-Version Upgrades with Fast Rollback

When teams say a Kubernetes upgrade was “easy,” it usually means they had a runbook before they needed one. Most incidents around control-plane upgrades are not caused by exotic bugs; they happen because of missing prechecks, version-skew mistakes, or no tested rollback plan. If you run production clusters, the goal is not just to upgrade—it is to upgrade predictably.

This guide gives you a practical, field-tested workflow for minor-version upgrades (for example, 1.35.x to 1.36.x) with kubeadm-based clusters. It is written for operators who want zero-surprise maintenance windows and clean post-upgrade validation.

Why this matters now

Community discussions in DevOps circles and release-week incident reports keep repeating the same pattern: teams rush upgrades for security/compliance reasons, then discover skewed components, stale add-ons, or untested node draining behavior. Kubernetes docs are clear that skipping minor versions is unsupported, and component version skew is strictly bounded. Respecting that policy is the difference between a routine change and a 2 a.m. rollback.

Upgrade design principles

  • One minor version at a time: never jump across unsupported minors.
  • Control plane first, then worker nodes in controlled batches.
  • Treat add-ons as first-class dependencies: CNI, CSI, metrics, ingress, and policy engines can block success (see the inventory sketch after this list).
  • Rollback is a feature: if you cannot quickly restore etcd state and known-good binaries, you are not ready.
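
A quick way to act on that add-on point is to inventory every container image running in the cluster before the window and compare it against each add-on's support matrix for the target minor. A minimal, read-only sketch:

kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u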

Pre-upgrade checklist (non-negotiable)

  1. Read release notes and API deprecations. Identify workloads relying on removed APIs (see the checks after this list).
  2. Validate skew policy. Confirm kubelet, kube-proxy, and control-plane versions are in supported ranges.
  3. Back up etcd and manifests. Keep backups outside the node and test restore steps.
  4. Freeze noisy deployments. Pause nonessential CI/CD promotions during the maintenance window.
  5. Drain rehearsal. Run a dry maintenance exercise on non-prod to expose PodDisruptionBudget constraints.
  6. Health baseline snapshot. Capture metrics and alerts before touching anything.
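
Two quick, read-only checks that support items 1 and 5: the API server exposes a metric that flags callers of deprecated APIs, and listing PodDisruptionBudgets shows which ones could block a drain. Output and names will vary per cluster; treat this as a sketch, not a verdict.

kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
kubectl get pdb -A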

Commands you can actually use

1) Baseline cluster state

kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running
kubectl get --raw='/readyz?verbose'
kubectl top nodes

2) Backup etcd on a control-plane node

export ETCDCTL_API=3
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-$(date +%F-%H%M).db
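
Before moving on, verify the snapshot is readable and copy it off the node. A minimal check, assuming etcdutl (shipped alongside recent etcd releases) is available and substituting your actual snapshot file name:

etcdutl snapshot status /var/backups/<snapshot-file>.db --write-out=table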

3) Plan the target upgrade

kubeadm upgrade plan

4) Upgrade first control-plane node

apt-mark unhold kubeadm && apt-get update
apt-get install -y kubeadm='1.36.x-*'
apt-mark hold kubeadm

kubeadm upgrade apply v1.36.x

apt-mark unhold kubelet kubectl && apt-get update
apt-get install -y kubelet='1.36.x-*' kubectl='1.36.x-*'
apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
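
Before touching any worker, confirm the static control-plane pods came back on the target version. A hedged check, assuming the default tier=control-plane label that kubeadm puts on its static pods:

kubectl get nodes
kubectl -n kube-system get pods -l tier=control-plane -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'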

5) Drain and upgrade worker nodes in batches

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# on the node: upgrade the kubeadm package first, then run:
kubeadm upgrade node
# then upgrade the kubelet and kubectl packages and restart the kubelet:
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon <node>
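
Before uncordoning a node and starting the next batch, gate on readiness and the reported kubelet version instead of a timer. A minimal sketch:

kubectl wait --for=condition=Ready node/<node> --timeout=5m
kubectl get node <node> -o jsonpath='{.status.nodeInfo.kubeletVersion}{"\n"}'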

Post-upgrade validation (what teams skip)

  • All nodes report Ready and expected versions.
  • CoreDNS, CNI, CSI, ingress, metrics-server are healthy.
  • Error rates, latency, and restart counts remain inside normal bands.
  • No surge in admission failures, webhook timeouts, or image pull failures.
  • Key business transactions pass synthetic checks.

Useful quick checks:

kubectl get nodes
kubectl get pods -A | grep -E 'CrashLoopBackOff|Error|Pending'
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
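
For the add-on side of the list above, a quick hedged pass; the coredns deployment name matches kubeadm defaults, the busybox image tag is only an example, and other add-on names will vary by cluster:

kubectl -n kube-system rollout status deployment/coredns --timeout=2m
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default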

Rollback strategy in plain language

Rollback should be explicit in your change ticket before you start. At minimum, define:

  • Which condition triggers rollback (for example, sustained 5xx over 10 minutes).
  • Who approves rollback and who executes it.
  • Where the verified etcd snapshot lives and how to restore it (a restore sketch follows this list).
  • How to pin package versions back to known-good builds.
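
For the etcd piece, here is a hedged, single-node restore sketch for a stacked-etcd kubeadm control plane. Paths, the snapshot name, and the node name/IP are illustrative; they should match the flags in your etcd static pod manifest, and multi-member clusters need the full documented restore procedure.

mkdir -p /etc/kubernetes/manifests.stopped
# move the etcd and API server static-pod manifests aside so kubelet stops them
mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests.stopped/
# keep the old data dir for later inspection
mv /var/lib/etcd /var/lib/etcd.broken
etcdutl snapshot restore /var/backups/<snapshot-file>.db \
  --name <control-plane-node-name> \
  --initial-cluster <control-plane-node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir /var/lib/etcd
# putting the manifests back lets kubelet restart etcd and the API server
mv /etc/kubernetes/manifests.stopped/*.yaml /etc/kubernetes/manifests/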

In practice, the fastest recoveries happen when teams keep immutable node images for the previous version and can rejoin nodes quickly, rather than improvising package downgrades during outage pressure.

Operational pattern that works

A reliable cadence for most teams is: weekly non-prod upgrades, biweekly production windows, and a standardized validation template used by both platform and application owners. This prevents upgrade debt from piling up and keeps each jump small and predictable.

If your cluster supports critical customer workloads, pair the runbook with a 30-minute game day each quarter. Simulate a failed control-plane upgrade and practice restore + communication. It is one of the highest ROI drills a platform team can run.

Final take

Kubernetes upgrades are not scary when your process is boring. Boring is good. Boring means prechecks, controlled sequencing, dependency awareness, and a rollback you have tested before production. Build that muscle, and upgrades stop being emergencies and become routine operations.
