Pause and Resume Patterns for Kubernetes Operators

Tech reviewed: Deepak Prasad

Operators run continuous reconciliation: they notice drift and bring the cluster back toward desired state. Sometimes you need the opposite—a freeze while humans or pipelines work—without deleting the CustomResource or tearing down the operand. That is what pause and resume patterns are for.

This guide compares spec.paused with annotations, spells out what should still run when paused, covers how pause interacts with finalizers and the limits of Helm-based operators, and shows how to avoid fighting GitOps controllers. You should already understand desired state vs actual state and finalizers before you implement pause in production.


Why operators need an explicit pause contract

Problems pause solves (and problems it creates)

Pause helps when you need to:

  • Stop an operator from overwriting manual hotfixes on child objects.
  • Serialize maintenance (database failover, certificate rotation) without deleting the CR.
  • Coordinate with a GitOps sync that temporarily applies an intermediate commit.

Pause does not magically free CPU on nodes, uninstall Helm releases, or fix bad RBAC—it only changes what your reconciler chooses to do.

Pause vs delete vs scale-to-zero

  • Pause: the CR exists; the operator skips mutating children (per your contract).
  • Delete CR: normal deletion and finalizer flows apply; see finalizers explained.
  • Scale operator Deployment to zero: no reconciliation at all. Webhooks and metrics served by that binary also stop, which often makes this the wrong tool: if a webhook configuration still routes admission requests to that Service, those requests fail or skip validation depending on failurePolicy.

Choosing a pause signal: spec.paused vs annotations

spec.paused on the CRD (schema, defaults, validation)

A boolean spec.paused is easy to document, easy to validate with OpenAPI, and shows up in kubectl explain. It is versioned with the API: if you promote v1beta1 to v1, you carry the field intentionally.

Default paused: false in the CRD schema so new objects are explicit. Consider validation rules if pause must not combine with certain fields (for example “paused and replicas: 0” might be contradictory in your domain).
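A minimal sketch of how the field could look in a Go API type, assuming the CRD is generated with controller-gen markers. The type name WidgetSpec and the replicas field are hypothetical, and the CEL rule only illustrates the "paused and replicas: 0" example; it needs a cluster version that supports CRD validation rules.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WidgetSpec is a hypothetical spec type used only for illustration.
// The XValidation marker encodes the "paused and replicas: 0" example as a
// CEL rule; whether that combination is really invalid depends on your domain.
// +kubebuilder:validation:XValidation:rule="!(self.paused && self.replicas == 0)",message="paused cannot be combined with replicas: 0"
type WidgetSpec struct {
	// Paused asks the operator to stop mutating managed children.
	// +kubebuilder:default=false
	// +optional
	Paused bool `json:"paused"`

	// Replicas is a hypothetical sibling field.
	// +kubebuilder:default=1
	// +kubebuilder:validation:Minimum=0
	// +optional
	Replicas int32 `json:"replicas"`
}

// encode returns the JSON form of the spec. Without omitempty on the tag,
// a false value is written explicitly instead of being dropped.
func encode(s WidgetSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(WidgetSpec{Replicas: 1}))
}
```

Leaving omitempty off the json tag is a deliberate choice here: clients that serialize this type always send paused, so the stored object states the flag explicitly.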

Annotation-based pause when you cannot change the API

If you cannot ship a new CRD schema yet, a conventional annotation such as example.com/paused: "true" works. Downsides: weaker typing, easier typos, and no PAUSED column in kubectl get unless you add a printer column that reads the annotation.

Pick one convention per operator family; mixing annotation and spec without migration confuses users.
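Because annotation values are free-form strings, it helps to parse them in exactly one place. A minimal sketch, assuming the example.com/paused key from above; the helper name isPausedByAnnotation is hypothetical, and treating an unparseable value as "not paused, but invalid" is an assumption you may want to reverse for safety-critical operands.

```go
package main

import (
	"fmt"
	"strconv"
)

// pausedAnnotation is the conventional key used in this article.
const pausedAnnotation = "example.com/paused"

// isPausedByAnnotation reports whether the annotation requests a pause.
// valid is false when the key is present but the value is not a boolean,
// so the caller can surface a typo (for example "ture") in status or events.
func isPausedByAnnotation(annotations map[string]string) (paused, valid bool) {
	raw, ok := annotations[pausedAnnotation]
	if !ok {
		return false, true // absent means not paused
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, false
	}
	return v, true
}

func main() {
	fmt.Println(isPausedByAnnotation(map[string]string{pausedAnnotation: "true"}))
	fmt.Println(isPausedByAnnotation(map[string]string{pausedAnnotation: "ture"}))
}
```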

Versioning and API evolution when pause becomes first-class

When you graduate pause from an annotation into spec.paused, write a one-time migration in reconciliation: if the legacy annotation is set, copy to spec and clear the annotation—or document that annotation wins until users move.
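The copy-then-clear variant of that migration can be sketched as below. Widget is a hypothetical, trimmed-down stand-in for your custom resource, and the function only mutates the in-memory object; writing it back to the API server is left to the caller.

```go
package main

import "fmt"

const legacyPausedAnnotation = "example.com/paused"

// Widget is a hypothetical stand-in for a custom resource.
type Widget struct {
	Annotations map[string]string
	Paused      bool // stands in for spec.paused
}

// migrateLegacyPause copies a legacy "true" annotation into spec.paused and
// removes the annotation. It returns true when the object changed and must
// be persisted. Calling it again on the migrated object is a no-op.
func migrateLegacyPause(w *Widget) bool {
	raw, ok := w.Annotations[legacyPausedAnnotation]
	if !ok {
		return false
	}
	if raw == "true" {
		w.Paused = true
	}
	delete(w.Annotations, legacyPausedAnnotation)
	return true
}

func main() {
	w := &Widget{Annotations: map[string]string{legacyPausedAnnotation: "true"}}
	fmt.Println(migrateLegacyPause(w), w.Paused)
	fmt.Println(migrateLegacyPause(w), w.Paused)
}
```

Note that this variant has the controller write to spec, which can fight a GitOps tool that owns the CR manifest. If that applies to you, the "annotation wins until users move" option avoids the write.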

Making pause visible in kubectl get (columns, printers)

Add additionalPrinterColumns on the CRD so kubectl get shows PAUSED without -o yaml. That reduces support tickets more than any paragraph in the README.
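If you generate the CRD with controller-gen, a printcolumn marker on the root type is one way to produce that additionalPrinterColumns entry. A minimal sketch with hypothetical Widget types; pausedColumn only imitates the value the PAUSED cell would show and is not part of any real API.

```go
package main

import (
	"fmt"
	"strconv"
)

// WidgetSpec is a hypothetical spec type.
type WidgetSpec struct {
	Paused bool `json:"paused"`
}

// Widget is a hypothetical root type. The printcolumn markers ask
// controller-gen to emit additionalPrinterColumns that read .spec.paused
// and the creation timestamp.
// +kubebuilder:object:root=true
// +kubebuilder:printcolumn:name="Paused",type="boolean",JSONPath=".spec.paused"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type Widget struct {
	Spec WidgetSpec `json:"spec"`
}

// pausedColumn imitates the cell value for the Paused column.
func pausedColumn(w Widget) string {
	return strconv.FormatBool(w.Spec.Paused)
}

func main() {
	fmt.Println("PAUSED:", pausedColumn(Widget{Spec: WidgetSpec{Paused: true}}))
}
```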


What “paused” should mean for your operator

Freezing only reconciliation vs freezing all side effects

Be explicit about three layers:

  1. No mutations to managed Deployments, Services, PVCs, Helm releases, etc.
  2. No spec writes back to the CR (unless you allow a controller-owned “observed pause” field).
  3. Status — usually you still update conditions so users see Paused=True.

What still runs when paused (status updates, heartbeats, metrics)

Typical safe behavior while paused:

  • Read the CR and children so status can say “frozen at generation N.”
  • Emit metrics (reconcile skipped counter, paused gauge).
  • Health and readiness probes on the manager Pod.

Decide whether background timers (RequeueAfter) should still fire—often yes, but the reconcile body returns immediately after updating a “paused” condition. See Requeue, RequeueAfter, and error handling.

Whether admission webhooks and conversion still apply

Admission and conversion are API server paths, not your reconciler. Pause usually does not disable them unless you document that explicitly. If validation must reject changes while paused, implement that in webhook logic or CRD validation rules.
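If you do want edits rejected while paused, the core check of a validating webhook could look like this sketch. WidgetSpec and its replicas field are hypothetical, and the strict policy (only flipping paused is allowed) is an assumption; you might instead allow "resume and edit" in a single update.

```go
package main

import (
	"errors"
	"fmt"
)

// WidgetSpec is a hypothetical spec with one user-facing field plus paused.
type WidgetSpec struct {
	Paused   bool
	Replicas int32
}

// validateUpdate sketches the decision a validating webhook could make.
// While the stored object is paused, the only accepted spec change is the
// paused flag itself; every other spec edit is rejected.
func validateUpdate(oldSpec, newSpec WidgetSpec) error {
	if !oldSpec.Paused {
		return nil // not paused: no extra restriction
	}
	// Compare everything except the paused flag (the arguments are copies).
	oldSpec.Paused, newSpec.Paused = false, false
	if oldSpec != newSpec {
		return errors.New("spec is frozen while paused; set spec.paused=false first")
	}
	return nil
}

func main() {
	frozen := WidgetSpec{Paused: true, Replicas: 3}
	fmt.Println(validateUpdate(frozen, WidgetSpec{Paused: true, Replicas: 5}))
	fmt.Println(validateUpdate(frozen, WidgetSpec{Paused: false, Replicas: 3}))
}
```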

Idempotency when toggling pause on and off

Reconcile must tolerate pause flipping true → false at any time. On resume, re-diff children against spec and converge—drift detection patterns apply the moment you start mutating again.


Skipping work: child updates vs deletes

Skipping creates and updates of Deployments, Services, and Helm releases

The common pattern at the top of Reconcile:

text
// Guard at the top of Reconcile: when paused, record status and stop.
if paused(cr) {
  return r.writePausedCondition(ctx, cr)
}

All create/update paths for managed resources sit below that guard. Keep the function small so future contributors cannot bypass the gate accidentally.
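A slightly fuller sketch of the same guard, written against standard-library stand-ins so it runs on its own. Result mirrors only the RequeueAfter field of controller-runtime's reconcile.Result; Widget, its fields, and the five-minute recheck interval are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Result mirrors the RequeueAfter field of controller-runtime's
// reconcile.Result so the sketch compiles without external modules.
type Result struct {
	RequeueAfter time.Duration
}

// Widget is a hypothetical custom resource reduced to what the guard needs.
type Widget struct {
	Paused          bool
	PausedCondition bool // stands in for a Paused status condition
	ChildWrites     int  // counts mutations to managed children
}

// reconcile shows the guard: the paused branch only records status and
// schedules a slow recheck; every child mutation lives below it.
func reconcile(w *Widget) Result {
	if w.Paused {
		w.PausedCondition = true
		return Result{RequeueAfter: 5 * time.Minute}
	}
	w.PausedCondition = false
	w.ChildWrites++ // placeholder for create/update of Deployments, Services, ...
	return Result{}
}

func main() {
	w := &Widget{Paused: true}
	fmt.Println(reconcile(w), w.ChildWrites)
	w.Paused = false
	fmt.Println(reconcile(w), w.ChildWrites)
}
```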

Drift while paused: read-only detect vs ignore

Choose one user-visible story:

  • Silent ignore — simplest; document that manual edits persist until resume.
  • Condition only — set DriftDetected=True with a message, still no writes.
  • Events — emit warnings sparingly to avoid reconcile storms.
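The "condition only" option can be sketched as a pure comparison that never writes to the child. Child is a hypothetical flattened view of the fields your operator owns; a real operator would compare typed objects, and the message format is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Child is a hypothetical view of a managed object: only the fields the
// operator owns, flattened to strings for comparison.
type Child struct {
	Fields map[string]string
}

// detectDrift returns the sorted names of owned fields whose live value
// differs from the desired value. It reads both inputs and writes nothing.
func detectDrift(desired, live Child) []string {
	var drifted []string
	for key, want := range desired.Fields {
		if got, ok := live.Fields[key]; !ok || got != want {
			drifted = append(drifted, key)
		}
	}
	sort.Strings(drifted)
	return drifted
}

// driftCondition turns the result into the status and message of a
// DriftDetected condition.
func driftCondition(drifted []string) (detected bool, message string) {
	if len(drifted) == 0 {
		return false, "children match spec"
	}
	return true, "drift in fields: " + strings.Join(drifted, ", ")
}

func main() {
	desired := Child{Fields: map[string]string{"image": "app:v1", "replicas": "3"}}
	live := Child{Fields: map[string]string{"image": "app:hotfix", "replicas": "3"}}
	fmt.Println(driftCondition(detectDrift(desired, live)))
}
```

Only fields present in desired are compared, so extra fields that other controllers add to the live object do not count as drift.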

Deletes: honoring user intent when the CR is deleted while paused

Deletion should not stall because of pause unless you intentionally block removal for compliance. Usually you ignore pause once deletion starts and run normal finalizer cleanup, because resources must still be removable. If you do block, surface it in status and events before someone runs kubectl delete --force.

Finalizers while paused (block vs expedite cleanup)

If pause skipped creating a child, deletion might have less to tear down—still run the same cleanup DAG you would run when unpaused. Do not skip finalizer removal just because paused was true; that creates orphaned cloud resources. Finalizers and multi-resource reconciliation go into the ordering details.


Interaction with finalizers and lifecycle

Adding or removing finalizers under pause

Adding a protection finalizer while paused is fine if your contract says so; removing one early is almost always wrong. Keep finalizer mutations idempotent—same as any other reconcile path.

Cleanup ordering when deletionTimestamp is set

When metadata.deletionTimestamp appears, pause should usually not short-circuit finalizer work. Typical pattern: if deleting, ignore pause for cleanup branches or treat pause as “no new spec work” but “deletion still proceeds.”
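One way to make that ordering hard to get wrong is to decide the branch in a single small function. A minimal sketch; the action names are hypothetical, and the deleting argument stands in for metadata.deletionTimestamp being set.

```go
package main

import "fmt"

type action string

const (
	actionCleanup   action = "cleanup"   // run finalizer teardown
	actionPaused    action = "paused"    // update status only
	actionReconcile action = "reconcile" // normal convergence
)

// decide orders the branches so that deletion always wins over pause.
func decide(deleting, paused bool) action {
	switch {
	case deleting:
		return actionCleanup
	case paused:
		return actionPaused
	default:
		return actionReconcile
	}
}

func main() {
	fmt.Println(decide(true, true))
	fmt.Println(decide(false, true))
	fmt.Println(decide(false, false))
}
```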

Pausing mid-finalizer: safe patterns vs foot-guns

Pausing while a long external delete (object storage bucket) runs can leave users thinking “nothing is happening.” Prefer a Phase such as Deleting with progress in status rather than a literal paused flag during teardown—words matter in incident bridges.


Upgrades, migrations, and Helm-based operators

Pausing during CRD or API version rollout

Pause reduces new spec churn while you roll out a new storage version or conversion webhook, but it does not replace the ordering in CRD version upgrades and conversion webhooks. You still need TLS, caBundle, and migration jobs correct before traffic returns.

Helm operator “ceiling” and what pause cannot do

Helm-based operators (Operator SDK helm plugin) map CR fields to chart values. Pause can stop upgrades, but it cannot invent arbitrary Go logic beyond what the chart and watches.yaml express—that ceiling is why teams move to Go or hybrid operators. Read Helm-based Operator Part 2 — lifecycle, drift, hooks, scope, ceiling before you promise enterprise pause semantics on a pure Helm operator.

When pause is the wrong tool (use maintenance windows instead)

If the problem is data migration at a specific time, a Job or runbook plus a maintenance annotation on the operand namespace may fit better than a CR-level pause flag. Pause is for controller behavior, not every operational freeze.


Documenting semantics for users

User-facing contract: one short “Pause behavior” section

Ship a table in your docs with columns for user action, operand, status, and metrics. Example rows: “patch Deployment by hand,” “delete Pod,” “edit CR spec while paused.”

Examples: YAML snippets for pause on and off

Show kubectl patch for both spec.paused and annotation styles so copy-paste works in restricted environments.
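The payloads for both styles are small JSON merge patches. The sketch below builds them with the standard library, assuming the example.com/paused key used earlier; the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specPausePatch builds a JSON merge patch that sets spec.paused.
func specPausePatch(paused bool) string {
	b, _ := json.Marshal(map[string]any{
		"spec": map[string]any{"paused": paused},
	})
	return string(b)
}

// annotationPausePatch builds a JSON merge patch for the annotation style.
// Pausing sets the value to "true"; resuming sets it to null, which a merge
// patch interprets as "remove this key".
func annotationPausePatch(paused bool) string {
	var value any // nil marshals to JSON null
	if paused {
		value = "true"
	}
	b, _ := json.Marshal(map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]any{"example.com/paused": value},
		},
	})
	return string(b)
}

func main() {
	fmt.Println(specPausePatch(true))
	fmt.Println(annotationPausePatch(true))
	fmt.Println(annotationPausePatch(false))
}
```

Each printed string can be passed to kubectl patch with --type merge and -p, or sent by any client that supports merge patches.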

Status conditions: Paused, Progressing, and messages

Align with status and conditions: Paused=True should carry reason and message (MaintenanceWindow, ManualInvestigation). When resume begins, flip to Progressing=True until generation catches up.
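The pause and resume transitions can be sketched with a small condition helper. Condition mirrors a subset of metav1.Condition so the example needs no external modules; the reasons Resumed and Reconciling and the messages are hypothetical. In a real operator you would more likely use meta.SetStatusCondition from k8s.io/apimachinery.

```go
package main

import "fmt"

// Condition mirrors a subset of the fields of metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	ObservedGeneration int64
}

// setCondition replaces the condition with the same Type or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// statusOf returns the Status of the named condition, or "" if absent.
func statusOf(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return ""
}

// markPaused records why the object is frozen and at which generation.
func markPaused(conds []Condition, gen int64, reason, message string) []Condition {
	return setCondition(conds, Condition{
		Type: "Paused", Status: "True", Reason: reason,
		Message: message, ObservedGeneration: gen,
	})
}

// markResumed clears Paused and reports Progressing=True until the
// reconciler has caught up with the current generation.
func markResumed(conds []Condition, gen int64) []Condition {
	conds = setCondition(conds, Condition{
		Type: "Paused", Status: "False", Reason: "Resumed",
		Message: "reconciliation resumed", ObservedGeneration: gen,
	})
	return setCondition(conds, Condition{
		Type: "Progressing", Status: "True", Reason: "Reconciling",
		Message: "converging children to spec", ObservedGeneration: gen,
	})
}

func main() {
	conds := markPaused(nil, 7, "MaintenanceWindow", "frozen for database failover")
	conds = markResumed(conds, 8)
	fmt.Println(statusOf(conds, "Paused"), statusOf(conds, "Progressing"))
}
```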


Optional: GitOps (Argo CD) without fighting the operator

Who owns desired state when pause is on

GitOps wants the live cluster to match Git. Your operator wants live to match CR spec. When pause stops reconciliation, live may diverge from Git if someone edited manifests out of band—document who wins on resume.

Avoiding sync loops when the operator stops mutating children

If Argo keeps re-applying the same Deployment manifest while a human patched the live Deployment under pause, you can get thrash on resume. Mitigations:

  • Sync windows or temporary ignore annotations Argo understands.
  • ApplicationSet per environment so prod stays frozen while dev moves.
  • Respect operator pause in your pipeline: pause Argo auto-sync until the operator resumes.

Patterns: compare-only and diff suppression

Helm-based Operator vs Flux vs Argo CD compares ownership models. The general rule: one writer per field—if Git owns a manifest field, the operator should not fight it, paused or not.


Testing pause behavior you ship

Table-driven scenarios you should run in CI

Cover at least:

  • pause on then spec edit → no child change until resume.
  • pause on then delete CR → finalizers still complete (or your documented exception fires).
  • resume → convergence from a deliberately drifted child.
  • webhook / conversion still behave (smoke test an apply).
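The first three scenarios can be expressed as a table run against a toy reconciler. Everything here is a hypothetical in-memory model (world, reconcile, the image names); in a real project the table would live in a _test.go file and drive envtest or a kind cluster instead.

```go
package main

import "fmt"

// world is a hypothetical in-memory stand-in for cluster state.
type world struct {
	paused, deleting bool
	specImage        string // desired image from the CR spec
	childImage       string // image on the managed Deployment
	finalizer        bool   // true while the operator's finalizer is present
}

// reconcile is a toy reconciler implementing the contract under test:
// deletion beats pause, pause blocks child writes, resume converges.
func reconcile(w *world) {
	switch {
	case w.deleting:
		w.finalizer = false // cleanup done, release the object
	case w.paused:
		// status-only path: no child writes
	default:
		w.childImage = w.specImage
	}
}

type scenario struct {
	name  string
	steps []func(*world)   // each step is followed by one reconcile
	check func(world) bool // assertion on the final state
}

func scenarios() []scenario {
	return []scenario{
		{"pause then spec edit leaves child untouched",
			[]func(*world){
				func(w *world) { w.paused = true },
				func(w *world) { w.specImage = "app:v2" },
			},
			func(w world) bool { return w.childImage == "app:v1" }},
		{"pause then delete still releases the finalizer",
			[]func(*world){
				func(w *world) { w.paused = true },
				func(w *world) { w.deleting = true },
			},
			func(w world) bool { return !w.finalizer }},
		{"resume converges a drifted child",
			[]func(*world){
				func(w *world) { w.paused = true },
				func(w *world) { w.childImage = "app:hotfix" },
				func(w *world) { w.paused = false },
			},
			func(w world) bool { return w.childImage == "app:v1" }},
	}
}

// run executes every scenario and returns the names of the failures.
func run() []string {
	var failed []string
	for _, sc := range scenarios() {
		w := world{specImage: "app:v1", childImage: "app:v1", finalizer: true}
		for _, step := range sc.steps {
			step(&w)
			reconcile(&w)
		}
		if !sc.check(w) {
			failed = append(failed, sc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed scenarios:", run())
}
```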

Kind / Minikube checks for delete-while-paused and unpause recovery

Automate a script that applies fixtures, toggles pause with patch, captures object generations, and asserts metrics counters if you export them—testing with envtest and kind is the right series entry for harness details.


FAQ

Should pause block manual kubectl delete pod?
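No. Pod deletion is handled by the kubelet and the workload controller, not by your operator. A Pod owned by a Deployment or StatefulSet is recreated by that controller even while the operator is paused; if your operator creates Pods directly, your docs should say whether it recreates them on resume.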

Can two controllers implement different pause annotations?
You can, but users will hate it. Prefer one namespace-level or CR-level contract per system.

Does pause reduce apiserver load?
Often yes, if you also stop noisy status loops—see avoid reconcile loop explosions.


Bottom line: pick spec.paused or one annotation convention, document exactly what still runs (status, metrics, webhooks), never strand users in Terminating because of pause, and treat Helm and GitOps ownership as part of the same contract—not an afterthought.
