Kubelet decides when to restart your operator pod, and Kubernetes
decides when to send traffic to it, by polling two HTTP endpoints
your operator exposes: /healthz and /readyz. Controller-runtime
gives you the HTTP server for free; what you put behind those
endpoints determines whether your operator survives transient
hiccups, deploys cleanly, and stays out of restart loops.
This article covers the difference between liveness and readiness, the checks worth adding to each, the probe timing settings that prevent restart cascades, and concrete checks for cache sync, leader election, webhook certs, and external dependencies.
TL;DR — operator probes in 60 seconds
kubebuilder scaffolds the right defaults already. The two things
worth verifying:
// main.go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
HealthProbeBindAddress: ":8081", // address that serves /healthz and /readyz
})
// At the end of main, before mgr.Start:
if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
return err
}
if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
return err
}

# config/manager/manager.yaml
spec:
  template:
    spec:
      containers:
      - name: manager
        livenessProbe:
          httpGet: { path: /healthz, port: 8081 }
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet: { path: /readyz, port: 8081 }
          initialDelaySeconds: 5
          periodSeconds: 10

Then add specific checks for your operator's dependencies (cache sync, webhook certs, external APIs). The rest of this article walks through which checks go where, the production-grade cache-sync pattern, the probe timing knobs that prevent restart loops, and the pitfalls that most often ship to real clusters.
A quick analogy: the security guard's clipboard
Imagine a security guard at a building entrance with two checklists:
- "Is the building on fire?" — yes/no, asked every minute. If yes, evacuate (i.e. restart the building).
- "Are the doors unlocked and the elevators working?" — yes/no, asked every minute. If no, redirect visitors to another entrance (i.e. remove from Service endpoints) but don't burn the building down.
That's exactly the split between liveness and readiness:
| Guard's check | Kubernetes probe |
|---|---|
| Is the building on fire? | Liveness probe (/healthz) |
| Are the doors and elevators working? | Readiness probe (/readyz) |
| Evacuate the building | Restart the container |
| Redirect visitors elsewhere | Remove from Endpoints / no new traffic |
A building can fail readiness (elevator out) without being on fire — the right response is to redirect, not evacuate. A building can fail liveness (something is genuinely broken beyond recovery) — the right response is full evacuation and a fresh start.
Prerequisites
- A scaffolded operator with the default kubebuilder / Operator SDK main.go. That is where HealthProbeBindAddress is wired up.
- Familiarity with the controller-runtime architecture: probes plug into the Manager, alongside the cache and webhook server.
- Optional: leader election knowledge if you want lease-aware readiness (see Step 6).
Why operator probes matter
An operator usually has no external HTTP traffic, so it is tempting to leave probes at "200 OK on both endpoints and call it done." Three reasons that's a bad idea:
1. Rolling deploys depend on readiness, even with zero traffic
Kubernetes marks a new replica "ready" the moment its readiness
probe passes — and that is the trigger to terminate the old
replica. With a flat /readyz (always 200), the new pod is
"ready" before its informer cache has synced, the old pod is gone,
and the cluster spends 5–15 seconds with no operator that can
correctly answer Get/List from cache. Drift correction stalls and
reconcile loops silently
miss work during the window. The fix is a proper
cache sync readiness probe, covered in Step 4.
2. Liveness can amplify a control-plane outage
A liveness check that calls the Kubernetes API server feels
sensible — "if I can't talk to the API, kill me." But every
operator pod in the cluster will fire that same probe at the same
time. When the API hiccups, every pod restarts simultaneously,
which is an operator restart cascade that turns a 10-second blip
into a multi-minute recovery. Liveness must be local to the
pod — usually nothing more than healthz.Ping.
3. Admission webhooks fail-closed without readiness
If your operator serves
admission webhooks,
those are HTTP traffic — and the API server starts sending
admission requests the moment kubelet flips your pod to "ready."
A flat readiness probe means webhook calls fail (TLS not yet
loaded, informers not synced) and your failurePolicy: Fail
webhook blocks every cluster write until the pod warms up. Gating
readiness on mgr.GetWebhookServer().StartedChecker() and on
cache sync makes this contract correct.
The rest of this article walks through the checks that make these three guarantees real.
Step 1: What /healthz and /readyz look like
After mgr.Start, the operator pod serves:
kubectl -n memcached-operator-system port-forward deploy/memcached-operator-controller-manager 8081
curl http://localhost:8081/healthz
# ok
curl http://localhost:8081/readyz
# ok
curl 'http://localhost:8081/readyz?verbose'
# [+]ping ok
# [+]informer-cache-sync ok
# [+]webhook-cert ok
# readyz check passed

The ?verbose query parameter lists every registered check and its
result, which is invaluable for debugging "why does my probe fail?"
The URL is quoted so that shells such as zsh do not treat the ? as a glob character.
The kubebuilder scaffold binds the probe server to :8081 on all
interfaces (0.0.0.0). Leave it that way: kubelet sends httpGet
probes to the pod IP, so a server bound only to 127.0.0.1 would not
be reachable by the probes.
Step 2: Liveness — when to restart
Liveness probes ask: "is this pod dead and needs a kick?" Failing liveness gets you a container restart (and a stuck-in-restart-loop risk if the failure is permanent).
What belongs in /healthz:
- healthz.Ping: answers "yes, I am alive". The minimum.
- A "no progress" detector: "if no reconcile has completed in the last N minutes, something is wrong". Useful but tricky; N must exceed the longest expected idle period.
What does not belong:
- Kubernetes API calls. If the API server hiccups, all your liveness probes fail simultaneously, all your operator pods restart, the outage gets worse. Liveness must be local.
- Database / external API checks. Same reason. A flaky external service should not restart your operator; it should fail readiness (if you care) or be silent.
- Cache sync. Cache sync is a startup signal, not a "kill me" signal. Goes in readiness.
A "no progress" liveness check, as a more advanced pattern, ties liveness to actual reconcile activity rather than to "the process is still running" — see the reconcile loop explained for why a deadlocked workqueue can leave the HTTP server happily returning 200 long after the operator has stopped doing useful work:
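// imports: fmt, net/http, sync, time
type lastReconcileCheck struct {
	mu            sync.RWMutex
	lastReconcile time.Time
}

// newLastReconcileCheck starts the clock at construction, so the check does
// not fail before the first reconcile has had a chance to run. (A zero
// time.Time would make the check fail immediately.)
func newLastReconcileCheck() *lastReconcileCheck {
	return &lastReconcileCheck{lastReconcile: time.Now()}
}

func (c *lastReconcileCheck) Touch() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastReconcile = time.Now()
}

func (c *lastReconcileCheck) Check(_ *http.Request) error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if time.Since(c.lastReconcile) > 30*time.Minute {
		return fmt.Errorf("no reconcile in 30 min")
	}
	return nil
}

// In main: the Reconcile body calls check.Touch() on success.
check := newLastReconcileCheck()
if err := mgr.AddHealthzCheck("reconcile-progress", check.Check); err != nil {
	return err
}

Use sparingly: it adds operational complexity. For most operators
healthz.Ping is enough.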
Step 3: Readiness — when to send traffic
Readiness probes ask: "is this pod ready to do its job?" Failing readiness removes the pod from Service endpoints but leaves it running.
For an operator that doesn't serve HTTP traffic, why bother with readiness? Two reasons:
- Rolling deploys. Kubernetes considers a new replica "ready" when its readiness probe passes. With a flat readiness probe (always 200), the new pod is "ready" the moment its container starts — before informers have synced. Kubernetes terminates the old pod, you get a brief window where no replica has a synced cache, and reconciles stall.
- Admission webhooks. If your operator serves webhooks, those are HTTP traffic. Webhook calls fail until certificates are loaded and informers are synced. Readiness gates that traffic.
What belongs in /readyz:
| Check | Why |
|---|---|
| healthz.Ping | Basic process liveness, included automatically |
| Informer cache sync | The operator can answer Get/List from cache |
| Webhook cert loaded | If admission webhooks are enabled |
| Webhook server ready | If admission webhooks are enabled |
| External dependency reachable | Only if the operator literally cannot work without it |
The first two are mandatory for any operator. The latter three depend on the operator's design.
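To make the table concrete, here is a minimal stdlib-only sketch of how several named checks combine into one readiness verdict: the endpoint is healthy only if every check returns nil. The `runChecks` helper, `alwaysOK`, and `notSynced` are hypothetical names for illustration. Controller-runtime performs this aggregation for you once you call `mgr.AddReadyzCheck`, and its real implementation differs.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sort"
)

// checker has the same shape as a controller-runtime healthz check.
type checker func(req *http.Request) error

// alwaysOK and notSynced are stand-in checks for the demo.
func alwaysOK(_ *http.Request) error { return nil }

func notSynced(_ *http.Request) error {
	return errors.New("informer cache not yet synced")
}

// runChecks (hypothetical) runs every named check and reports overall
// health plus the sorted names of the checks that failed.
func runChecks(checks map[string]checker, req *http.Request) (bool, []string) {
	var failed []string
	for name, check := range checks {
		if err := check(req); err != nil {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed) // map iteration order is random; sort for stable output
	return len(failed) == 0, failed
}

func main() {
	ok, failed := runChecks(map[string]checker{
		"ping":                alwaysOK,
		"informer-cache-sync": notSynced,
	}, nil)
	fmt.Println(ok, failed) // false [informer-cache-sync]
}
```

One failing check is enough to fail the whole endpoint, which is why optional dependencies should stay out of /readyz.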
Step 4: Adding specific checks
Informer cache sync
The single most important readiness check. Without it, the new replica is "ready" before it can actually answer Get/List from a warm cache:
mgr.AddReadyzCheck("informer-cache-sync", func(req *http.Request) error {
if !mgr.GetCache().WaitForCacheSync(req.Context()) {
return fmt.Errorf("informer cache not synced")
}
return nil
})

WaitForCacheSync returns true once every registered informer
has populated its initial cache. This is the simplest workable
version, but it has a subtle problem in production: the first
call blocks until sync (or the request context expires). With a
timeoutSeconds: 1 probe and a slow cluster, the kubelet sees a
timeout instead of a clean "not yet ready" 500, and the failure
mode is harder to diagnose.
Production-grade cache-sync check
The fix is to do the wait once, in a background Runnable,
and have the probe handler read a flag:
var cacheReady atomic.Bool
if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
if !mgr.GetCache().WaitForCacheSync(ctx) {
return fmt.Errorf("cache failed to sync")
}
cacheReady.Store(true)
<-ctx.Done()
return nil
})); err != nil {
return err
}
if err := mgr.AddReadyzCheck("informer-cache-sync", func(_ *http.Request) error {
if !cacheReady.Load() {
return fmt.Errorf("informer cache not yet synced")
}
return nil
}); err != nil {
return err
}

The probe handler is now O(1) and never blocks, so a probe only
fails when the cache is genuinely not synced, never because the
handler itself ran past timeoutSeconds under load. This is the
cache-sync readiness pattern most mature operators ship.
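The same flag-flip idea can be tried outside a cluster. The sketch below uses only the standard library: `cacheGate` and its `wait` field are hypothetical stand-ins for the Runnable and for `mgr.GetCache().WaitForCacheSync`. It also adds a `NeedLeaderElection` method returning false. To the best of my understanding, the manager treats runnables without that method as leader-only, so with leader election enabled a plain `RunnableFunc` would never flip the flag on a standby. Treat that as an assumption to verify against your controller-runtime version.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// cacheGate pairs a one-time background wait with a non-blocking check.
type cacheGate struct {
	ready atomic.Bool
	wait  func(context.Context) bool // stand-in for WaitForCacheSync
}

// Start blocks until wait succeeds, flips the flag, then idles until
// the context is cancelled.
func (g *cacheGate) Start(ctx context.Context) error {
	if !g.wait(ctx) {
		return errors.New("cache failed to sync")
	}
	g.ready.Store(true)
	<-ctx.Done()
	return nil
}

// NeedLeaderElection returning false asks to run on every replica.
func (g *cacheGate) NeedLeaderElection() bool { return false }

// Check only reads the flag, so it never blocks a probe.
func (g *cacheGate) Check(_ *http.Request) error {
	if !g.ready.Load() {
		return errors.New("informer cache not yet synced")
	}
	return nil
}

func main() {
	synced := make(chan struct{})
	gate := &cacheGate{wait: func(ctx context.Context) bool {
		select {
		case <-synced:
			return true
		case <-ctx.Done():
			return false
		}
	}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go gate.Start(ctx)

	fmt.Println("before sync:", gate.Check(nil))
	close(synced) // simulate the informers finishing their initial list
	for gate.Check(nil) != nil {
		time.Sleep(time.Millisecond)
	}
	fmt.Println("after sync:", gate.Check(nil))
}
```

In an operator you would register the gate with `mgr.Add(gate)` and pass `gate.Check` to `mgr.AddReadyzCheck`.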
Webhook readiness
If your operator serves admission webhooks, the webhook server
must have its certificate loaded and be listening before traffic
can flow:
mgr.AddReadyzCheck("webhook-server", mgr.GetWebhookServer().StartedChecker())

StartedChecker() returns nil once the server is listening. Until
then the readiness probe fails, and the operator stays out of the
Service endpoints, so API server admission requests go elsewhere
(or fail-closed, depending on
the webhook's failurePolicy).
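The "webhook cert loaded" row from the readiness table can be covered by a separate check. Here is a minimal sketch, assuming the certificate and key are mounted as files. The `certCheck` helper is hypothetical, and the path in `main` is an assumption you should replace with your own mount point.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// certCheck (hypothetical) returns a probe check that fails until the key
// pair at the given paths exists and parses as a valid certificate/key pair.
func certCheck(certFile, keyFile string) func(*http.Request) error {
	return func(_ *http.Request) error {
		if _, err := tls.LoadX509KeyPair(certFile, keyFile); err != nil {
			return fmt.Errorf("webhook cert not loadable: %w", err)
		}
		return nil
	}
}

func main() {
	// Assumed mount path; adjust to wherever your cert Secret is mounted.
	check := certCheck(
		"/tmp/k8s-webhook-server/serving-certs/tls.crt",
		"/tmp/k8s-webhook-server/serving-certs/tls.key",
	)
	if err := check(nil); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Println("ready")
}
```

This version re-reads the files on every probe, which is cheap at a 10 s period. If that matters, the result can be cached behind a flag in the same way as the cache-sync check.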
External dependency
If your operator needs an external API to function (a license server, a configuration backend), gate readiness on it:
mgr.AddReadyzCheck("license-server", func(req *http.Request) error {
	// Bound the call so a slow dependency cannot outlast the probe's timeoutSeconds.
	ctx, cancel := context.WithTimeout(req.Context(), 2*time.Second)
	defer cancel()
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, licenseURL, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("license server returned %d", resp.StatusCode)
	}
	return nil
})

Caveat: this only makes sense if your operator truly cannot
work without the dependency. A transient outage of a "nice to have"
external service should not cause your readiness probe to fail.
Treat that as a per-reconcile error you surface in
.status.conditions, not as
a pod-level "I'm not ready" signal.
Step 5: Probe timing settings
The four timing knobs in the probe spec:
| Field | Default | Recommended for operators |
|---|---|---|
| initialDelaySeconds | 0 | 15 (liveness), 5 (readiness) |
| periodSeconds | 10 | 20 (liveness), 10 (readiness) |
| timeoutSeconds | 1 | 3 |
| failureThreshold | 3 | 3 (multiplied by periodSeconds, this gives the total grace time) |
successThreshold stays at the default 1; raising it only
introduces flapping. The interesting interaction is between
periodSeconds and failureThreshold: those two values multiplied
give you the grace window before kubelet acts. A liveness probe with
periodSeconds: 20 and failureThreshold: 3 gives the operator 60 seconds of "broken" before a restart, which is almost always the right trade.
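The grace-window arithmetic is simple enough to check by hand, but a tiny helper makes it easy to compare configurations. This is a sketch only: `graceSeconds` is a hypothetical name, and it deliberately ignores initialDelaySeconds and timeoutSeconds, which add to the real-world window.

```go
package main

import "fmt"

// graceSeconds returns roughly how long a container can keep failing a
// probe before kubelet acts: failureThreshold consecutive failures,
// spaced periodSeconds apart.
func graceSeconds(periodSeconds, failureThreshold int) int {
	return periodSeconds * failureThreshold
}

func main() {
	fmt.Println("liveness: ", graceSeconds(20, 3)) // 60
	fmt.Println("readiness:", graceSeconds(10, 3)) // 30
	fmt.Println("defaults: ", graceSeconds(10, 3)) // 30
}
```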
Why those defaults are not enough:
- initialDelaySeconds: 0: kubelet starts probing immediately. For an operator with multiple CRD informers, the cache needs 5–15 seconds to warm up. Without an initial delay, a liveness check that is not yet passing fails three times in roughly 30 seconds (3 failures × 10 s) and kubelet restarts the container, which can turn into a loop.
- timeoutSeconds: 1: one second is tight for a check that iterates informer state. A flaky 800 ms response time causes spurious failures.
A good starting point:
livenessProbe:
  httpGet: { path: /healthz, port: 8081 }
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /readyz, port: 8081 }
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

Adjust upward if the operator legitimately takes longer to warm up (many CRDs, large clusters with slow caches).
Step 6: Interaction with leader election
A subtle question for operators that run with multiple replicas and leader election: should standby (non-leader) replicas be ready or not ready?
Both standbys and leader should report ready because:
- Standbys serve Prometheus metrics — scrapes don't care which replica holds the lease, and you'd rather not lose a data point during failover.
- Standbys may serve webhooks (controller-runtime starts the webhook server on every replica; not all webhook implementations gate on leadership).
- Marking standbys "not ready" creates a no-traffic state during the ~15 s failover window — making the failover less graceful.
If you want leader-aware behaviour, do it inside specific endpoints
or mgr.Add(runnable)-registered workers, not in the readiness
probe.
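As a rough illustration of leader-aware work living in a runnable instead of in the probe, here is a stdlib-only sketch. `leaderOnlyWorker` is a hypothetical type. Its two methods are shaped to match what I understand to be controller-runtime's `manager.Runnable` (`Start`) and `manager.LeaderElectionRunnable` (`NeedLeaderElection`) interfaces, so in an operator it could be passed to `mgr.Add`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// leaderOnlyWorker is a hypothetical periodic task that should run only
// on the replica holding the lease.
type leaderOnlyWorker struct {
	interval time.Duration
	work     func()
}

// Start runs work on every tick until ctx is cancelled.
func (w *leaderOnlyWorker) Start(ctx context.Context) error {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			w.work()
		}
	}
}

// NeedLeaderElection asks the manager to start this worker only on the leader.
func (w *leaderOnlyWorker) NeedLeaderElection() bool { return true }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	runs := 0
	w := &leaderOnlyWorker{interval: 10 * time.Millisecond, work: func() { runs++ }}
	_ = w.Start(ctx) // in an operator you would call mgr.Add(w) instead
	fmt.Println("leader-only:", w.NeedLeaderElection(), "ticks:", runs)
}
```

Readiness stays untouched: the standby reports ready, and the manager simply does not start this worker until the replica wins the lease.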
The one exception: if your operator's /metrics endpoint reports
per-CR state that the standby cannot compute (no informers), you
might want metrics to come only from the leader. In that case, gate
the metrics endpoint via a separate Service selector — not via
readiness.
Step 7: Diagnosing probe failures
When kubelet keeps restarting your pod, check three things:
1. The probe response itself
kubectl -n memcached-operator-system port-forward deploy/memcached-operator-controller-manager 8081
curl -i 'http://localhost:8081/readyz?verbose'

This tells you which check is failing and why. If everything returns ok manually but kubelet sees failures, the issue is timing: the probe is too aggressive or the operator startup is too slow.
2. The kubelet events
kubectl describe pod -n memcached-operator-system <pod-name>

Look for:
Warning Unhealthy ... Readiness probe failed: HTTP probe failed with statuscode: 500

A status code in the event means an explicit failure (a check returned an error). A timeout or connection error with no status code means no response, which usually points at network or firewall problems.
3. The startup logs
kubectl logs -n memcached-operator-system <pod-name> --previous=true | head -30

The first 30 lines often reveal the startup issue. If you see
unable to start manager: Get "/api/v1/...", the operator never
got to the AddHealthzCheck line: its API setup failed and the
probes were never registered.
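When the verbose output from item 1 is long, or you want to assert on it in a smoke test, the failing checks can be picked out programmatically. The sketch below is an assumption-laden example: `failedChecks` is a hypothetical helper, and it assumes failing lines start with `[-]`, mirroring the `[+]` prefix shown for passing checks.

```go
package main

import (
	"fmt"
	"strings"
)

// failedChecks extracts the names of failing checks from a ?verbose body.
// Assumption: failing lines start with "[-]" followed by the check name.
func failedChecks(body string) []string {
	var failed []string
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "[-]") {
			continue
		}
		// The first whitespace-separated token after the prefix is the name.
		fields := strings.Fields(strings.TrimPrefix(line, "[-]"))
		if len(fields) > 0 {
			failed = append(failed, fields[0])
		}
	}
	return failed
}

func main() {
	body := "[+]ping ok\n" +
		"[-]informer-cache-sync failed: reason withheld\n" +
		"[+]webhook-cert ok\n" +
		"readyz check failed\n"
	fmt.Println(failedChecks(body)) // [informer-cache-sync]
}
```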
Step 8: A complete production probe block
For a typical operator with webhooks:
// main.go - end of setup
var cacheReady atomic.Bool
if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
if !mgr.GetCache().WaitForCacheSync(ctx) {
return fmt.Errorf("cache failed to sync")
}
cacheReady.Store(true)
<-ctx.Done()
return nil
})); err != nil {
return err
}
if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
return err
}
if err := mgr.AddReadyzCheck("informer-cache-sync", func(_ *http.Request) error {
if !cacheReady.Load() {
return fmt.Errorf("informer cache not yet synced")
}
return nil
}); err != nil {
return err
}
if err := mgr.AddReadyzCheck("webhook-server", mgr.GetWebhookServer().StartedChecker()); err != nil {
return err
}

# manager.yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: manager
    livenessProbe:
      httpGet: { path: /healthz, port: 8081 }
      initialDelaySeconds: 15
      periodSeconds: 20
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet: { path: /readyz, port: 8081 }
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe: # optional, for very slow starts
      httpGet: { path: /readyz, port: 8081 }
      periodSeconds: 5
      failureThreshold: 30 # 5 × 30 = 150 s budget for startup

startupProbe is the cleanest way to handle "this operator legitimately takes 90 seconds to start", which is common for
operators that import many CRDs or wait on a slow external
backend. While startupProbe is running, liveness and readiness
are suppressed entirely; only when startup succeeds does normal
probing begin.
If you enable startupProbe, drop initialDelaySeconds on the other two probes back to 0 (or remove the field). The startup probe already gates them, and a 15 s delay on top is dead time for nothing. The block above sets initialDelaySeconds on liveness/readiness so it works with or without the startup probe; in production you usually pick one strategy.
Reach for startupProbe only if 15 s of initialDelaySeconds isn't
enough.
Common pitfalls
1. No readiness probe at all
New replicas are "ready" before informers sync. Rolling deploys have a brief no-synced-replica window. Fix: add cache sync to readiness.
2. Liveness check that calls the API server
API hiccup → every operator pod restarts simultaneously → cluster
makes the outage worse. Fix: keep liveness local; use
healthz.Ping.
3. initialDelaySeconds: 0 on a slow-starting operator
Liveness fires before the operator finishes initialising, kubelet
restarts the container, repeat. Stuck in CrashLoopBackOff. Fix:
add initialDelaySeconds: 15-30 or use a startupProbe.
4. Single shared check on both probes
Liveness and readiness use the same function. Now any transient
issue restarts the container instead of just removing it from
Service. Fix: separate the checks. healthz.Ping for liveness;
informer-sync + webhook-readiness for readiness.
5. Readiness depends on cluster-wide state
Your readiness probe checks "is the leader pod alive?" — but on the leader pod, that depends on the standbys' health. Cycle. Fix: readiness should depend only on this pod's state.
Pitfall cheat sheet
| Symptom | Root cause | Fix |
|---|---|---|
| Pod stuck in CrashLoopBackOff immediately after deploy | initialDelaySeconds: 0 on an operator that needs 10–15 s to warm up | Set initialDelaySeconds: 15 on liveness, or add a startupProbe |
| Every operator pod in the cluster restarts during an API-server blip | Liveness check calls the Kubernetes API | Replace with healthz.Ping; move API-dependent checks to readiness |
| Rolling deploy briefly has zero synced replicas | No readiness probe, or /readyz returns 200 before cache sync | Register the flag-flip cache sync readiness probe from Step 4 |
| kubectl describe pod shows Readiness probe failed: timeout under load | Probe handler blocks (e.g. calls WaitForCacheSync directly) | Switch to the production-grade pattern: handler reads an atomic.Bool |
| Admission webhook returns "connection refused" right after a deploy | No webhook-server readiness check, traffic routes to a pod before TLS is loaded | Add mgr.GetWebhookServer().StartedChecker() to readiness |
| Standby replica drops out of Prometheus scrapes during failover | Readiness gated on leadership | Both leader and standby should report ready; gate leader-only work inside mgr.Add(runnable) instead |
Frequently Asked Questions
1. What is the difference between liveness and readiness probes?
Liveness asks "is this pod alive?" If it fails, kubelet restarts the container. Use it to detect deadlocks or stuck goroutines. Readiness asks "is this pod ready to do work?" If it fails, the pod stays running but is removed from Service endpoints. Use it to gate traffic until startup is complete. For an operator with no external traffic, readiness mainly gates the rolling-deploy strategy.
2. What does controller-runtime expose by default?
Manager.HealthProbeBindAddress serves two endpoints once it is set: /healthz for liveness and /readyz for readiness. The library default is "0" (probe server disabled), but the kubebuilder scaffold sets it to :8081. Both endpoints return 200 OK with no checks registered; you add specific checks with mgr.AddHealthzCheck and mgr.AddReadyzCheck. Without explicit checks they only mean "the process is up".
3. How do I add a custom health check?
Call mgr.AddHealthzCheck(name, checker) or mgr.AddReadyzCheck(name, checker) before mgr.Start. A checker is a func(req *http.Request) error - returning nil means healthy, returning an error makes the endpoint return 500 and the kubelet mark the probe failed. The simplest checker is healthz.Ping from sigs.k8s.io/controller-runtime/pkg/healthz; a richer one looks up state in your operator (cache sync flag, last reconcile timestamp, webhook cert presence).
4. When does an operator pod need a readiness probe?
Always - even though operators have no external traffic, readiness affects rolling deploys. Without a readiness probe, the new replica is added to "ready" pods the moment its container starts, even before its informer cache has synced. Kubernetes thinks the rolling deploy succeeded, terminates the old replica, and you get a brief window with no synced operator.
5. What checks should be in /readyz?
Cache sync (informers are synced), webhook server and certificate availability (if webhooks are enabled), and external dependency reachability if your operator cannot work without an external API. Leader election is not a readiness check: the active leader and the standbys should both report ready.
6. What checks should be in /healthz?
Things that justify restarting the pod. The simplest is healthz.Ping - just answer "yes I am alive". More advanced: a "no reconcile in N minutes" check that detects a deadlocked workqueue. Avoid putting expensive checks in /healthz; the probe runs frequently and a slow check delays detection.
7. How do I tune probe timing to avoid restart loops?
initialDelaySeconds (default 0): how long to wait after container start before the first probe - set to cover slow-start scenarios (e.g. 15s for an operator with many CRDs). periodSeconds (default 10): how often to probe. failureThreshold (default 3): consecutive failures before kubelet acts. timeoutSeconds (default 1): how long each probe can take. A common starting point for liveness: 15s initial delay, 20s period, 3s timeout, 3 failures (for readiness: 5s initial delay, 10s period).
8. Should liveness probe call the Kubernetes API?
No - your liveness probe should be local to the pod. If liveness depends on the API server, an API server outage causes all your operator pods to restart simultaneously, making the outage worse. Use healthz.Ping for liveness and put API-dependent checks in readiness only.
9. Should I use a startupProbe for my operator?
Only if 15-30 seconds of initialDelaySeconds on the liveness probe isn't enough. A startupProbe suppresses both liveness and readiness until it succeeds, which is the cleanest way to handle slow-starting operators (many CRDs, large clusters, slow external backends). When you use it, drop initialDelaySeconds on liveness/readiness back to 0 - the startup probe already gates them, and adding more delay is dead time for nothing. The typical budget is periodSeconds: 5 × failureThreshold: 30 = 150 seconds of startup grace.
10. Why does my readiness probe time out under load?
The most common cause is a probe handler that itself blocks - for example, func(req *http.Request) error { mgr.GetCache().WaitForCacheSync(req.Context()); ... }. The first call blocks until the cache syncs, which can exceed the timeoutSeconds: 1 default, and kubelet sees a timeout instead of a clean 500. Fix: do the wait once in a background manager.RunnableFunc, set an atomic.Bool when it returns, and have the probe handler just read the flag in O(1). Bumping timeoutSeconds is a workaround, not a fix.
Summary
Liveness and readiness probes are how Kubernetes decides when to
restart your operator and when to send it traffic. The
controller-runtime Manager
provides the endpoints; you register the checks. Default to
healthz.Ping on liveness, the flag-flipped cache-sync check on /readyz, and the webhook server's StartedChecker if
you serve admission webhooks. Tune initialDelaySeconds to cover
real startup time (or reach for a startupProbe), never put
cross-pod or API-server checks in
liveness, and verify behaviour with kubectl describe and
curl /readyz?verbose.
Done correctly, probes give you clean rolling deploys, fast detection of stuck operators, and no restart cascades during cluster hiccups. Done wrong, they amplify outages and turn flaky pods into permanent crash loops.
Further reading
- Operator RBAC minimum permissions — the next article in the series: tightening the ServiceAccount permissions probe checks may depend on.
- Controller-runtime architecture — where the Manager, cache, webhook server, and health probe bind-address all live together.
- Operator leader election explained — why standbys should still report ready, and the failover window probes need to span gracefully.
- Operator metrics with Prometheus: the /metrics endpoint that runs alongside /healthz and /readyz on the same Manager.
- The reconcile loop explained: the "no progress" liveness check from Step 2 only makes sense once you've seen how the workqueue and reconcile contract work.
- Status subresource and Conditions — the right place to surface transient external-dependency failures (instead of failing readiness).
- Mutating & validating admission webhooks: pairs with the webhook server StartedChecker readiness check.
- Kubernetes Operator Tutorial (full course hub): the full series; this article lives in the operability chapter alongside metrics, leader election, and RBAC.
- External: Pod liveness/readiness probe docs, container lifecycle hooks, and controller-runtime healthz package.

