Operator Health and Readiness Probes: /healthz, /readyz, AddHealthzCheck

A practical guide to liveness and readiness probes for Kubernetes operators: what `/healthz` and `/readyz` should report, how to register custom checks with `mgr.AddHealthzCheck` and `mgr.AddReadyzCheck`, the difference between liveness and readiness semantics, probe timing settings that prevent restart loops, and diagnostic checks for cache sync, leader status, webhook certificates, and external dependencies.

Published May 30, 2026

Updated Jun 4, 2026

Author Deepak Prasad

Read time 16 min

Reviewed Jun 4, 2026 by Deepak Prasad

Kubelet decides when to restart your operator pod, and Kubernetes decides when to send traffic to it, by polling two HTTP endpoints your operator exposes: /healthz and /readyz. Controller-runtime gives you the HTTP server for free; what you put behind those endpoints determines whether your operator survives transient hiccups, deploys cleanly, and stays out of restart loops.

This article covers the difference between liveness and readiness, the checks worth adding to each, the probe timing settings that prevent restart cascades, and concrete checks for cache sync, leader election, webhook certs, and external dependencies.

TL;DR — operator probes in 60 seconds

kubebuilder scaffolds the right defaults already. The two things worth verifying:

go


// main.go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    HealthProbeBindAddress: ":8081",  // serve /healthz and /readyz on port 8081
})

// At the end of main, before mgr.Start:
if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
    return err
}
if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
    return err
}

yaml


# config/manager/manager.yaml
spec:
  template:
    spec:
      containers:
      - name: manager
        livenessProbe:
          httpGet: { path: /healthz, port: 8081 }
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet: { path: /readyz, port: 8081 }
          initialDelaySeconds: 5
          periodSeconds: 10

Then add specific checks for your operator's dependencies (cache sync, webhook certs, external APIs). The rest of this article walks through which checks go where, the production-grade cache-sync pattern, the probe timing knobs that prevent restart loops, and the pitfalls that most often reach real clusters.

A quick analogy: the security guard's clipboard

Imagine a security guard at a building entrance with two checklists:

"Is the building on fire?" — yes/no, asked every minute. If yes, evacuate (i.e. restart the building).
"Are the doors unlocked and the elevators working?" — yes/no, asked every minute. If no, redirect visitors to another entrance (i.e. remove from Service endpoints) but don't burn the building down.

That's exactly the split between liveness and readiness:

Guard's check	Kubernetes probe
Is the building on fire?	Liveness probe (`/healthz`)
Are the doors and elevators working?	Readiness probe (`/readyz`)
Evacuate the building	Restart the container
Redirect visitors elsewhere	Remove from `Endpoints` / no new traffic

A building can fail readiness (elevator out) without being on fire — the right response is to redirect, not evacuate. A building can fail liveness (something is genuinely broken beyond recovery) — the right response is full evacuation and a fresh start.

Figure: liveness vs readiness as a security guard holding two clipboards. The red "Liveness /healthz" clipboard asks "Is the building on fire?" and leads to evacuation (restart the container). The green "Readiness /readyz" clipboard asks "Are doors and elevators working?" and leads to redirecting visitors (remove from Service endpoints).

Prerequisites

A scaffolded operator with the default kubebuilder / Operator SDK main.go — that is where HealthProbeBindAddress is wired up.
Familiarity with the controller-runtime architecture — probes plug into the Manager, alongside the cache and webhook server.
Optional: leader election knowledge if you want lease-aware readiness (see Step 6).

Why operator probes matter

An operator usually has no external HTTP traffic, so it is tempting to leave probes at "200 OK on both endpoints and call it done." Three reasons that's a bad idea:

1. Rolling deploys depend on readiness, even with zero traffic

Kubernetes marks a new replica "ready" the moment its readiness probe passes — and that is the trigger to terminate the old replica. With a flat /readyz (always 200), the new pod is "ready" before its informer cache has synced, the old pod is gone, and the cluster spends 5–15 seconds with no operator that can correctly answer Get/List from cache. Drift correction stalls and reconcile loops silently miss work during the window. The fix is a proper cache sync readiness probe, covered in Step 4.

2. Liveness can amplify a control-plane outage

A liveness check that calls the Kubernetes API server feels sensible: "if I can't talk to the API, kill me." But every operator pod that uses such a check fails it at the same moment. When the API hiccups, all of those pods restart together, and the restart cascade turns a 10-second blip into a multi-minute recovery. Liveness must be local to the pod, usually nothing more than healthz.Ping.

3. Admission webhooks fail-closed without readiness

If your operator serves admission webhooks, those are HTTP traffic — and the API server starts sending admission requests the moment kubelet flips your pod to "ready." A flat readiness probe means webhook calls fail (TLS not yet loaded, informers not synced) and your failurePolicy: Fail webhook blocks every cluster write until the pod warms up. Gating readiness on mgr.GetWebhookServer().StartedChecker() and on cache sync makes this contract correct.

The rest of this article walks through the checks that make these three guarantees real.

Step 1: What `/healthz` and `/readyz` look like

After mgr.Start, the operator pod serves:

bash


# Run the port-forward in a separate terminal; it blocks until interrupted.
kubectl -n memcached-operator-system port-forward deploy/memcached-operator-controller-manager 8081

curl http://localhost:8081/healthz
# ok

curl http://localhost:8081/readyz
# ok

# Quote the URL so the shell does not treat "?" as a glob character.
curl 'http://localhost:8081/readyz?verbose'
# [+]ping ok
# [+]informer-cache-sync ok
# [+]webhook-server ok
# readyz check passed

The ?verbose query parameter lists every registered check and its result — invaluable for debugging "why does my probe fail?"

With the scaffold's :8081 setting the probe server listens on all interfaces (0.0.0.0:8081); if HealthProbeBindAddress is left unset, the server is not started at all. Keep it on all interfaces: kubelet's httpGet probe connects to the pod IP from the node, so a listener bound only to 127.0.0.1 inside the pod is unreachable and every probe fails.
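To make the endpoint behaviour concrete, here is a minimal standard-library sketch of an aggregated health endpoint: every named check runs, the response is 200 only if all pass, and ?verbose lists each result. It is a simplified stand-in for illustration, not controller-runtime's actual handler; the names checker, aggregate, pass, fail, and status are invented for this example.

```go
package main

import (
    "errors"
    "fmt"
    "net/http"
    "net/http/httptest"
    "sort"
    "strings"
)

// checker has the same shape as a health check in this article:
// nil means healthy, any error means failed.
type checker func(*http.Request) error

func pass(*http.Request) error { return nil }
func fail(*http.Request) error { return errors.New("not synced") }

// aggregate runs every named check and answers 200 only if all pass.
func aggregate(checks map[string]checker) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        names := make([]string, 0, len(checks))
        for name := range checks {
            names = append(names, name)
        }
        sort.Strings(names) // stable output order

        var lines []string
        failed := false
        for _, name := range names {
            if err := checks[name](r); err != nil {
                failed = true
                lines = append(lines, "[-]"+name+" failed")
            } else {
                lines = append(lines, "[+]"+name+" ok")
            }
        }

        _, verbose := r.URL.Query()["verbose"]
        switch {
        case failed:
            // This sketch always lists per-check results on failure.
            http.Error(w, strings.Join(lines, "\n"), http.StatusInternalServerError)
        case verbose:
            fmt.Fprintln(w, strings.Join(lines, "\n"))
        default:
            fmt.Fprint(w, "ok")
        }
    }
}

// status sends an in-memory GET to the handler and returns the status code.
func status(h http.Handler, target string) int {
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
    return rec.Code
}

func main() {
    healthy := aggregate(map[string]checker{"ping": pass})
    broken := aggregate(map[string]checker{"ping": pass, "informer-cache-sync": fail})
    fmt.Println(status(healthy, "/readyz?verbose"), status(broken, "/readyz"))
}
```

The point to take away is that one failing check is enough to turn the whole endpoint into a 500, which is why each check you register has to be cheap and meaningful.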

Step 2: Liveness — when to restart

Liveness probes ask: "is this pod dead and needs a kick?" Failing liveness gets you a container restart (and a stuck-in-restart-loop risk if the failure is permanent).

What belongs in /healthz:

healthz.Ping — answers "yes, I am alive". The minimum.
A "no progress" detector — "if no reconcile has completed in the last N minutes, something is wrong". Useful but tricky; N must exceed the longest expected idle period.

What does not belong:

Kubernetes API calls. If the API server hiccups, all your liveness probes fail simultaneously, all your operator pods restart, the outage gets worse. Liveness must be local.
Database / external API checks. Same reason. A flaky external service should not restart your operator; it should fail readiness (if you care) or be silent.
Cache sync. Cache sync is a startup signal, not a "kill me" signal. Goes in readiness.

A "no progress" liveness check is a more advanced pattern: it ties liveness to actual reconcile activity rather than to "the process is still running". A deadlocked workqueue can leave the HTTP server happily returning 200 long after the operator has stopped doing useful work:

go


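// Imports needed: fmt, net/http, sync, time.

// lastReconcileCheck fails liveness when no reconcile has succeeded recently.
type lastReconcileCheck struct {
    mu            sync.RWMutex
    lastReconcile time.Time
}

// newLastReconcileCheck starts the clock at construction time. With a zero
// time.Time the check would fail immediately, before the first reconcile.
func newLastReconcileCheck() *lastReconcileCheck {
    return &lastReconcileCheck{lastReconcile: time.Now()}
}

// Touch records a successful reconcile.
func (c *lastReconcileCheck) Touch() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.lastReconcile = time.Now()
}

// Check has the healthz checker signature.
func (c *lastReconcileCheck) Check(_ *http.Request) error {
    c.mu.RLock()
    defer c.mu.RUnlock()
    if idle := time.Since(c.lastReconcile); idle > 30*time.Minute {
        return fmt.Errorf("no reconcile in %s", idle.Round(time.Second))
    }
    return nil
}

// In main: share one instance with the reconciler, which calls Touch() on success.
check := newLastReconcileCheck()
if err := mgr.AddHealthzCheck("reconcile-progress", check.Check); err != nil {
    return err
}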

Use sparingly — it adds operational complexity. For most operators healthz.Ping is enough.
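If the idle-period problem makes a time-since-last-reconcile check too risky, one alternative worth considering is a different technique: fail only while a reconcile has been in flight for too long. The sketch below is a hedged illustration using only the standard library; stuckDetector, Begin, and End are hypothetical names, and a long-running reconcile is not always a deadlock, so the limit needs the same care as N above. The clock is injected as Unix seconds so the logic can be unit-tested.

```go
package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

// stuckDetector fails only while some reconcile has been running longer
// than maxSeconds. An idle operator (nothing in flight) always passes.
type stuckDetector struct {
    mu         sync.Mutex
    maxSeconds int64
    now        func() int64    // Unix seconds; injectable for tests
    inFlight   map[int64]int64 // token -> start time
    next       int64
}

func newStuckDetector(maxSeconds int64, now func() int64) *stuckDetector {
    return &stuckDetector{maxSeconds: maxSeconds, now: now, inFlight: map[int64]int64{}}
}

// Begin records the start of one reconcile and returns a token for End.
func (d *stuckDetector) Begin() int64 {
    d.mu.Lock()
    defer d.mu.Unlock()
    d.next++
    d.inFlight[d.next] = d.now()
    return d.next
}

// End marks the reconcile identified by token as finished.
func (d *stuckDetector) End(token int64) {
    d.mu.Lock()
    defer d.mu.Unlock()
    delete(d.inFlight, token)
}

// Check has the checker signature, so it can be registered on /healthz.
func (d *stuckDetector) Check(_ *http.Request) error {
    d.mu.Lock()
    defer d.mu.Unlock()
    for _, started := range d.inFlight {
        if age := d.now() - started; age > d.maxSeconds {
            return fmt.Errorf("reconcile running for %ds (limit %ds)", age, d.maxSeconds)
        }
    }
    return nil
}

func main() {
    d := newStuckDetector(1800, func() int64 { return time.Now().Unix() })
    token := d.Begin() // would be the first line of Reconcile
    d.End(token)       // would be deferred in Reconcile
    fmt.Println(d.Check(nil))
}
```

This variant never fires on a quiet cluster, but it also cannot see a workqueue that has stopped handing out items, so it complements the timestamp check rather than replacing it.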

Step 3: Readiness — when to send traffic

Readiness probes ask: "is this pod ready to do its job?" Failing readiness removes the pod from Service endpoints but leaves it running.

For an operator that doesn't serve HTTP traffic, why bother with readiness? Two reasons:

Rolling deploys. Kubernetes considers a new replica "ready" when its readiness probe passes. With a flat readiness probe (always 200), the new pod is "ready" the moment its container starts — before informers have synced. Kubernetes terminates the old pod, you get a brief window where no replica has a synced cache, and reconciles stall.
Admission webhooks. If your operator serves webhooks, those are HTTP traffic. Webhook calls fail until certificates are loaded and informers are synced. Readiness gates that traffic.

What belongs in /readyz:

Check	Why
`healthz.Ping`	Basic process liveness; served implicitly only while no other check is registered, so add it explicitly
Informer cache sync	The operator can answer Get/List from cache
Webhook cert loaded	If admission webhooks are enabled
Webhook server ready	If admission webhooks are enabled
External dependency reachable	Only if the operator literally cannot work without it

The first two are mandatory for any operator. The latter three depend on the operator's design.

Step 4: Adding specific checks

Informer cache sync

The single most important readiness check. Without it, the new replica is "ready" before it can actually answer Get/List from a warm cache:

go


mgr.AddReadyzCheck("informer-cache-sync", func(req *http.Request) error {
    if !mgr.GetCache().WaitForCacheSync(req.Context()) {
        return fmt.Errorf("informer cache not synced")
    }
    return nil
})

WaitForCacheSync returns true once every registered informer has populated its initial cache. This is the simplest workable version — but it has a subtle problem in production: the first call blocks until sync (or the request context expires). With a timeoutSeconds: 1 probe and a slow cluster, the kubelet sees a timeout instead of a clean "not yet ready" 500, and the failure mode is harder to diagnose.

Production-grade cache-sync check

The fix is to do the wait once, in a background Runnable, and have the probe handler read a flag:

go


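// Package level: waits once for the informer cache, then flips ready.
type cacheSyncWaiter struct {
    cache cache.Cache // sigs.k8s.io/controller-runtime/pkg/cache
    ready *atomic.Bool
}

func (w *cacheSyncWaiter) Start(ctx context.Context) error {
    if !w.cache.WaitForCacheSync(ctx) {
        return fmt.Errorf("cache failed to sync")
    }
    w.ready.Store(true)
    <-ctx.Done()
    return nil
}

// NeedLeaderElection returns false so the waiter also runs on standbys. With
// leader election enabled, a plain manager.RunnableFunc starts only on the
// elected leader, which would leave every standby permanently not ready.
func (w *cacheSyncWaiter) NeedLeaderElection() bool { return false }

// In main, before mgr.Start:
var cacheReady atomic.Bool

if err := mgr.Add(&cacheSyncWaiter{cache: mgr.GetCache(), ready: &cacheReady}); err != nil {
    return err
}

if err := mgr.AddReadyzCheck("informer-cache-sync", func(_ *http.Request) error {
    if !cacheReady.Load() {
        return fmt.Errorf("informer cache not yet synced")
    }
    return nil
}); err != nil {
    return err
}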

The probe handler is now O(1) and never blocks, so a probe under load gets a fast 200 or 500 instead of a timeout, and failureThreshold counts real failures rather than slow responses.
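The difference between the two handler styles is easy to reproduce without a cluster. The following sketch simulates a cache that needs 500 ms to warm up and probes both styles with a 100 ms client timeout standing in for timeoutSeconds. Everything here is an assumption made for illustration: the durations are scaled down, and probe and demo are invented names, not kubelet or controller-runtime code.

```go
package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "net/http/httptest"
    "sync/atomic"
    "time"
)

// probe acts like an HTTP probe with a hard timeout and reports
// "timeout", "error", or the status code as text.
func probe(url string, timeout time.Duration) string {
    client := &http.Client{Timeout: timeout}
    resp, err := client.Get(url)
    if err != nil {
        var ne net.Error
        if errors.As(err, &ne) && ne.Timeout() {
            return "timeout"
        }
        return "error"
    }
    defer resp.Body.Close()
    return fmt.Sprint(resp.StatusCode)
}

// demo probes a blocking handler and a flag-reading handler while a
// simulated cache sync is still in progress, then again after it finishes.
func demo() (blocking, flagBefore, flagAfter string) {
    const syncTime, timeout = 500 * time.Millisecond, 100 * time.Millisecond

    synced := make(chan struct{})
    var ready atomic.Bool
    go func() { // the one-time background wait
        time.Sleep(syncTime)
        ready.Store(true)
        close(synced)
    }()

    mux := http.NewServeMux()
    mux.HandleFunc("/blocking", func(w http.ResponseWriter, r *http.Request) {
        select { // waits inside the handler, like calling WaitForCacheSync there
        case <-synced:
        case <-r.Context().Done():
        }
    })
    mux.HandleFunc("/flag", func(w http.ResponseWriter, r *http.Request) {
        if !ready.Load() {
            http.Error(w, "informer cache not yet synced", http.StatusInternalServerError)
        }
    })
    srv := httptest.NewServer(mux)
    defer srv.Close()

    blocking = probe(srv.URL+"/blocking", timeout)
    flagBefore = probe(srv.URL+"/flag", timeout)
    <-synced
    flagAfter = probe(srv.URL+"/flag", timeout)
    return
}

func main() {
    fmt.Println(demo())
}
```

The blocking handler produces a timeout, while the flag handler answers with a clean 500 before the sync and 200 after it, which is the behaviour you want kubelet to see.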

Webhook readiness

If your operator serves admission webhooks, the webhook server must have its certificate loaded and be listening before traffic can flow. StartedChecker() on the webhook server provides that check:

go

mgr.AddReadyzCheck("webhook-server", mgr.GetWebhookServer().StartedChecker())

The checker returned by StartedChecker() returns nil once the server is listening. Until then the readiness probe fails, and the operator stays out of the Service endpoints, so API server admission requests go elsewhere (or fail closed, depending on the webhook's failurePolicy).
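The readiness table in Step 3 also lists a "webhook cert loaded" check. A separate, named check for the certificate files can make the ?verbose output more specific when the mounted secret is missing or malformed. The sketch below is one possible shape using only the standard library; certFilesCheck is a hypothetical helper, the directory is an assumption you should replace with your own mount path, and tls.crt / tls.key are the file names the webhook server looks for by default.

```go
package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "path/filepath"
)

// certFilesCheck returns a checker that passes only when tls.crt and tls.key
// in dir exist and parse as a matching certificate/key pair.
func certFilesCheck(dir string) func(*http.Request) error {
    crt := filepath.Join(dir, "tls.crt")
    key := filepath.Join(dir, "tls.key")
    return func(_ *http.Request) error {
        if _, err := tls.LoadX509KeyPair(crt, key); err != nil {
            return fmt.Errorf("webhook certificate not loadable: %w", err)
        }
        return nil
    }
}

func main() {
    // Assumed mount path for illustration; use whatever your Deployment mounts.
    check := certFilesCheck("/tmp/k8s-webhook-server/serving-certs")
    fmt.Println(check(nil))
}
```

In an operator you would register it next to the server check, for example under the name "webhook-cert". It reads two small local files per probe, so it stays cheap and pod-local.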

External dependency

If your operator needs an external API to function (a license server, a configuration backend), gate readiness on it:

go


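// licenseURL is a placeholder for your dependency's health endpoint.
mgr.AddReadyzCheck("license-server", func(req *http.Request) error {
    // Reuse the probe request's context so the call is cancelled with the probe.
    httpReq, err := http.NewRequestWithContext(req.Context(), http.MethodGet, licenseURL, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("license server returned %d", resp.StatusCode)
    }
    return nil
})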

Caveat: this only makes sense if your operator truly cannot work without the dependency. A transient outage of a "nice to have" external service should not cause your readiness probe to fail — treat that as a per-reconcile error you surface in .status.conditions, not as a pod-level "I'm not ready" signal.
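The same "do the slow part in the background, read a flag in the handler" idea from the cache-sync check can be applied to an external dependency, so a slow license server cannot turn into probe timeouts. This is a sketch under assumptions, not a prescribed implementation: depStatus, Record, and pollOnce are hypothetical names, and the test server in main stands in for the real dependency.

```go
package main

import (
    "errors"
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync"
    "time"
)

// depStatus caches the outcome of the most recent dependency poll so the
// probe handler never performs network I/O itself.
type depStatus struct {
    mu      sync.RWMutex
    lastErr error
}

func newDepStatus() *depStatus {
    return &depStatus{lastErr: errors.New("dependency not polled yet")}
}

// Record stores the result of one poll; nil means the dependency answered.
func (d *depStatus) Record(err error) {
    d.mu.Lock()
    defer d.mu.Unlock()
    d.lastErr = err
}

// Check has the checker signature and only reads the cached result.
func (d *depStatus) Check(_ *http.Request) error {
    d.mu.RLock()
    defer d.mu.RUnlock()
    return d.lastErr
}

// pollOnce performs a single GET and maps the response to an error.
func pollOnce(client *http.Client, url string) error {
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("dependency returned %d", resp.StatusCode)
    }
    return nil
}

func main() {
    // Stand-in for the real dependency: answers every request with 200.
    srv := httptest.NewServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
    defer srv.Close()

    status := newDepStatus()
    fmt.Println("before poll:", status.Check(nil))
    // In an operator this call would sit in a ticker loop inside a Runnable.
    status.Record(pollOnce(&http.Client{Timeout: 2 * time.Second}, srv.URL))
    fmt.Println("after poll:", status.Check(nil))
}
```

The trade-off is staleness: readiness reflects the last poll rather than the current instant, which is usually acceptable for a dependency check.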

Step 5: Probe timing settings

The four timing knobs in the probe spec:

Field	Default	Recommended for operators
`initialDelaySeconds`	0	15 (liveness), 5 (readiness)
`periodSeconds`	10	20 (liveness), 10 (readiness)
`timeoutSeconds`	1	3
`failureThreshold`	3	3 (multiplied by `periodSeconds`, this gives the total grace time)

successThreshold stays at the default 1: liveness and startup probes require that value, and raising it on readiness only delays a recovered pod from being marked ready again. The interesting interaction is between periodSeconds and failureThreshold: those two values multiplied give you the grace window before kubelet acts. A liveness probe with periodSeconds: 20 and failureThreshold: 3 gives the operator 60 seconds of "broken" before a restart, which is almost always the right trade.

Why those defaults are not enough:

initialDelaySeconds: 0 means kubelet starts probing immediately. For an operator with multiple CRD informers, the cache needs 5–15 seconds to warm up. If a liveness check depends on that warm-up, three consecutive failures (3 × 10 s, about 30 seconds) are enough for kubelet to restart the container before it finishes starting, and the cycle repeats.
timeoutSeconds: 1 is tight for a check that iterates informer state. A handler that normally answers in 800 ms only needs a small stall to cross the limit and register a spurious failure.

A good starting point:

yaml


livenessProbe:
  httpGet: { path: /healthz, port: 8081 }
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /readyz, port: 8081 }
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

Adjust upward if the operator legitimately takes longer to warm up (many CRDs, large clusters with slow caches).
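The grace-window arithmetic used throughout this step is simple enough to put in a helper when you are comparing settings. The sketch below only encodes the periodSeconds × failureThreshold product from the text; it ignores timeoutSeconds and probe jitter, so treat the results as approximate. graceSeconds is a name invented for this example.

```go
package main

import "fmt"

// graceSeconds returns the approximate time a probe can keep failing
// before kubelet acts: periodSeconds multiplied by failureThreshold.
func graceSeconds(periodSeconds, failureThreshold int) int {
    return periodSeconds * failureThreshold
}

func main() {
    fmt.Println("Kubernetes defaults (10 x 3):  ", graceSeconds(10, 3), "s")
    fmt.Println("recommended liveness (20 x 3): ", graceSeconds(20, 3), "s")
    fmt.Println("recommended readiness (10 x 3):", graceSeconds(10, 3), "s")
    fmt.Println("startup probe budget (5 x 30): ", graceSeconds(5, 30), "s")
}
```

Reading the numbers side by side makes the trade visible: the recommended liveness settings double the default window from 30 to 60 seconds, and a 5 × 30 startup probe buys 150 seconds without slowing down steady-state detection.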

Step 6: Interaction with leader election

A subtle question for operators that run with multiple replicas and leader election: should standby (non-leader) replicas be ready or not ready?

Both standbys and leader should report ready because:

Standbys serve Prometheus metrics — scrapes don't care which replica holds the lease, and you'd rather not lose a data point during failover.
Standbys may serve webhooks (controller-runtime starts the webhook server on every replica; not all webhook implementations gate on leadership).
Marking standbys "not ready" creates a no-traffic state during the ~15 s failover window — making the failover less graceful.

If you want leader-aware behaviour, do it inside specific endpoints or mgr.Add(runnable)-registered workers, not in the readiness probe.

The one exception: if your operator's /metrics endpoint reports per-CR state that the standby cannot compute (no informers), you might want metrics to come only from the leader. In that case, gate the metrics endpoint via a separate Service selector — not via readiness.
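As a sketch of what "leader-aware behaviour in a worker, not in the probe" can look like: controller-runtime decides where a component runs by checking whether it has a NeedLeaderElection() method next to Start(ctx). The type below is written against the standard library only, so it compiles on its own; leaderOnlyWorker and its fields are hypothetical, and in an operator you would hand an instance to mgr.Add.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// leaderOnlyWorker does periodic work that must run on exactly one replica.
// Its method set matches the Start / NeedLeaderElection pair the manager
// looks for on components passed to mgr.Add.
type leaderOnlyWorker struct {
    interval time.Duration
    work     func(context.Context)
}

// Start runs work every interval until ctx is cancelled.
func (w *leaderOnlyWorker) Start(ctx context.Context) error {
    ticker := time.NewTicker(w.interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return nil
        case <-ticker.C:
            w.work(ctx)
        }
    }
}

// NeedLeaderElection returning true asks the manager to start this worker
// only after the replica has acquired the lease.
func (w *leaderOnlyWorker) NeedLeaderElection() bool { return true }

func main() {
    // Local demonstration: run the worker briefly, as the manager would on the leader.
    ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
    defer cancel()

    runs := 0
    w := &leaderOnlyWorker{
        interval: 10 * time.Millisecond,
        work:     func(context.Context) { runs++ },
    }
    _ = w.Start(ctx)
    fmt.Println("runs:", runs, "leader only:", w.NeedLeaderElection())
}
```

With this split, /readyz keeps answering for this pod alone, and the lease decides which replica does the singleton work.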

Step 7: Diagnosing probe failures

When kubelet keeps restarting your pod, check three things:

1. The probe response itself

bash


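# Run the port-forward in a separate terminal; it blocks until interrupted.
kubectl -n memcached-operator-system port-forward deploy/memcached-operator-controller-manager 8081
# Quote the URL so the shell does not treat "?" as a glob character.
curl -i 'http://localhost:8081/readyz?verbose'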

This tells you which check is failing and why. If everything returns ok manually but kubelet sees failures, the issue is timing — the probe is too aggressive or the operator startup is too slow.

2. The kubelet events

bash

kubectl describe pod -n memcached-operator-system <pod-name>

Look for:

text

Warning  Unhealthy  ...  Readiness probe failed: HTTP probe failed with statuscode: 500

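A message that carries a status code means the probe server answered and a check returned an error ("explicit failure"). A message with a connection or timeout error instead means kubelet got no usable response, which points at the wrong port, a server that has not started, a network restriction, or a handler that blocks.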

3. The startup logs

bash

kubectl logs -n memcached-operator-system <pod-name> --previous=true | head -30

The first 30 lines often reveal the startup issue. If you see unable to start manager: Get "/api/v1/...", the operator never got to the AddHealthzCheck line — its API setup failed and the probes were never registered.
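When curl says "ok" but kubelet still reports failures, it can help to probe the endpoint the way kubelet does, with a hard timeout, and classify what comes back. The following is a rough diagnostic sketch, not kubelet's code: classify and its messages are invented for this example, the default URL assumes the port-forward from step 1 is running, and the one-second client timeout mirrors the timeoutSeconds default. The success range (200 up to, but not including, 400) is the documented rule for HTTP probes.

```go
package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "os"
    "time"
)

// classify maps one probe attempt to the failure classes discussed above.
func classify(status int, err error) string {
    var ne net.Error
    switch {
    case errors.As(err, &ne) && ne.Timeout():
        return "timeout: handler too slow, or timeoutSeconds too low"
    case err != nil:
        return "no response: server not listening, wrong port, or network restriction"
    case status >= 200 && status < 400:
        return "success"
    default:
        return "explicit failure: a registered check returned an error"
    }
}

func main() {
    url := "http://localhost:8081/readyz?verbose" // assumes an active port-forward
    if len(os.Args) > 1 {
        url = os.Args[1]
    }

    client := &http.Client{Timeout: time.Second} // mirrors timeoutSeconds: 1
    resp, err := client.Get(url)
    status := 0
    if err == nil {
        status = resp.StatusCode
        resp.Body.Close()
    }
    fmt.Println(classify(status, err))
}
```

Running it in a loop while the operator is under load is a cheap way to tell a timing problem from a genuinely failing check.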

Step 8: A complete production probe block

For a typical operator with webhooks:

go


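// Package level: a Runnable that runs on every replica, not only the leader.
type cacheSyncWaiter struct {
    cache cache.Cache // sigs.k8s.io/controller-runtime/pkg/cache
    ready *atomic.Bool
}

func (w *cacheSyncWaiter) Start(ctx context.Context) error {
    if !w.cache.WaitForCacheSync(ctx) {
        return fmt.Errorf("cache failed to sync")
    }
    w.ready.Store(true)
    <-ctx.Done()
    return nil
}

// Returning false keeps standbys from staying permanently not ready.
func (w *cacheSyncWaiter) NeedLeaderElection() bool { return false }

// main.go - end of setup
var cacheReady atomic.Bool
if err := mgr.Add(&cacheSyncWaiter{cache: mgr.GetCache(), ready: &cacheReady}); err != nil {
    return err
}

if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
    return err
}
if err := mgr.AddReadyzCheck("informer-cache-sync", func(_ *http.Request) error {
    if !cacheReady.Load() {
        return fmt.Errorf("informer cache not yet synced")
    }
    return nil
}); err != nil {
    return err
}
if err := mgr.AddReadyzCheck("webhook-server", mgr.GetWebhookServer().StartedChecker()); err != nil {
    return err
}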

yaml


# manager.yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: manager
    livenessProbe:
      httpGet: { path: /healthz, port: 8081 }
      initialDelaySeconds: 15
      periodSeconds: 20
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet: { path: /readyz, port: 8081 }
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:                    # optional, for very slow starts
      httpGet: { path: /readyz, port: 8081 }
      periodSeconds: 5
      failureThreshold: 30           # 5 × 30 = 150 s budget for startup

startupProbe is the cleanest way to handle an operator that legitimately takes 90 seconds to start, which is common for operators that import many CRDs or wait on a slow external backend. Until the startup probe succeeds, liveness and readiness probes are suppressed entirely; only then does normal probing begin.

If you enable startupProbe, drop initialDelaySeconds on the other two probes back to 0 (or remove the field). The startup probe already gates them, and a 15 s delay on top is dead time for nothing. The block above sets initialDelaySeconds on liveness/readiness so it works with or without the startup probe; in production you usually pick one strategy.

Reach for startupProbe only if 15 s of initialDelaySeconds isn't enough.

Common pitfalls

1. No readiness probe at all

New replicas are "ready" before informers sync. Rolling deploys have a brief no-synced-replica window. Fix: add cache sync to readiness.

2. Liveness check that calls the API server

API hiccup → every operator pod restarts simultaneously → cluster makes the outage worse. Fix: keep liveness local; use healthz.Ping.

3. `initialDelaySeconds: 0` on a slow-starting operator

Liveness fires before the operator finishes initialising, kubelet restarts the container, repeat. Stuck in CrashLoopBackOff. Fix: add initialDelaySeconds: 15-30 or use a startupProbe.

4. Single shared check on both probes

Liveness and readiness use the same function. Now any transient issue restarts the container instead of just removing it from Service. Fix: separate the checks. healthz.Ping for liveness; informer-sync + webhook-readiness for readiness.

5. Readiness depends on cluster-wide state

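A readiness probe that asks about other pods (for example "is the leader pod alive?") makes each pod's readiness depend on its peers, so one unhealthy replica can mark healthy ones unready and the pods end up waiting on each other. Fix: readiness should depend only on this pod's state.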

Pitfall cheat sheet

Symptom	Root cause	Fix
Pod stuck in `CrashLoopBackOff` immediately after deploy	`initialDelaySeconds: 0` on an operator that needs 10–15 s to warm up	Set `initialDelaySeconds: 15` on liveness, or add a `startupProbe`
Every operator pod in the cluster restarts during an API-server blip	Liveness check calls the Kubernetes API	Replace with `healthz.Ping`; move API-dependent checks to readiness
Rolling deploy briefly has zero synced replicas	No readiness probe, or `/readyz` returns 200 before cache sync	Register the flag-flip cache-sync readiness check from Step 4
`kubectl describe pod` shows readiness probes timing out (`context deadline exceeded`) under load	Probe handler blocks (e.g. calls `WaitForCacheSync` directly)	Switch to the production-grade pattern, where the handler only reads an `atomic.Bool`
Admission webhook returns "connection refused" right after a deploy	No `webhook-server` readiness check, traffic routes to a pod before TLS is loaded	Add `mgr.GetWebhookServer().StartedChecker()` to readiness
Standby replica drops out of Prometheus scrapes during failover	Readiness gated on leadership	Both leader and standby should report ready; gate leader-only work inside `mgr.Add(runnable)` instead

Frequently Asked Questions

1. What is the difference between liveness and readiness probes?

Liveness asks "is this pod alive?" If it fails, kubelet restarts the container. Use it to detect deadlocks or stuck goroutines. Readiness asks "is this pod ready to do work?" If it fails, the pod stays running but is removed from Service endpoints. Use it to gate traffic until startup is complete. For an operator with no external traffic, readiness mainly gates the rolling-deploy strategy.

2. What does controller-runtime expose by default?

Manager.HealthProbeBindAddress serves two endpoints once it is set: /healthz for liveness and /readyz for readiness. The library default is an empty string, which (like "0") leaves the probe server disabled, but the kubebuilder scaffold sets it to :8081. Both endpoints return 200 OK with no checks registered; you add specific checks with mgr.AddHealthzCheck and mgr.AddReadyzCheck. Without explicit checks they only mean "the process is up".

3. How do I add a custom health check?

Call mgr.AddHealthzCheck(name, checker) or mgr.AddReadyzCheck(name, checker) before mgr.Start. A checker is a func(req *http.Request) error - returning nil means healthy, returning an error makes the endpoint return 500 and the kubelet mark the probe failed. The simplest checker is healthz.Ping from sigs.k8s.io/controller-runtime/pkg/healthz; a richer one looks up state in your operator (cache sync flag, last reconcile timestamp, webhook cert presence).

4. When does an operator pod need a readiness probe?

Always - even though operators have no external traffic, readiness affects rolling deploys. Without a readiness probe, the new replica is added to "ready" pods the moment its container starts, even before its informer cache has synced. Kubernetes thinks the rolling deploy succeeded, terminates the old replica, and you get a brief window with no synced operator.

5. What checks should be in /readyz?

Cache sync (informers are synced), webhook server readiness and certificate availability (if webhooks are enabled), and external dependency reachability if your operator cannot work without an external API. Leadership is not a readiness criterion: the active leader and the standbys should all report ready.

6. What checks should be in /healthz?

Things that justify restarting the pod. The simplest is healthz.Ping - just answer "yes I am alive". More advanced: a "no reconcile in N minutes" check that detects a deadlocked workqueue. Avoid putting expensive checks in /healthz; the probe runs frequently and a slow check delays detection.

7. How do I tune probe timing to avoid restart loops?

initialDelaySeconds (default 0): how long to wait after container start before the first probe - set to cover slow-start scenarios (e.g. 15s for an operator with many CRDs). periodSeconds (default 10): how often to probe. failureThreshold (default 3): consecutive failures before kubelet acts. timeoutSeconds (default 1): how long each probe can take. A common starting point for liveness: 15s initial delay, 20s period, 3s timeout, 3 failures; for readiness: 5s initial delay, 10s period, 3s timeout, 3 failures.

8. Should liveness probe call the Kubernetes API?

No - your liveness probe should be local to the pod. If liveness depends on the API server, an API server outage causes all your operator pods to restart simultaneously, making the outage worse. Use healthz.Ping for liveness and put API-dependent checks in readiness only.

9. Should I use a startupProbe for my operator?

Only if 15-30 seconds of initialDelaySeconds on the liveness probe isn't enough. A startupProbe suppresses both liveness and readiness until it succeeds, which is the cleanest way to handle slow-starting operators (many CRDs, large clusters, slow external backends). When you use it, drop initialDelaySeconds on liveness/readiness back to 0 - the startup probe already gates them, and adding more delay is dead time for nothing. The typical budget is periodSeconds: 5 * failureThreshold: 30 = 150 seconds of startup grace.

10. Why does my readiness probe time out under load?

The most common cause is a probe handler that itself blocks - for example, func(req *http.Request) error { mgr.GetCache().WaitForCacheSync(req.Context()); ... }. The first call blocks until the cache syncs, which can exceed the timeoutSeconds: 1 default, and kubelet sees a timeout instead of a clean 500. Fix: do the wait once in a background Runnable that runs on every replica (one whose NeedLeaderElection returns false), set an atomic.Bool when the wait returns, and have the probe handler just read the flag in O(1). Bumping timeoutSeconds is a workaround, not a fix.

Summary

Liveness and readiness probes are how Kubernetes decides when to restart your operator and when to send it traffic. The controller-runtime Manager provides the endpoints; you register the checks. Default to healthz.Ping on liveness, the flag-flipped cache-sync check on /readyz, and the webhook server's StartedChecker() if you serve admission webhooks. Tune initialDelaySeconds to cover real startup time (or reach for a startupProbe), never put cross-pod or API-server checks in liveness, and verify behaviour with kubectl describe and curl /readyz?verbose.

Done correctly, probes give you clean rolling deploys, fast detection of stuck operators, and no restart cascades during cluster hiccups. Done wrong, they amplify outages and turn flaky pods into permanent crash loops.