The difference between an operator you can debug at 3am and one you cannot is metrics. controller-runtime gives you most of what you need for free — about thirty pre-wired metrics covering reconcile latency, workqueue depth, retry counts, and API call rates. The remaining work is wiring the scrape, picking the right four or five alerts, and adding custom metrics for the parts of your reconciler that the framework cannot see.
This guide is the production observability reference. It assumes you have already built a working operator and now want to know whether it is healthy in production. Skip the framework metrics at your peril — when an operator is misbehaving, those metrics are the ground truth.
This is part of the Advanced Capabilities chapter. Prerequisites: a scaffolded operator with controller-runtime, an understanding of the reconcile loop and workqueue, and health and readiness probes (the metrics endpoint shares the same Manager HTTP stack).
TL;DR — operator metrics in 60 seconds
controller-runtime exposes /metrics on :8080 by default. To scrape it in a production cluster running prometheus-operator:
- Expose /metrics via a Service.
- Create a ServiceMonitor pointing at that Service.
- (Optional) Front it with kube-rbac-proxy for auth + TLS.
You inherit ~30 metrics for free. Add custom metrics by registering Prometheus collectors against the metrics.Registry exported by controller-runtime. Alert on five core signals: error rate, reconcile p99, workqueue depth, 429 rate, and operator-up. Build one Grafana dashboard that mirrors the alerts.
All of that fits in a 50-line Kubebuilder scaffold + 20 lines of ServiceMonitor YAML + 5 PromQL alert rules. The total observability story for an operator is small if you start from the framework, not from scratch.
A quick analogy: a car's dashboard
Every modern car has a dashboard that gives the driver four to six numbers in real time: speed, RPM, fuel level, engine temperature, oil pressure, and a few warning lights. The dashboard is not a comprehensive view of the car — under the hood there are dozens of sensors the driver never sees. But the dashboard is the operationally relevant subset: the numbers a competent driver needs to decide "keep driving" or "pull over".
Three properties of a good dashboard:
- Cardinality is bounded. Six numbers. Five warning lights. The driver can read them at a glance.
- Each number is actionable. Speed → adjust pedal. Fuel → find a station. Temperature → pull over now.
- Sensors run all the time. The dashboard reflects current state, not historical state. No buffering, no aggregation lag.
A Kubernetes operator's metrics endpoint plays the same role. Prometheus is the dashboard. The framework metrics are the standard sensors every car ships with. Custom metrics are the ones you add because your specific car (your operator's specific reconciler logic) has a behaviour that the manufacturer never anticipated.
| Car | Operator |
|---|---|
| Speedometer | controller_runtime_reconcile_total rate |
| Tachometer (RPM) | workqueue_depth |
| Fuel gauge | rest_client_requests_total rate |
| Engine temperature | controller_runtime_reconcile_time_seconds p99 |
| Oil pressure light | up{job="myoperator"} |
| Check-engine light | controller_runtime_reconcile_errors_total rate > 0 |
| Aftermarket OBD-II scanner | Custom Prometheus metrics |
| Mechanic plugging into the ECU | Profile dumps, pprof, debug logs |
The analogy holds at the failure mode too: a check-engine light tells you something is wrong but not what. You then go to the next level (the mechanic, profiling). Metrics are the alarm; deeper tools are the diagnosis. Do not try to put pprof on the dashboard.
Prerequisites
- A working operator scaffolded by kubebuilder or operator-sdk with controller-runtime wired in.
- A Prometheus instance scraping the cluster, typically prometheus-operator with kube-prometheus-stack.
- Familiarity with the reconcile loop and workqueue — the workqueue metrics are the most actionable signal in the catalogue.
- Familiarity with watches, events, and predicates — hot-loop and missed-event symptoms are diagnosed entirely through the metrics this article describes.
Why operator metrics matter
Three reasons it is worth treating metrics as part of the operator's contract, not an afterthought:
1. Without metrics, every reconcile problem looks the same
A misbehaving operator presents identically from the outside: CRs
stop converging. Without metrics, you cannot tell whether the
reconcile loop is wedged,
the workqueue is starving, the API server is throttling you (429s),
a webhook is timing out, or the leader is dead. With the four
framework metric families (reconcile, workqueue, rest_client,
leader_election), each of those five failure modes has a
distinct signature visible at a glance.
2. The controller_runtime hot loop is invisible to logs
The most common production pathology — a reconciler that fires on
every status write it makes — produces no log noise (everything is
"success"), no error rate (everything succeeds), and no test
failure. The only place it shows up is
controller_runtime_reconcile_total climbing at thousands per
minute and rest_client_requests_total{code="429"} ticking up
later as the API server pushes back. Metrics are the only place
you see this before the API server starts throttling the operator,
which is why the cheap five-alert pack later in this
article should be mandatory, not optional.
3. SRE/Platform teams will operate the operator, not the authors
Once an operator is past the toy stage, the people on call for it
are rarely the people who wrote it. Their entire mental model of
"is this operator healthy" is built on the dashboards and alerts
you ship. An operator without
health probes, Prometheus metrics, and a runbook is, operationally,
a black box — and black boxes get rolled back at the first sign of
trouble rather than diagnosed.
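To make the hot-loop signature from reason 2 concrete, here is a minimal Go sketch of the arithmetic behind a rate-based check. It is not part of controller-runtime or Prometheus; the function names, the sample values, and the 10× factor are all illustrative assumptions.

```go
package main

import "fmt"

// perSecondRate returns the average per-second increase between two samples
// of a monotonically increasing counter taken `seconds` apart. If the counter
// went backwards (for example after a process restart), the newer value is
// treated as the whole increase.
func perSecondRate(prev, curr, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	delta := curr - prev
	if delta < 0 { // counter reset
		delta = curr
	}
	return delta / seconds
}

// looksLikeHotLoop reports whether the observed rate is at least `factor`
// times the known steady-state baseline.
func looksLikeHotLoop(rate, baseline, factor float64) bool {
	return baseline > 0 && rate >= baseline*factor
}

func main() {
	// Hypothetical samples of controller_runtime_reconcile_total, 300s apart.
	rate := perSecondRate(120_000, 165_000, 300)
	// Assumed steady state of 2 reconciles/s and an alert factor of 10.
	fmt.Printf("rate=%.1f/s hotLoop=%v\n", rate, looksLikeHotLoop(rate, 2, 10))
}
```

In practice Prometheus does this calculation for you with rate(); the sketch only shows why a threshold set well above the steady-state rate separates a hot loop from normal traffic.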
What you get for free: the framework metrics
controller-runtime registers metrics into a global Prometheus registry. The full set is documented in the Kubebuilder book; the high-value subset is:
Reconcile metrics
# Counter: how many times Reconcile() has been called, by controller and outcome.
controller_runtime_reconcile_total{controller="memcached", result="success"}
controller_runtime_reconcile_total{controller="memcached", result="error"}
controller_runtime_reconcile_total{controller="memcached", result="requeue"}
controller_runtime_reconcile_total{controller="memcached", result="requeue_after"}
# Histogram: how long each Reconcile call took.
controller_runtime_reconcile_time_seconds{controller="memcached"}
# Counter: how many Reconcile calls returned an error.
controller_runtime_reconcile_errors_total{controller="memcached"}
Use these to answer: "Is reconcile succeeding?" and "How long does reconcile take?"
Workqueue metrics
# Gauge: items waiting in the queue.
workqueue_depth{name="memcached"}
# Histogram: how long items waited before being processed.
workqueue_queue_duration_seconds{name="memcached"}
# Histogram: how long Reconcile (i.e. the work) took.
workqueue_work_duration_seconds{name="memcached"}
# Gauge: how long the oldest item in the queue has been there.
workqueue_unfinished_work_seconds{name="memcached"}
# Counter: items that were requeued for retry.
workqueue_retries_total{name="memcached"}
Use these to answer: "Is the operator falling behind?" and "Are there CRs in retry hell?"
REST client metrics
# Counter: API calls made by the operator, by HTTP method and code.
rest_client_requests_total{method="GET", code="200"}
rest_client_requests_total{method="PATCH", code="409"}
rest_client_requests_total{method="PATCH", code="429"}
# Histogram: API call latency.
rest_client_request_duration_seconds{verb="GET", url="...."}
Use these to answer: "Is the API server happy with us?" and "Are we being throttled?"
Leader election
# Gauge: 1 if this replica is the leader, 0 otherwise.
leader_election_master_status{name="memcached-operator"}
Use this to confirm that leader election is functioning and that you have exactly one leader at all times: the sum of leader_election_master_status across replicas should always be 1. A 0 means the lease expired without a successor; a 2 means split-brain. Both warrant immediate investigation.
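The three cases can be written down as a tiny decision function. This is only a sketch of the interpretation described above, with a hypothetical function name and state labels; in a real setup the sum would come from a PromQL query rather than from Go code.

```go
package main

import "fmt"

// leaderState interprets the per-replica values of
// leader_election_master_status (each 0 or 1) by summing them.
func leaderState(statuses []int) string {
	sum := 0
	for _, s := range statuses {
		sum += s
	}
	switch {
	case sum == 0:
		return "no-leader" // lease expired without a successor
	case sum == 1:
		return "ok"
	default:
		return "split-brain" // more than one replica believes it leads
	}
}

func main() {
	fmt.Println(leaderState([]int{1, 0, 0}))
	fmt.Println(leaderState([]int{0, 0, 0}))
	fmt.Println(leaderState([]int{1, 1, 0}))
}
```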
Webhook metrics (if applicable)
controller_runtime_webhook_requests_total{webhook="/mutate-v1-memcached"}
controller_runtime_webhook_latency_seconds{webhook="/mutate-v1-memcached"}
Wiring the metrics endpoint
main.go configuration:
// server is sigs.k8s.io/controller-runtime/pkg/metrics/server
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
	Metrics: server.Options{
		BindAddress:   ":8080",
		SecureServing: false,
	},
	HealthProbeBindAddress: ":8081",
})
For production, you want SecureServing = true with a TLS certificate. Kubebuilder's --secure-metrics flag scaffolds this with cert-manager-issued certificates and a kube-rbac-proxy sidecar:
# config/default/manager_auth_proxy_patch.yaml — generated by Kubebuilder
- name: kube-rbac-proxy
  image: gcr.io/kubebuilder/kube-rbac-proxy:v0.18.0
  args:
    - "--secure-listen-address=0.0.0.0:8443"
    - "--upstream=http://127.0.0.1:8080/"
    - "--tls-cert-file=/var/run/secrets/serving-cert/tls.crt"
    - "--tls-private-key-file=/var/run/secrets/serving-cert/tls.key"
    - "--logtostderr=true"
    - "--v=10"
  ports:
    - containerPort: 8443
      name: https
      protocol: TCP
The Service exposes port 8443; Prometheus scrapes that, authenticating with a token that has nonResourceURLs: ["/metrics"] permission.
Scrape configuration: ServiceMonitor
The cleanest scrape configuration uses prometheus-operator's ServiceMonitor CRD:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: memcached-operator
  namespace: memcached-operator-system
  labels:
    release: prometheus # match the Prometheus CR selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: memcached-operator
  namespaceSelector:
    matchNames:
      - memcached-operator-system
  endpoints:
    - port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true # use a real CA in production
      interval: 30s
      relabelings:
        - action: labeldrop
          regex: (pod|service|endpoint)
Three notes:
- release: prometheus must match the label your Prometheus CR's serviceMonitorSelector looks for. Without that label, the ServiceMonitor is invisible.
- bearerTokenFile authenticates Prometheus to kube-rbac-proxy. The ServiceAccount running Prometheus needs the metrics-reader ClusterRole.
- relabelings: labeldrop removes high-cardinality labels (pod and service name) that would otherwise bloat your Prometheus storage.
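As a sketch of the RBAC the second note refers to, the manifests below grant read access to the /metrics non-resource URL and bind it to a Prometheus ServiceAccount. The binding name, the ServiceAccount name, and its namespace are assumptions; substitute whatever your Prometheus installation actually uses. Kustomize-based scaffolds may also add a name prefix to the ClusterRole.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s            # assumption: your Prometheus ServiceAccount
    namespace: monitoring           # assumption: the namespace it runs in
```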
Custom metrics: when and how
The framework metrics are about how the operator runs. Custom metrics are about what your reconciler is doing. Three kinds are worth adding:
1. Business metrics — count what the operator manages, in business terms. Example: a Memcached operator might emit memcached_ready_total{namespace} and memcached_failed_total{namespace} driven by the .status.conditions of each CR. For multi-tenant operators, the namespace label is the right tenant-bounded cardinality dimension.
2. Per-controller stage timings — instrument the slow parts of your reconciler. Example: a histogram for "time spent in external API call X".
3. Outcome counters — tally specific decisions. Example: a counter for "CR moved to Degraded state, by reason".
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrl "sigs.k8s.io/controller-runtime"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Business metric: how many CRs are Ready, per namespace.
	memcachedReady = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "memcached_ready_total",
			Help: "Number of Memcached CRs that are Ready, by namespace.",
		},
		[]string{"namespace"},
	)
	// Stage timing: one histogram, labelled by reconcile stage.
	reconcileStageSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "memcached_operator_reconcile_stage_seconds",
			Help:    "Time spent in each reconcile stage.",
			Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms..~2s (12 doubling buckets)
		},
		[]string{"stage"},
	)
)

func init() {
	// Register on controller-runtime's registry so both collectors are served at /metrics.
	crmetrics.Registry.MustRegister(memcachedReady, reconcileStageSeconds)
}

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	defer func() {
		reconcileStageSeconds.WithLabelValues("total").Observe(time.Since(start).Seconds())
	}()
	// ... your reconcile body ...
	return ctrl.Result{}, nil
}
The crmetrics.Registry is the registry controller-runtime uses for its own metrics. Registering custom collectors against it means they appear at the same /metrics endpoint with no extra wiring.
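When choosing histogram buckets it helps to print the bounds you are actually asking for. The stdlib-only sketch below reimplements the documented behaviour of prometheus.ExponentialBuckets (start, then multiply by the factor for each further bucket) so it runs without dependencies; exponentialBuckets here is a stand-in written for illustration, not the library function itself.

```go
package main

import "fmt"

// exponentialBuckets returns `count` upper bounds, beginning at `start`,
// each one `factor` times the previous bound.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	for i := 0; i < count; i++ {
		buckets = append(buckets, start)
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.001, 2, 12)
	fmt.Printf("%d buckets, first=%gs, last=%gs\n", len(b), b[0], b[len(b)-1])
}
```

Twelve doubling buckets starting at 1ms top out at roughly 2.048s, so any stage that can take longer lands in the implicit +Inf bucket. If your stages can run for tens of seconds, a larger count or factor is probably needed.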
The five alerts every operator should have
# alerts.yaml: example rule group (place under spec: of a PrometheusRule when using prometheus-operator)
groups:
  - name: memcached-operator
    rules:
      - alert: OperatorReconcileErrors
        expr: |
          sum(rate(controller_runtime_reconcile_errors_total{controller="memcached"}[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Operator reconcile errors are occurring
          description: 'The memcached operator is returning errors at {{ $value }}/s.'
      - alert: OperatorReconcileSlow
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(controller_runtime_reconcile_time_seconds_bucket{controller="memcached"}[5m])
            )
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Operator p99 reconcile latency over 30s
          description: '{{ $value }}s p99; investigate slow API calls or expensive code paths.'
      - alert: OperatorWorkqueueBacklog
        expr: |
          sum(workqueue_depth{name="memcached"}) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Workqueue depth is climbing
          description: 'Backlog of {{ $value }} items; operator cannot keep up.'
      - alert: OperatorAPIThrottling
        expr: |
          sum(rate(rest_client_requests_total{code="429"}[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: API server throttling the operator
          description: 'Increase client QPS/Burst or operator priority level.'
      - alert: OperatorDown
        expr: |
          up{job="memcached-operator"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Memcached operator is not reporting metrics
          description: 'Pod is down or metrics endpoint is not reachable.'
These five cover roughly 95% of real production incidents. Resist the urge to add twenty more: alert noise is the fastest way to make sure nobody pays attention to the ones that matter.
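The p99 in OperatorReconcileSlow is an estimate derived from cumulative bucket counts, not an exact measurement. The stdlib-only sketch below shows the general idea (find the bucket that contains the target rank, then interpolate linearly inside it). It is a simplified illustration, not Prometheus's implementation; the type, function names, and sample counts are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type bucket struct {
	UpperBound float64
	Count      float64
}

// estimateQuantile approximates quantile q (0..1) from cumulative buckets that are
// sorted by UpperBound and end with a +Inf bucket.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue // target rank lies in a later bucket
		}
		if math.IsInf(b.UpperBound, 1) {
			if i == 0 {
				return math.NaN()
			}
			// Rank is beyond the last finite bound: report that bound.
			return buckets[i-1].UpperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		if b.Count == below {
			return b.UpperBound
		}
		// Linear interpolation inside the bucket.
		return lower + (b.UpperBound-lower)*(rank-below)/(b.Count-below)
	}
	return math.NaN()
}

// sampleBuckets returns made-up reconcile durations: 50 under 1s, 90 under 2s, 100 under 4s.
func sampleBuckets() []bucket {
	return []bucket{{1, 50}, {2, 90}, {4, 100}, {math.Inf(1), 100}}
}

func main() {
	fmt.Printf("p50=%.2fs p99=%.2fs\n",
		estimateQuantile(0.5, sampleBuckets()),
		estimateQuantile(0.99, sampleBuckets()))
}
```

The practical consequence is that a quantile can never be reported above the highest finite bucket bound, so a latency threshold is only meaningful if the histogram's buckets extend past it.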
A starter Grafana dashboard
The dashboard mirrors the alerts. One row per signal:
- Reconcile rate and errors: rate(controller_runtime_reconcile_total[5m]) and rate(controller_runtime_reconcile_errors_total[5m]), stacked by result.
- Reconcile latency: histogram_quantile(0.5/0.95/0.99, ...) of controller_runtime_reconcile_time_seconds_bucket.
- Workqueue depth: workqueue_depth and workqueue_unfinished_work_seconds overlaid.
- API client rate by code: rate(rest_client_requests_total[5m]) by HTTP code.
- Custom business metrics: your CR count, your specific outcomes.
The whole dashboard fits on one screen. Anything more is for deep-debug sessions and should live in a separate "diagnostic" dashboard.
Common pitfalls
Pitfall 1 — Labelling metrics with CR names.
// WRONG — explodes cardinality
reconcileTime.WithLabelValues(cr.Name).Observe(...)
With 1000 CRs, each metric becomes 1000 time series. Per metric. Use namespace or kind as labels, never instance name. If you genuinely need per-CR observability, push events instead of metrics (Kubernetes Events have their own audit trail and storage model).
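For histograms the multiplication is worse than it first looks, because every label combination carries one series per bucket plus the sum and count. The following back-of-the-envelope sketch computes an upper bound; the function name and the figures (12 explicit buckets, 1000 CRs, 20 namespaces) are illustrative assumptions.

```go
package main

import "fmt"

// histogramSeries returns the maximum number of time series one histogram can
// produce: per label combination, one _bucket series for each explicit bucket,
// one for the +Inf bucket, plus _sum and _count.
func histogramSeries(explicitBuckets int, labelCardinalities ...int) int {
	combos := 1
	for _, c := range labelCardinalities {
		combos *= c
	}
	return combos * (explicitBuckets + 3)
}

func main() {
	fmt.Println("labelled by CR name (1000 CRs):", histogramSeries(12, 1000))
	fmt.Println("labelled by namespace (20):", histogramSeries(12, 20))
}
```

Series only exist for label combinations that have actually been observed, so this is a ceiling rather than a prediction, but it shows why swapping a CR-name label for a namespace label shrinks the footprint by orders of magnitude.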
Pitfall 2 — Not exposing the metrics port via Service.
/metrics is bound on the pod, but Prometheus scrapes via Service discovery. A pod without a corresponding Service is invisible to ServiceMonitor-based scraping. Always check that kubectl get svc -n memcached-operator-system shows the operator Service.
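A minimal Service for the proxied setup described earlier might look like the sketch below. The Service name and the pod selector label are assumptions (control-plane: controller-manager is a common scaffold default, but check your Deployment); the port name is chosen to match the port: https endpoint in the ServiceMonitor shown above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: memcached-operator-metrics               # hypothetical name
  namespace: memcached-operator-system
  labels:
    app.kubernetes.io/name: memcached-operator   # matched by the ServiceMonitor selector
spec:
  selector:
    control-plane: controller-manager            # assumption: label on the operator pods
  ports:
    - name: https          # must equal the ServiceMonitor endpoint's port name
      port: 8443
      targetPort: https    # the kube-rbac-proxy container port
      protocol: TCP
```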
Pitfall 3 — Forgetting the release label on ServiceMonitor.
Most prometheus-operator installations use a serviceMonitorSelector that requires a specific label (often release: prometheus). A ServiceMonitor without it is silently ignored — no scrape errors, no Prometheus targets, just nothing. Check kubectl get prometheus -o yaml for the selector.
Pitfall 4 — Alerting on workqueue depth without context.
A momentary spike in workqueue_depth is normal — operator restart, mass CR creation. Always use for: 10m or longer so transients are filtered out. The signal you care about is sustained backlog.
Pitfall 5 — Skipping up{} alerts.
up{job="myoperator"} == 0 is the cheapest, most reliable alert. The operator pod is down, the metrics endpoint is down, the network is partitioned — any one of these surfaces. Without this alert, all other alerts can be misleading because they require working metrics to fire.
Pitfall 6 — Hand-rolling Prometheus scrape configs.
If you have prometheus-operator installed (most teams do), ServiceMonitor is the right pattern. Editing prometheus.yaml directly works but loses you discovery, defaults, and lifecycle management. Pick one pattern per cluster and stick with it.
Pitfall 7 — No metric for "what I am supposed to be managing".
Framework metrics tell you about the operator's runtime. They do not tell you "how many of my CRs are healthy". Always add at least one custom gauge: mything_ready_count, mything_failed_count. Without it, a healthy-looking operator could be doing nothing useful.
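One way to feed such gauges is to recompute the counts from the full list of CRs on a timer or at the end of a reconcile, rather than incrementing and decrementing as events arrive. The sketch below shows only that aggregation step, using a hypothetical crSummary type in place of your real CR type; setting the resulting values on Prometheus gauges is left out.

```go
package main

import "fmt"

// crSummary is a hypothetical, minimal view of a custom resource:
// its namespace and whether its Ready condition is True.
type crSummary struct {
	Namespace string
	Ready     bool
}

// counts holds the per-namespace values a ready/failed gauge pair would report.
type counts struct {
	Ready, Failed int
}

// countByNamespace tallies ready and not-ready CRs for each namespace.
func countByNamespace(crs []crSummary) map[string]counts {
	out := make(map[string]counts)
	for _, cr := range crs {
		c := out[cr.Namespace]
		if cr.Ready {
			c.Ready++
		} else {
			c.Failed++
		}
		out[cr.Namespace] = c
	}
	return out
}

func main() {
	crs := []crSummary{
		{Namespace: "team-a", Ready: true},
		{Namespace: "team-a", Ready: false},
		{Namespace: "team-b", Ready: true},
	}
	for ns, c := range countByNamespace(crs) {
		fmt.Printf("%s ready=%d failed=%d\n", ns, c.Ready, c.Failed)
	}
}
```

Recomputing from the full list means deleted CRs drop out of the totals naturally, which is harder to guarantee with increment/decrement bookkeeping.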
Pitfall cheat sheet
| Symptom | Root cause | Fix |
|---|---|---|
| Prometheus pod OOMs after a CR-name label is added | Pitfall 1: high-cardinality label (CR name, request ID) | Re-label with namespace / kind / result; emit per-CR signals as Kubernetes Events instead |
| ServiceMonitor created, no targets appear in Prometheus | Pitfall 3: missing release: prometheus label on the ServiceMonitor (or whatever the Prometheus CR's serviceMonitorSelector requires) | Run kubectl get prometheus -o yaml \| grep -A4 serviceMonitorSelector and add the matching label |
| /metrics works on the pod but Prometheus can't scrape it | Pitfall 2: no Service exposing the pod's metrics port | Create the Service, name the port https-metrics or metrics |
| workqueue_depth alert pages on every operator restart | Pitfall 4: alert without a sustained-for window | Add for: 10m (or longer) so cold-start spikes are filtered out |
| No alert fired but the operator was actually down | Pitfall 5: no up{} alert; downstream alerts rely on working scrape | Add the up{job="myop"} == 0 alert at severity: critical |
| Dashboard shows zero CRs reconciling but the operator looks fine | Pitfall 7: no business metric for "what is supposed to be managed" | Add a custom *_ready_total / *_failed_total gauge per Kind |
| controller_runtime_reconcile_total climbs at thousands/min, no real change in CRs | controller_runtime hot loop from missing GenerationChangedPredicate or unguarded Status().Update() | See watches & predicates and status & conditions |
Further reading
- Kubebuilder book: Metrics
- prometheus-operator: ServiceMonitor API
- Prometheus client_golang: godoc
- kube-rbac-proxy: Brancz/kube-rbac-proxy
- Internal: controller-runtime architecture · the reconcile loop explained · watches, events, and predicates · status subresource & conditions · health and readiness probes · leader election explained · multi-tenancy patterns · RBAC minimum permissions · Kubernetes Operator tutorial — full course hub
Frequently Asked Questions
1. What metrics does controller-runtime expose by default?
Out of the box, controller-runtime exposes ~30 metrics in three families: (1) Reconcile metrics: controller_runtime_reconcile_total, controller_runtime_reconcile_time_seconds, controller_runtime_reconcile_errors_total. (2) Workqueue metrics: workqueue_depth, workqueue_queue_duration_seconds, workqueue_work_duration_seconds, workqueue_unfinished_work_seconds, workqueue_retries_total. (3) REST client metrics: rest_client_requests_total, rest_client_request_duration_seconds. Plus leader-election state, webhook latency (if you have webhooks), and Go runtime metrics (heap, goroutines, GC).
2. How do I scrape my operator with Prometheus?
Three options: (1) prometheus-operator + ServiceMonitor, the de-facto standard on Kubernetes. Create a ServiceMonitor pointing at your operator's Service. (2) Manual Prometheus scrape config: add the operator to your prometheus.yaml static_configs or use Kubernetes service discovery. (3) Annotations: the older prometheus.io/scrape: "true" pod annotation pattern still works if you use the kubernetes-pod-prometheus scrape job. ServiceMonitor is the cleanest for production.
3. What is the metrics port and how do I expose it?
controller-runtime binds the metrics endpoint to :8080/metrics by default. Override with ctrl.Options{Metrics: server.Options{BindAddress: ":8080"}}. To expose it inside the cluster you need a Service pointing at the operator pod's port 8080 named (conventionally) https-metrics or metrics. The Service is what your ServiceMonitor or scrape config targets, not the pod directly.
4. How do I secure the metrics endpoint?
Two layers: (1) TLS: controller-runtime can serve /metrics over HTTPS using a certificate from cert-manager (Metrics.SecureServing: true plus CertDir/CertName). (2) Authentication: front the endpoint with a kube-rbac-proxy sidecar that authenticates requests using TokenReview against the API server. The Prometheus ServiceAccount must have RBAC for nonResourceURLs: ["/metrics"]. Kubebuilder scaffolds the kube-rbac-proxy sidecar via the config/default/manager_auth_proxy_patch.yaml overlay; recent kubebuilder versions can also enable TLS directly inside controller-runtime without the proxy.
5. How do I add custom metrics from my operator?
Register a prometheus.Counter, Gauge, or Histogram against the controller-runtime metrics registry: import "sigs.k8s.io/controller-runtime/pkg/metrics" and call metrics.Registry.MustRegister(myMetric) at init time. Then increment/observe inside Reconcile. The metric appears at /metrics alongside the framework metrics. Use the prometheus/client_golang types directly; no separate registry is needed.
6. What are the top five metrics I should alert on?
For most operators: (1) controller_runtime_reconcile_errors_total rate: a non-zero error rate indicates real problems. (2) controller_runtime_reconcile_time_seconds p99: slow reconciles indicate API throttling or expensive code. (3) workqueue_depth: sustained high depth indicates the operator cannot keep up. (4) rest_client_requests_total{code="429"}: the API server is throttling you. (5) up{job="myoperator"}: the operator pod is running.
7. How do I avoid Prometheus cardinality explosions?
Three rules: (1) Never label metrics with high-cardinality values (CR names, pod names, request IDs). Use labels for cardinality-bounded dimensions like result (success/error), namespace (bounded), kind (bounded). (2) Use histograms with reasonable bucket counts; 10-15 buckets is plenty. (3) Audit per-CR-instance metrics; an operator with 1000 CRs and a per-CR label has 1000× more time series. Prometheus stores them all; your Prometheus pod will OOM if you're not careful.
8. How do I detect a reconcile hot loop with metrics?
A hot loop is a reconciler that fires on every status write it makes - producing no errors and no log noise, only a controller_runtime_reconcile_total rate climbing into the thousands per minute with no corresponding change in CR count or spec. The cheapest detector is an alert on rate(controller_runtime_reconcile_total[5m]) > N where N is an order of magnitude above your steady state. The fix usually lives in the controller wiring - either a missing GenerationChangedPredicate (see watches & predicates) or an unguarded Status().Update() without an equality.Semantic.DeepEqual check (see status & conditions).
9. Why is rest_client_requests_total{code="429"} non-zero on a healthy operator?
The Kubernetes API server uses API Priority and Fairness (APF) to throttle clients that exceed their allocated capacity. A non-zero 429 rate means your operator is being throttled - either it has too many concurrent reconciles for the API budget, or it is making redundant API calls inside Reconcile. Fix in this order: (1) check for a hot loop, (2) raise QPS and Burst on the rest.Config, (3) move to client-side caches with r.Get() instead of direct API calls inside Reconcile, (4) request a higher APF priority level for the operator's ServiceAccount. A sustained 429 alert at > 0.5/s is included in the five-alert pack above.
Summary
controller-runtime gives you most of the observability you need out of the box. Thirty pre-wired metrics covering reconcile, workqueue, API client, and leader election are enough to answer 95% of "is my operator healthy" questions. The work is wiring the scrape (ServiceMonitor), picking five alerts (errors, latency, queue depth, 429s, operator-up), and adding two or three custom metrics specific to your reconciler.
The car-dashboard analogy is the whole mental model: a small set of high-value gauges, every one of them actionable. Resist the temptation to instrument everything; you will end up with dashboards nobody reads and alerts nobody trusts. Five alerts, one Grafana page, two custom metrics — and you have an operator you can run, debug, and trust at 3am.
The advanced capabilities tour ends here. With observability in place, you can confidently take the operator into production and scale it across tenants and clusters. Chapter 6 onwards (testing, CI/CD, OLM packaging) is the next frontier.

