Operator Metrics with Prometheus: Wiring controller-runtime for Observability

The difference between an operator you can debug at 3am and one you cannot is metrics. controller-runtime gives you most of what you need for free — about thirty pre-wired metrics covering reconcile latency, workqueue depth, retry counts, and API call rates. The remaining work is wiring the scrape, picking the right four or five alerts, and adding custom metrics for the parts of your reconciler that the framework cannot see.

This guide is the production observability reference. It assumes you have already built a working operator and now want to know whether it is healthy in production. Skip the framework metrics at your peril — when an operator is misbehaving, those metrics are the ground truth.

This is part of the Advanced Capabilities chapter. Prerequisites: a scaffolded operator with controller-runtime, an understanding of the reconcile loop and workqueue, and health and readiness probes (the metrics endpoint shares the same Manager HTTP stack).


TL;DR — operator metrics in 60 seconds

controller-runtime exposes /metrics on :8080 by default. To scrape it in a production cluster running prometheus-operator:

  1. Expose /metrics via a Service.
  2. Create a ServiceMonitor pointing at that Service.
  3. (Optional) Front it with kube-rbac-proxy for auth + TLS.

You inherit ~30 metrics for free. Add custom metrics by registering Prometheus collectors against the metrics.Registry exported by controller-runtime. Alert on five core signals: error rate, reconcile p99, workqueue depth, 429 rate, and operator-up. Build one Grafana dashboard that mirrors the alerts.

All of that fits in a 50-line Kubebuilder scaffold + 20 lines of ServiceMonitor YAML + 5 PromQL alert rules. The total observability story for an operator is small if you start from the framework, not from scratch.


A quick analogy: a car's dashboard

Every modern car has a dashboard that gives the driver four to six numbers in real time: speed, RPM, fuel level, engine temperature, oil pressure, and a few warning lights. The dashboard is not a comprehensive view of the car — under the hood there are dozens of sensors the driver never sees. But the dashboard is the operationally relevant subset: the numbers a competent driver needs to decide "keep driving" or "pull over".

Three properties of a good dashboard:

  1. Cardinality is bounded. Six numbers. Five warning lights. The driver can read them at a glance.
  2. Each number is actionable. Speed → adjust pedal. Fuel → find a station. Temperature → pull over now.
  3. Sensors run all the time. The dashboard reflects current state, not historical state. No buffering, no aggregation lag.

A Kubernetes operator's metrics endpoint plays the same role. Prometheus is the dashboard. The framework metrics are the standard sensors every car ships with. Custom metrics are the ones you add because your specific car (your operator's specific reconciler logic) has a behaviour that the manufacturer never anticipated.

| Car | Operator |
| --- | --- |
| Speedometer | controller_runtime_reconcile_total rate |
| Tachometer (RPM) | workqueue_depth |
| Fuel gauge | rest_client_requests_total rate |
| Engine temperature | controller_runtime_reconcile_time_seconds p99 |
| Oil pressure light | up{job="myoperator"} |
| Check-engine light | controller_runtime_reconcile_errors_total rate > 0 |
| Aftermarket OBD-II scanner | Custom Prometheus metrics |
| Mechanic plugging into the ECU | Profile dumps, pprof, debug logs |

The analogy holds at the failure mode too: a check-engine light tells you something is wrong but not what. You then go to the next level (the mechanic, profiling). Metrics are the alarm; deeper tools are the diagnosis. Do not try to put pprof on the dashboard.

[Figure: operator metrics visualised as a car dashboard with four gauges (RPM: controller_runtime_reconcile_total; fuel: process_resident_memory_bytes; engine temperature: controller_runtime_reconcile_errors_total; speedometer: controller_runtime_reconcile_time_seconds p95), arrows pointing up to a Prometheus icon, and a red "error budget burn" warning light below. Caption: bounded cardinality, actionable readings, real-time sensors; the dashboard is the alarm, profiling is the diagnosis.]


Why operator metrics matter

Three reasons it is worth treating metrics as part of the operator's contract, not an afterthought:

1. Without metrics, every reconcile problem looks the same

A misbehaving operator presents identically from the outside: CRs stop converging. Without metrics, you cannot tell whether the reconcile loop is wedged, the workqueue is starving, the API server is throttling you (429s), a webhook is timing out, or the leader is dead. With the five framework metric families (reconcile, workqueue, rest_client, leader_election, webhook), each of those five failure modes has a distinct signature visible at a glance.

2. The controller_runtime hot loop is invisible to logs

The most common production pathology, a reconciler that fires on every status write it makes, produces no log noise (everything is "success"), no error rate (everything succeeds), and no test failure. The only places it shows up are controller_runtime_reconcile_total climbing at thousands per minute and, later, rest_client_requests_total{code="429"} ticking up as the API server pushes back. Metrics are the only early warning you get, which is why the cheap five-alert pack later in this article should be mandatory, not optional.

3. SRE/Platform teams will operate the operator, not the authors

Once an operator is past the toy stage, the people on call for it are rarely the people who wrote it. Their entire mental model of "is this operator healthy" is built on the dashboards and alerts you ship. An operator without health probes, Prometheus metrics, and a runbook is, operationally, a black box, and black boxes get rolled back at the first sign of trouble rather than diagnosed.


What you get for free: the framework metrics

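controller-runtime registers its metrics in a package-level Prometheus registry (metrics.Registry in sigs.k8s.io/controller-runtime/pkg/metrics), which the Manager serves at /metrics. The full set is documented in the Kubebuilder book; the high-value subset is: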

Reconcile metrics

text
# Counter: how many times Reconcile() has been called, by controller and outcome.
controller_runtime_reconcile_total{controller="memcached", result="success"}
controller_runtime_reconcile_total{controller="memcached", result="error"}
controller_runtime_reconcile_total{controller="memcached", result="requeue"}
controller_runtime_reconcile_total{controller="memcached", result="requeue_after"}

# Histogram: how long each Reconcile call took.
controller_runtime_reconcile_time_seconds{controller="memcached"}

# Counter: how many Reconcile calls returned an error.
controller_runtime_reconcile_errors_total{controller="memcached"}

Use these to answer: "Is reconcile succeeding?" and "How long does reconcile take?"

Workqueue metrics

text
# Gauge: items waiting in the queue.
workqueue_depth{name="memcached"}

# Histogram: how long items waited before being processed.
workqueue_queue_duration_seconds{name="memcached"}

# Histogram: how long Reconcile (i.e. the work) took.
workqueue_work_duration_seconds{name="memcached"}

# Gauge: seconds of work currently in progress that work_duration has not yet observed; large values point to stuck reconciles.
workqueue_unfinished_work_seconds{name="memcached"}

# Counter: items that were requeued for retry.
workqueue_retries_total{name="memcached"}

Use these to answer: "Is the operator falling behind?" and "Are there CRs in retry hell?"

REST client metrics

text
# Counter: API calls made by the operator, by HTTP method and code.
rest_client_requests_total{method="GET", code="200"}
rest_client_requests_total{method="PATCH", code="409"}
rest_client_requests_total{method="PATCH", code="429"}

# Histogram: API call latency (recent controller-runtime releases may not register this by default; check your /metrics output).
rest_client_request_duration_seconds{verb="GET", url="...."}

Use these to answer: "Is the API server happy with us?" and "Are we being throttled?"

Leader election

text
# Gauge: 1 if this replica is the leader, 0 otherwise.
leader_election_master_status{name="memcached-operator"}

Use to confirm leader election is functioning and you have exactly one leader at all times — the sum of leader_election_master_status across replicas should always be 1. A 0 means the lease expired without a successor; a 2 means split-brain. Both warrant immediate investigation.

Webhook metrics (if applicable)

text
controller_runtime_webhook_requests_total{webhook="/mutate-v1-memcached"}
controller_runtime_webhook_latency_seconds{webhook="/mutate-v1-memcached"}

Wiring the metrics endpoint

main.go configuration:

go
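import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// Inside main(): serve /metrics on :8080 and the health probes on :8081.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    Metrics: server.Options{
        BindAddress:   ":8080", // "0" disables the metrics server
        SecureServing: false,   // plain HTTP
    },
    HealthProbeBindAddress: ":8081",
})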

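For production, the endpoint should be served over TLS and require authentication. There are two ways to get there. Recent Kubebuilder scaffolds set SecureServing: true in the manager itself (toggled by the scaffolded --metrics-secure flag) and use cert-manager-issued certificates. Older scaffolds keep the manager on plain HTTP and put a kube-rbac-proxy sidecar in front of it: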

yaml
# config/default/manager_auth_proxy_patch.yaml — generated by Kubebuilder
- name: kube-rbac-proxy
  image: gcr.io/kubebuilder/kube-rbac-proxy:v0.18.0
  args:
    - "--secure-listen-address=0.0.0.0:8443"
    - "--upstream=http://127.0.0.1:8080/"
    - "--tls-cert-file=/var/run/secrets/serving-cert/tls.crt"
    - "--tls-private-key-file=/var/run/secrets/serving-cert/tls.key"
    - "--logtostderr=true"
    - "--v=10"
  ports:
    - containerPort: 8443
      name: https
      protocol: TCP

The Service exposes port 8443. Prometheus scrapes that port and authenticates with its ServiceAccount token; that ServiceAccount needs RBAC granting get on nonResourceURLs: ["/metrics"].


Scrape configuration: ServiceMonitor

The cleanest scrape configuration uses prometheus-operator's ServiceMonitor CRD:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: memcached-operator
  namespace: memcached-operator-system
  labels:
    release: prometheus  # match the Prometheus CR selector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: memcached-operator
  namespaceSelector:
    matchNames:
      - memcached-operator-system
  endpoints:
    - port: https
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true  # use a real CA in production
      interval: 30s
      relabelings:
        - action: labeldrop
          regex: (pod|service|endpoint)

Three notes:

  • release: prometheus must match the label your Prometheus CR's serviceMonitorSelector looks for. Without that label, the ServiceMonitor is invisible.
  • bearerTokenFile authenticates Prometheus to kube-rbac-proxy. The ServiceAccount running Prometheus needs the metrics-reader ClusterRole.
  • relabelings: labeldrop removes high-cardinality labels (pod and service name) that would otherwise bloat your Prometheus storage.
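To make the labeldrop note concrete, here is a minimal sketch of what that action does to a target's labels. It is plain Go, not Prometheus code: labelDrop is a hypothetical helper, the label values are invented, and pod_ip is an illustrative label name. The one behaviour it relies on is that Prometheus anchors relabel regexes at both ends and matches them against label names.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// labelDrop mimics the labeldrop relabel action: every label whose NAME fully
// matches the pattern is removed. The ^(?:...)$ wrapper reproduces the
// implicit anchoring Prometheus applies to relabel regexes.
func labelDrop(labels map[string]string, pattern string) map[string]string {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	kept := make(map[string]string, len(labels))
	for name, value := range labels {
		if !re.MatchString(name) {
			kept[name] = value
		}
	}
	return kept
}

func main() {
	// Hypothetical target labels, roughly what service discovery might attach.
	target := map[string]string{
		"namespace": "memcached-operator-system",
		"pod":       "memcached-operator-7d9f-abcde",
		"service":   "memcached-operator-metrics",
		"endpoint":  "https",
		"pod_ip":    "10.0.0.12", // kept: the regex must match the whole name
	}
	kept := labelDrop(target, "(pod|service|endpoint)")

	names := make([]string, 0, len(kept))
	for name := range kept {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names) // [namespace pod_ip]
}
```

Because the match is anchored, a pattern such as (pod|service|endpoint) never removes labels that merely start with one of those words.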

Custom metrics: when and how

The framework metrics are about how the operator runs. Custom metrics are about what your reconciler is doing. Three kinds are worth adding:

1. Business metrics — count what the operator manages, in business terms. Example: a Memcached operator might emit memcached_ready_total{namespace} and memcached_failed_total{namespace} driven by the .status.conditions of each CR. For multi-tenant operators, the namespace label is the right tenant-bounded cardinality dimension.

2. Per-controller stage timings — instrument the slow parts of your reconciler. Example: a histogram for "time spent in external API call X".

3. Outcome counters — tally specific decisions. Example: a counter for "CR moved to Degraded state, by reason".

go
import (
    "context"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    ctrl "sigs.k8s.io/controller-runtime"
    crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
    // Gauge: set from the count of CRs whose Ready condition is True.
    memcachedReady = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "memcached_ready_total",
            Help: "Number of Memcached CRs that are Ready, by namespace.",
        },
        []string{"namespace"},
    )
    reconcileStageSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "memcached_operator_reconcile_stage_seconds",
            Help:    "Time spent in each reconcile stage.",
            Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms..~2s (top bucket 2.048s)
        },
        []string{"stage"},
    )
)

func init() {
    // Register on controller-runtime's registry so both appear at /metrics.
    crmetrics.Registry.MustRegister(memcachedReady, reconcileStageSeconds)
}

func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    start := time.Now()
    defer func() {
        reconcileStageSeconds.WithLabelValues("total").Observe(time.Since(start).Seconds())
    }()
    // ... your reconcile body ...
    return ctrl.Result{}, nil
}

The crmetrics.Registry is the registry controller-runtime uses for its own metrics. Registering custom collectors against it means they appear at the same /metrics endpoint with no extra wiring.
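Bucket choice decides how much resolution a stage-timing histogram has, so it is worth checking the arithmetic before shipping. The sketch below reproduces what prometheus.ExponentialBuckets(0.001, 2, 12) computes, written as a standalone helper (exponentialBuckets is a stand-in, not the library function) so it runs without client_golang.

```go
package main

import "fmt"

// exponentialBuckets mirrors the arithmetic of an exponential bucket layout:
// count upper bounds, the first equal to start, each later one multiplied by
// factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(0.001, 2, 12)
	fmt.Printf("first=%.3fs last=%.3fs count=%d\n", b[0], b[len(b)-1], len(b))
	// first=0.001s last=2.048s count=12
}
```

With these parameters the largest finite bucket is 2.048s. Any stage that takes longer lands in the +Inf bucket, so if your reconciler makes calls that can take several seconds, a larger count or factor is probably the better fit.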


The five alerts every operator should have

yaml
# alerts.yaml: example rule groups (with prometheus-operator, nest "groups" under "spec" of a PrometheusRule)
groups:
  - name: memcached-operator
    rules:

      - alert: OperatorReconcileErrors
        expr: |
          sum(rate(controller_runtime_reconcile_errors_total{controller="memcached"}[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Operator reconcile errors are occurring
          description: The memcached operator is returning errors at {{ $value }}/s.

      - alert: OperatorReconcileSlow
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(controller_runtime_reconcile_time_seconds_bucket{controller="memcached"}[5m])
            )
          ) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Operator p99 reconcile latency over 30s
          description: '{{ $value }}s p99 — investigate slow API calls or expensive code paths.'

      - alert: OperatorWorkqueueBacklog
        expr: |
          sum(workqueue_depth{name="memcached"}) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Workqueue depth is climbing
          description: 'Backlog of {{ $value }} items — operator cannot keep up.'

      - alert: OperatorAPIThrottling
        expr: |
          sum(rate(rest_client_requests_total{job="memcached-operator", code="429"}[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: API server throttling the operator
          description: 'Check for a reconcile hot loop, then review client QPS/Burst and the APF priority level.'

      - alert: OperatorDown
        expr: |
          up{job="memcached-operator"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Memcached operator is not reporting metrics
          description: 'Pod is down or metrics endpoint is not reachable.'

In practice, these five cover the large majority of production incidents. Resist the urge to add twenty more: alert noise is the fastest way to make sure nobody pays attention to the ones that matter.


A starter Grafana dashboard

The dashboard mirrors the alerts. One row per signal:

  1. Reconcile rate and errors: rate(controller_runtime_reconcile_total[5m]) and rate(controller_runtime_reconcile_errors_total[5m]), stacked by result.
  2. Reconcile latency: histogram_quantile(0.5/0.95/0.99, ...) of controller_runtime_reconcile_time_seconds_bucket.
  3. Workqueue depth: workqueue_depth and workqueue_unfinished_work_seconds overlaid.
  4. API client rate by code: rate(rest_client_requests_total[5m]) by HTTP code.
  5. Custom business metrics: your CR count, your specific outcomes.

The whole dashboard fits on one screen. Anything more is for deep-debug sessions and should live in a separate "diagnostic" dashboard.
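The latency row and the OperatorReconcileSlow alert both depend on histogram_quantile, which estimates a quantile from cumulative bucket counts rather than from raw observations. The following is a simplified sketch of that estimate for classic histograms, assuming linear interpolation inside the bucket that contains the target rank. The bucket type, the sample counts, and the function names are invented for illustration, and edge cases the real PromQL function handles are left out.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type bucket struct {
	UpperBound float64
	Count      float64
}

// sampleBuckets returns hypothetical reconcile-latency buckets (in seconds).
func sampleBuckets() []bucket {
	return []bucket{{1, 180}, {5, 196}, {30, 199}, {60, 200}, {math.Inf(1), 200}}
}

// histogramQuantile estimates quantile q (0 < q < 1) from buckets sorted by
// upper bound, the last of which must be the +Inf bucket.
func histogramQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].Count >= rank })
	if i == len(buckets)-1 {
		// Rank falls in +Inf: the best available answer is the highest finite bound.
		return buckets[len(buckets)-2].UpperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	inBucket := buckets[i].Count - below
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly between lower and the upper bound.
	return lower + (buckets[i].UpperBound-lower)*(rank-below)/inBucket
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p95=%.2fs p99=%.2fs\n", histogramQuantile(0.95, b), histogramQuantile(0.99, b))
	// p95=3.50s p99=21.67s
}
```

The p99 here is an interpolation across a bucket that spans 5s to 30s, so the true value could be anywhere in that range. That is worth remembering when an alert threshold sits inside a wide bucket.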


Common pitfalls

Pitfall 1 — Labelling metrics with CR names.

go
// WRONG — explodes cardinality
reconcileTime.WithLabelValues(cr.Name).Observe(...)

With 1000 CRs, every metric carrying that label becomes at least 1000 time series, and histograms multiply that again by their bucket count. Use namespace or kind as labels, never instance name. If you genuinely need per-CR observability, emit Kubernetes Events instead of metrics (Events have their own audit trail and storage model).
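A quick back-of-the-envelope calculation shows how fast this grows for a histogram. The sketch assumes a classic Prometheus histogram, which exposes one series per configured bucket, one +Inf bucket, plus _sum and _count for every label combination. The helper name and the figure of 20 namespaces are illustrative.

```go
package main

import "fmt"

// histogramSeries estimates the number of time series produced by one classic
// histogram: label combinations multiplied by (buckets + 3), where the 3
// accounts for the +Inf bucket, _sum, and _count.
func histogramSeries(buckets int, labelCardinalities ...int) int {
	combos := 1
	for _, c := range labelCardinalities {
		combos *= c
	}
	return combos * (buckets + 3)
}

func main() {
	const buckets = 12
	fmt.Println(histogramSeries(buckets, 20))   // labelled by namespace, 20 namespaces: 300
	fmt.Println(histogramSeries(buckets, 1000)) // labelled by CR name, 1000 CRs: 15000
}
```

The same histogram costs 300 series when labelled by namespace and 15000 when labelled by CR name, and that is for a single metric.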

Pitfall 2 — Not exposing the metrics port via Service.

/metrics is bound on the pod, but Prometheus scrapes via Service discovery. A pod without a corresponding Service is invisible to ServiceMonitor-based scraping. Always check that kubectl get svc -n memcached-operator-system shows the operator Service.

Pitfall 3 — Forgetting the release label on ServiceMonitor.

Most prometheus-operator installations use a serviceMonitorSelector that requires a specific label (often release: prometheus). A ServiceMonitor without it is silently ignored — no scrape errors, no Prometheus targets, just nothing. Check kubectl get prometheus -o yaml for the selector.

Pitfall 4 — Alerting on workqueue depth without context.

A momentary spike in workqueue_depth is normal — operator restart, mass CR creation. Always use for: 10m or longer so transients are filtered out. The signal you care about is sustained backlog.

Pitfall 5 — Skipping up{} alerts.

up{job="myoperator"} == 0 is the cheapest, most reliable alert. The operator pod is down, the metrics endpoint is down, the network is partitioned — any one of these surfaces. Without this alert, all other alerts can be misleading because they require working metrics to fire.

Pitfall 6 — Hand-rolling Prometheus scrape configs.

If you have prometheus-operator installed (most teams do), ServiceMonitor is the right pattern. Editing prometheus.yaml directly works, but you give up automatic target discovery, sensible defaults, and lifecycle management. Pick one pattern per cluster and stick with it.

Pitfall 7 — No metric for "what I am supposed to be managing".

Framework metrics tell you about the operator's runtime. They do not tell you "how many of my CRs are healthy". Always add at least one custom gauge: mything_ready_count, mything_failed_count. Without it, a healthy-looking operator could be doing nothing useful.
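One way to feed such a gauge is to recount on a timer or at the end of each reconcile. The sketch below shows only the counting step, in plain Go; memcachedSummary is a hypothetical trimmed-down view of a CR, and in a real operator the slice would come from listing CRs through the cached client and reading their Ready condition.

```go
package main

import "fmt"

// memcachedSummary is a hypothetical reduced view of a CR: its namespace and
// whether its Ready condition is True.
type memcachedSummary struct {
	Namespace string
	Ready     bool
}

// readyByNamespace counts Ready CRs per namespace. Every namespace that holds
// at least one CR gets an entry, so a namespace where everything is failing
// reports 0 instead of vanishing from the metric.
func readyByNamespace(crs []memcachedSummary) map[string]int {
	counts := make(map[string]int)
	for _, cr := range crs {
		if _, seen := counts[cr.Namespace]; !seen {
			counts[cr.Namespace] = 0
		}
		if cr.Ready {
			counts[cr.Namespace]++
		}
	}
	return counts
}

func main() {
	crs := []memcachedSummary{
		{Namespace: "team-a", Ready: true},
		{Namespace: "team-a", Ready: true},
		{Namespace: "team-b", Ready: false},
	}
	for ns, n := range readyByNamespace(crs) {
		// With client_golang, this is where a GaugeVec would be updated,
		// for example gauge.WithLabelValues(ns).Set(float64(n)).
		fmt.Printf("ready{namespace=%q} = %d\n", ns, n)
	}
}
```

Reporting an explicit 0 matters: a series that disappears is much harder to alert on than one that drops to zero. If namespaces can be deleted, calling Reset() on the GaugeVec before re-setting the values keeps stale series from lingering.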

Pitfall cheat sheet

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Prometheus pod OOMs after a CR-name label is added | Pitfall 1: high-cardinality label (CR name, request ID) | Re-label with namespace / kind / result; emit per-CR signals as Kubernetes Events instead |
| ServiceMonitor created, no targets appear in Prometheus | Pitfall 3: missing release: prometheus label on the ServiceMonitor (or whatever the Prometheus CR's serviceMonitorSelector requires) | Inspect serviceMonitorSelector in the output of kubectl get prometheus -o yaml and add the matching label |
| /metrics works on the pod but Prometheus can't scrape it | Pitfall 2: no Service exposing the pod's metrics port | Create the Service and name the port https-metrics or metrics, matching the port the ServiceMonitor references |
| workqueue_depth alert pages on every operator restart | Pitfall 4: alert without a sustained-for window | Add for: 10m (or longer) so cold-start spikes are filtered out |
| No alert fired even though the operator was down | Pitfall 5: no up{} alert; every other alert relies on a working scrape | Add the up{job="myop"} == 0 alert at severity: critical |
| Dashboard shows zero CRs reconciling but the operator looks fine | Pitfall 7: no business metric for "what is supposed to be managed" | Add a custom *_ready_total / *_failed_total gauge per Kind |
| controller_runtime_reconcile_total climbs at thousands/min, no real change in CRs | Reconcile hot loop from a missing GenerationChangedPredicate or an unguarded Status().Update() | See watches & predicates and status & conditions |


Frequently Asked Questions

1. What metrics does controller-runtime expose by default?

Out of the box, controller-runtime exposes ~30 metrics in three families: (1) Reconcile metrics — controller_runtime_reconcile_total, controller_runtime_reconcile_time_seconds, controller_runtime_reconcile_errors_total. (2) Workqueue metrics — workqueue_depth, workqueue_queue_duration_seconds, workqueue_work_duration_seconds, workqueue_unfinished_work_seconds, workqueue_retries_total. (3) REST client metrics — rest_client_requests_total, rest_client_request_duration_seconds. Plus leader-election state, webhook latency (if you have webhooks), and Go runtime metrics (heap, goroutines, GC).

2. How do I scrape my operator with Prometheus?

Three options: (1) prometheus-operator + ServiceMonitor — the de-facto standard on Kubernetes. Create a ServiceMonitor pointing at your operator's Service. (2) Manual Prometheus scrape config — add the operator to your prometheus.yaml static_configs or use Kubernetes service discovery. (3) Annotations — the older prometheus.io/scrape: "true" pod annotation pattern still works if you use the kubernetes-pod-prometheus scrape job. ServiceMonitor is the cleanest for production.

3. What is the metrics port and how do I expose it?

controller-runtime binds the metrics endpoint to :8080/metrics by default. Override it with ctrl.Options{Metrics: server.Options{BindAddress: ":9090"}} (any free port; "0" disables the metrics server). To expose it inside the cluster you need a Service pointing at the operator pod's metrics port, with the Service port named (conventionally) https-metrics or metrics. The Service is what your ServiceMonitor or scrape config targets, not the pod directly.

4. How do I secure the metrics endpoint?

Two layers: (1) TLS — controller-runtime can serve /metrics over HTTPS using a certificate from cert-manager (Metrics.SecureServing: true plus CertDir/CertName). (2) Authentication — front the endpoint with a kube-rbac-proxy sidecar that authenticates requests using TokenReview against the API server. The Prometheus ServiceAccount must have RBAC for nonResourceURLs: ["/metrics"]. Kubebuilder scaffolds the kube-rbac-proxy sidecar via the config/default/manager_auth_proxy_patch.yaml overlay; recent kubebuilder versions can also enable TLS directly inside controller-runtime without the proxy.

5. How do I add custom metrics from my operator?

Register a prometheus.Counter, Gauge, or Histogram against the controller-runtime metrics registry: import "sigs.k8s.io/controller-runtime/pkg/metrics" and call metrics.Registry.MustRegister(myMetric) at init time. Then increment/observe inside Reconcile. The metric appears at /metrics alongside the framework metrics. Use the prometheus/client_golang types directly — no separate registry needed.

6. What are the top five metrics I should alert on?

For most operators: (1) controller_runtime_reconcile_errors_total rate — non-zero error rate indicates real problems. (2) controller_runtime_reconcile_time_seconds p99 — slow reconciles indicate API throttling or expensive code. (3) workqueue_depth — sustained high depth indicates the operator cannot keep up. (4) rest_client_requests_total{code="429"} — the API server is throttling you. (5) up{job="myoperator"} — the operator pod is running.

7. How do I avoid Prometheus cardinality explosions?

Three rules: (1) never label metrics with high-cardinality values (CR names, pod names, request IDs). Use labels for cardinality-bounded dimensions like result (success/error), namespace (bounded), kind (bounded). (2) Use histograms with reasonable bucket counts — 10-15 buckets is plenty. (3) Audit per-CR-instance metrics; an operator with 1000 CRs and a per-CR label has 1000× more time series. Prometheus stores them all; your Prometheus pod will OOM if you're not careful.

8. How do I detect a reconcile hot loop with metrics?

A hot loop is a reconciler that fires on every status write it makes - producing no errors and no log noise, only a controller_runtime_reconcile_total rate climbing into the thousands per minute with no corresponding change in CR count or spec. The cheapest detector is an alert on rate(controller_runtime_reconcile_total[5m]) > N where N is an order of magnitude above your steady state. The fix usually lives in the controller wiring - either a missing GenerationChangedPredicate (see watches & predicates) or an unguarded Status().Update() without an equality.Semantic.DeepEqual check (see status & conditions).
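The threshold logic behind that alert can be sketched in a few lines. This is a deliberately simplified stand-in for PromQL's rate(): it takes two counter samples and ignores counter resets and extrapolation. All numbers and both function names are hypothetical.

```go
package main

import "fmt"

// perSecondRate returns the counter increase between two samples divided by
// the elapsed seconds. It returns 0 for a non-positive interval or a counter
// that went backwards (a reset, which real rate() handles properly).
func perSecondRate(prev, cur, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 || cur < prev {
		return 0
	}
	return (cur - prev) / elapsedSeconds
}

// looksLikeHotLoop flags a reconcile rate that is at least factor times the
// steady-state baseline.
func looksLikeHotLoop(rate, baseline, factor float64) bool {
	return rate >= baseline*factor
}

func main() {
	// Assumed steady state: 0.5 reconciles/s. Observed: the counter moved
	// from 10000 to 25000 over a 5-minute window.
	rate := perSecondRate(10000, 25000, 300)
	fmt.Printf("rate=%.1f/s hotLoop=%v\n", rate, looksLikeHotLoop(rate, 0.5, 10))
	// rate=50.0/s hotLoop=true
}
```

Picking the baseline from a week of normal traffic, rather than a guess, keeps the alert from firing during legitimate bursts such as a mass CR rollout.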

9. Why is rest_client_requests_total{code="429"} non-zero on a healthy operator?

The Kubernetes API server uses API Priority and Fairness (APF) to throttle clients that exceed their allocated capacity. A non-zero 429 rate means your operator is being throttled - either it has too many concurrent reconciles for the API budget, or it is making redundant API calls inside Reconcile. Fix in this order: (1) check for a hot loop, (2) review client QPS and Burst on the rest.Config; these are client-side limits, so raising them only helps if the API server has headroom, (3) move to client-side caches with r.Get() instead of direct API calls inside Reconcile, (4) request a higher APF priority level for the operator's ServiceAccount. A sustained 429 alert at > 0.5/s is included in the five-alert pack above.

Summary

controller-runtime gives you most of the observability you need out of the box. About thirty pre-wired metrics covering reconcile, workqueue, API client, and leader election are enough to answer most "is my operator healthy" questions. The work is wiring the scrape (ServiceMonitor), picking five alerts (errors, latency, queue depth, 429s, operator-up), and adding two or three custom metrics specific to your reconciler.

The car-dashboard analogy is the whole mental model: a small set of high-value gauges, every one of them actionable. Resist the temptation to instrument everything; you will end up with dashboards nobody reads and alerts nobody trusts. Five alerts, one Grafana page, two custom metrics — and you have an operator you can run, debug, and trust at 3am.

The advanced capabilities tour ends here. With observability in place, you can confidently take the operator into production and scale it across tenants and clusters. The rest of the series, from Chapter 6 onwards (testing, CI/CD, OLM packaging), is the next frontier.

Deepak Prasad

R&D Engineer
