Operator Leader Election Explained: HA Operators with Lease Locks


A single replica of your operator works fine until the node it runs on reboots, the pod gets evicted, or you roll out a bad image. The cluster then has no operator running, and CRs sit unreconciled until a replacement pod is scheduled and started elsewhere: minutes of silent drift while reconcile work piles up. The fix is more than one replica, but two replicas of the same controller fighting over the same CRs is a recipe for split-brain. Leader election reconciles those two facts: run several controller-runtime replicas for availability, but let exactly one be active at any time.

This article covers the lease-based lock, the manager options, the three timing knobs, the Downward API pattern for clean lease identities, and how to actually test failover.


TL;DR — operator leader election in 60 seconds

Three changes to enable leader election in a kubebuilder-scaffolded operator (most of which kubebuilder init already wires for you):

go
// main.go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    LeaderElection:                true,
    LeaderElectionID:              "memcached-operator.cache.example.com",
    LeaderElectionResourceLock:    "leases",      // the default lock in current controller-runtime releases
    LeaderElectionReleaseOnCancel: true,
})
yaml
# config/manager/manager.yaml
spec:
  replicas: 2     # or 3 - any number ≥ 2
  template:
    spec:
      containers:
      - name: manager
        args:
        - --leader-elect=true     # kubebuilder scaffolds this flag
yaml
# config/rbac/leader_election_role.yaml — scaffolded by kubebuilder
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get","list","watch","create","update","patch","delete"]

Now kubectl get pods -n memcached-operator-system shows two replicas, and kubectl get lease -n memcached-operator-system shows one Lease object (API group coordination.k8s.io) whose holderIdentity is the active leader. The rest of this article explains the Lease object in detail, the manager options, the three timing knobs, the Downward API trick for clean lease identities, the pitfall cheat sheet, and how to actually test failover.


A quick analogy: the on-call pager

A 24/7 ops team has a pager. The rules:

  1. Exactly one person holds the pager at any time. That person answers all incidents.
  2. The pager is handed off every shift. The new holder explicitly takes over.
  3. If the pager-holder is unreachable for 15 minutes, someone else picks it up. The team posts a Slack message ("@on-call missing — taking over") and the new person becomes the pager-holder.

Map this to leader election:

| Pager rotation | Operator leader election |
| --- | --- |
| The pager (a single physical object) | A Lease object in etcd |
| Holding the pager | Being the active leader |
| Hand-off at shift change | Graceful release on shutdown |
| "Unreachable for 15 min → pick it up" | LeaseDuration timeout → another replica wins |
| Multiple ops staff, only one paged | Multiple replicas, only one runs controllers |

The lease is the pager. The Kubernetes API server is the message board where the lease's holder is announced.

Figure: operator leader election visualised as an on-call pager rotation. Three operator pods stand side by side; only the middle one holds the active pager, labelled "Lease: holderIdentity=me, leaseDurationSeconds: 15", and a standby pod is ready to pick it up if renewTime expires. The Lease object in coordination.k8s.io is exactly that pager.


Prerequisites

  • A working operator currently running with replicas: 1 (the default scaffolded by kubebuilder init / operator-sdk init).
  • Familiarity with the controller-runtime architecture — the Manager is what owns the election loop and what controllers hang off of.
  • An understanding of the operator's RBAC — leader election needs coordination.k8s.io/leases create/get/update permissions on the operator's namespace.
  • Optional: Prometheus metrics for visibility into the failover.

Why leader election matters

Three reasons running a single-replica operator in production is the wrong default:

1. Availability — a single replica is a single point of failure

With a single replica, one pod is the only thing reconciling your CRs. When that pod restarts (node drain, kernel oops, bad image rollout), the reconcile loop is silent for the duration. On a healthy cluster that is 10–30 seconds; on an unhealthy one (no available nodes, image pull failure, scheduler backoff) it can be minutes, long enough for SLOs to break and for drift to accumulate on every CR the operator manages.

2. Coordination — two replicas without leader election will fight

Running replicas: 2 without leader election is worse than running replicas: 1. Both controllers observe the same CRs, both try to write .status, both try to add and remove the same finalizers, both race to update the same .status.conditions entries, and the API server returns a steady stream of 409 Conflict errors. Even with Server-Side Apply keeping field ownership stable, two replicas applying identical state still rack up retries and obscure the real reconcile history. Leader election makes "one writer at a time" a structural guarantee.

3. Failover is bounded, not optional

With leader election enabled, failover is bounded by LeaseDuration + RetryPeriod: about 17 seconds on defaults, and configurable. Without it, failover is bounded by "however long it takes to schedule a fresh pod", which is unbounded under cluster pressure. The whole point is to convert an availability problem (waiting for a fresh pod) into a latency problem (waiting one lease duration).

Step 1: What the Lease object looks like

After leader election starts, the cluster has a Lease like this:

bash
kubectl get lease memcached-operator.cache.example.com -n memcached-operator-system -o yaml
yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: memcached-operator.cache.example.com
  namespace: memcached-operator-system
spec:
  holderIdentity: memcached-operator-controller-manager-7c9b-abc12_a1b2c3d4-...
  leaseDurationSeconds: 15
  acquireTime: "2026-06-01T07:42:11.123Z"
  renewTime:   "2026-06-01T07:55:33.456Z"
  leaseTransitions: 3

Five fields with concrete meaning:

| Field | Meaning |
| --- | --- |
| holderIdentity | The leader's stable name + a per-process UID. The pod name is in there if you wired the Downward API. |
| leaseDurationSeconds | The leader has to renew within this window or someone else takes over. |
| acquireTime | When this leader first acquired the lease. |
| renewTime | When the leader last renewed. The lease is "alive" until renewTime + leaseDurationSeconds. |
| leaseTransitions | How many times leadership has changed hands. A high number on a small cluster is suspicious: probably a flaky leader. |

The whole election protocol is implemented as atomic updates of this one object. Whoever can update the Lease (with the right resourceVersion) becomes the holder. Everyone else loses the race and retries.
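To make the "right resourceVersion" rule concrete, here is a minimal in-memory sketch of optimistic locking. The lease and store types are hypothetical stand-ins for the real Lease object and the API server, and the integer resourceVersion is a simplification; the point is only that when two replicas read the same version, exactly one write succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lease is a hypothetical, trimmed-down stand-in for the Lease spec.
type lease struct {
	Holder          string
	ResourceVersion int
}

var errConflict = errors.New("409 Conflict: resourceVersion is stale")

// store mimics the API server's optimistic locking for a single object.
type store struct {
	mu      sync.Mutex
	current lease
}

// Get returns a copy of the current lease, including its resourceVersion.
func (s *store) Get() lease {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// Update succeeds only if the caller read the latest resourceVersion.
func (s *store) Update(seen lease, newHolder string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seen.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	s.current = lease{Holder: newHolder, ResourceVersion: seen.ResourceVersion + 1}
	return nil
}

func main() {
	s := &store{}
	// Both replicas read the same lease before either of them writes.
	seenByA, seenByB := s.Get(), s.Get()

	fmt.Println("replica-a:", s.Update(seenByA, "replica-a")) // first write wins
	fmt.Println("replica-b:", s.Update(seenByB, "replica-b")) // stale version, rejected
	fmt.Println("holder:", s.Get().Holder)
}
```

The losing replica does not treat the conflict as a failure; it simply re-reads the Lease on its next retry and finds a live holder.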


Step 2: The election loop, conceptually

Inside controller-runtime, the Manager runs a small loop on every replica from the moment mgr.Start(ctx) is called:

text
loop forever:
    lease = Get(LeaseName)
    if lease does not exist:
        Create(lease, holder = me)        # I am now the leader (if the create wins)
    elif lease.renewTime + leaseDurationSeconds > now:
        # Lease is alive
        if lease.holder == me:
            UpdateRenewTime(lease, now)   # I keep being the leader
        else:
            pass                          # Stand by
    else:
        # Lease expired - claim it
        Update(lease, holder = me)        # I am now the leader (if the update wins)
    sleep(RetryPeriod)                    # a leader that cannot renew within RenewDeadline steps down

The same loop drawn as a state machine — every replica is always in exactly one of these states, and the only way between them is an atomic update to the Lease object in etcd:

Figure: leader election lifecycle as a state machine. A replica moves from Starting to Acquiring (racing to write the Lease), then to Leader (renewing renewTime every RetryPeriod and running controllers) or Standby (Lease alive but held by someone else, polling every RetryPeriod). When the Lease expires (renewTime + leaseDurationSeconds < now), standbys return to Acquiring. On SIGTERM the Leader takes a Releasing path that clears holderIdentity when LeaderElectionReleaseOnCancel is true, shortening failover from LeaseDuration (~15 s) to RetryPeriod (~2 s).

Every transition is a single atomic update to the Lease object — the API server's optimistic-locking is what guarantees there is at most one Leader at any instant.

Three things to internalise:

  • All "elections" are races to update an etcd object. Whoever wins the API server's optimistic-locking race becomes the leader.
  • Only the active leader runs controllers. Non-leaders run nothing useful — they just keep trying to acquire the lease.
  • Failover is bounded by LeaseDuration. When the leader vanishes, the longest a standby will wait is roughly LeaseDuration + RetryPeriod (default 17 s).

Step 3: The manager options

| Option | Type | Purpose |
| --- | --- | --- |
| LeaderElection | bool | Master switch: turns the whole thing on |
| LeaderElectionID | string | The Lease's name; must be unique per controller |
| LeaderElectionNamespace | string | Namespace the Lease lives in (defaults to the operator's own) |
| LeaderElectionResourceLock | string | Lock backend: leases (default). The older configmaps and endpoints locks have been removed from recent client-go releases |
| LeaderElectionReleaseOnCancel | bool | Whether the leader explicitly releases on shutdown |

LeaderElectionID

Treat it like a global identifier for this operator:

go
LeaderElectionID: "memcached-operator.cache.example.com"

Rules:

  • DNS-compatible (lowercase, dots and dashes OK, no underscores).
  • Stable across releases. Renaming creates a second Lease, so during the rollout an old replica can still hold the old Lease while a new replica acquires the new one, and both run controllers at the same time.
  • Unique within the Lease's namespace: two different operators that share a namespace must not share an ID. Defensive practice: include the API group as a suffix.

LeaderElectionNamespace

By default, the Lease is created in the same namespace as the operator's pod. Override if you need cross-namespace leadership — for example, an operator pod running in kube-system whose Lease should live somewhere the cluster admins watch.

LeaderElectionReleaseOnCancel: true

A subtle but important quality-of-life flag. When true:

  • On SIGTERM, the leader writes one last update that clears holderIdentity, so the lease is immediately up for grabs instead of having to expire.
  • The next replica picks up leadership in RetryPeriod (~2 s) instead of waiting for LeaseDuration (~15 s) to expire.

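This is what you want for fast rolling deploys. Without it, every rolling restart adds up to 15 s of "no leader" downtime, long enough that one missed reconcile per deploy is realistic on a busy cluster. One caveat from the controller-runtime documentation: the setting is only safe if the process exits promptly once the Manager stops, because anything still running after the lease is released can overlap with the new leader.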


Step 4: The three timing knobs

LeaseDuration, RenewDeadline, RetryPeriod. The defaults (15 s / 10 s / 2 s) are conservative and fine for almost everything. They correspond to the fields of client-go's leaderelection.LeaderElectionConfig, and controller-runtime exposes them on ctrl.Options as *time.Duration:

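go
import (
    "os"
    "time"

    "k8s.io/utils/ptr"
    ctrl "sigs.k8s.io/controller-runtime"
)

// cfg is the *rest.Config from ctrl.GetConfigOrDie(), as in the scaffolded main.go.
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    LeaderElection:                true,
    LeaderElectionID:              "memcached-operator.cache.example.com",
    LeaderElectionReleaseOnCancel: true,
    LeaseDuration:                 ptr.To(30 * time.Second), // override the 15 s default
    RenewDeadline:                 ptr.To(20 * time.Second), // override the 10 s default
    RetryPeriod:                   ptr.To(5 * time.Second),  // override the 2 s default
})
if err != nil {
    os.Exit(1) // log the error first, as the scaffolded main.go does
}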

The one invariant that must always hold:

RetryPeriod < RenewDeadline < LeaseDuration

Violating it produces split-brain windows or wedged leaders. The client-go leaderelection package validates this at startup and refuses to start with mis-ordered values — but it is still worth understanding why, because every choice on these three knobs is a trade-off between failover speed and tolerance for API jitter.

  • LeaseDuration: how long until the lock "expires" after the last renew. Shorter means faster failover, but it also forces a shorter RenewDeadline, so the leader tolerates less API server latency.
  • RenewDeadline — the leader gives up its own claim if it cannot renew within this time. Shorter than LeaseDuration guarantees that a slow/partitioned leader steps down before someone else takes over (preventing split-brain).
  • RetryPeriod — how often non-leaders retry, and how often the leader renews. Lower values mean faster failover but more API calls per replica.

When should you tune them?

  • Fast failover (5/3/1 s): only for operators that genuinely need sub-10-second failover, on extremely reliable clusters. The trade-off is more API calls per replica per minute and more sensitivity to API server hiccups.
  • Lazy failover (60/40/10 s) — for very stable workloads where rare API server outages should not cause leadership flips. Don't go past 60 s without a good reason.

Step 5: Stable lease holder identity via the Downward API

By default, controller-runtime derives the lease holder identity from os.Hostname() (which is the pod name in Kubernetes unless the pod overrides its hostname) plus a per-process UUID. That works, but the identity is only as readable as the hostname: with hostNetwork: true, for example, it is the node's name. Injecting the pod name through the Downward API gives you an explicit, kubectl-greppable identity:

yaml
spec:
  template:
    spec:
      containers:
      - name: manager
        env:
        - name: POD_NAME
          valueFrom: { fieldRef: { fieldPath: metadata.name } }
        - name: POD_NAMESPACE
          valueFrom: { fieldRef: { fieldPath: metadata.namespace } }

In main.go:

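go
// Prefer the pod name injected via the Downward API; fall back to the hostname.
hostname, err := os.Hostname()
if err != nil {
    hostname = "unknown" // placeholder fallback for this sketch
}
if pn := os.Getenv("POD_NAME"); pn != "" {
    hostname = pn // prefer Downward API
}
// Use this value as the name part of the identity of the lock you construct.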

The Lease's holderIdentity now reads memcached-operator-controller-manager-7c9b-abc12_a1b2c3... (pod name + uid) instead of the short hostname. Operators-of-operators debugging "which pod is leading?" get a direct kubectl-ready identifier — just awk -F_ off the suffix and you have the pod name to kubectl logs against.
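A small sketch of both directions, building the identity and recovering the pod name from it. The helper names and the hex suffix format are hypothetical (the suffix stands in for the per-process UUID), and wiring a custom identity into the Manager would, as far as I can tell, require supplying your own lock rather than setting a single option.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
	"strings"
)

// holderIdentity builds "<name>_<suffix>", preferring the POD_NAME value
// injected by the Downward API and falling back to the hostname.
func holderIdentity(podName, hostname, suffix string) string {
	name := hostname
	if podName != "" {
		name = podName
	}
	return name + "_" + suffix
}

// podFromIdentity is the Go equivalent of `awk -F_ '{print $1}'`.
// Pod names cannot contain underscores, so the first "_" is the separator.
func podFromIdentity(identity string) string {
	pod, _, _ := strings.Cut(identity, "_")
	return pod
}

// randomSuffix stands in for the per-process UUID; the format is illustrative.
func randomSuffix() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	hostname, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, "hostname:", err)
		os.Exit(1)
	}
	suffix, err := randomSuffix()
	if err != nil {
		fmt.Fprintln(os.Stderr, "suffix:", err)
		os.Exit(1)
	}
	id := holderIdentity(os.Getenv("POD_NAME"), hostname, suffix)
	fmt.Println("identity:", id)
	fmt.Println("pod:", podFromIdentity(id))
}
```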


Step 6: Watching the failover

Two terminals, run side by side.

Terminal 1 — watch the Lease:

bash
kubectl get lease memcached-operator.cache.example.com -n memcached-operator-system -w

You should see a row updated every ~2 s (the active leader's renew). Note the HOLDER column.

Terminal 2 — kill the leader (Linux / macOS shell — on Windows PowerShell, split holderIdentity on _ manually):

bash
LEADER=$(kubectl get lease memcached-operator.cache.example.com -n memcached-operator-system -o jsonpath='{.spec.holderIdentity}' | awk -F_ '{print $1}')
kubectl delete pod $LEADER -n memcached-operator-system

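Watch the Lease again. If the leader released the lease on SIGTERM (LeaderElectionReleaseOnCancel: true), the HOLDER column flips to a standby replica's name within a couple of seconds; otherwise expect up to ~17 s while the lease expires. Within another second the new leader starts logging "starting workers".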

Watching with kubectl events

A more readable alternative to parsing the lease object is to tail events scoped to the Lease — controller-runtime emits a LeaderElection event each time leadership changes hands:

bash
kubectl events -n memcached-operator-system \
  --for lease/memcached-operator.cache.example.com --watch
text
LAST SEEN   TYPE     REASON           OBJECT                                              MESSAGE
0s          Normal   LeaderElection   Lease/memcached-operator.cache.example.com          memcached-operator-controller-manager-7c9b-xyz12_a1b2... became leader

Production should also alert on leadership churn. The leaseTransitions field on the Lease itself is the counter to watch: a slow steady climb means a flaky leader, while a single bump per deploy is the expected pattern. The leader_election_master_status gauge exposed through controller-runtime metrics complements it by reporting which replica currently believes it is the leader.


Step 7: Standby behaviour

Standby replicas:

  • Still run the Manager's health/metrics HTTP server. Useful so they pass readiness probes during a deploy.
  • Still expose /metrics, /healthz, /readyz — Prometheus doesn't care which replica holds the lease.
  • Do not start controllers: Runnables that require leader election only start once the lease is acquired, so the controllers' watches are not established on standbys.
  • Are essentially "idle until needed".

A subtle but important implication: standbys do not warm their informer cache. On failover, the new leader's first reconciles are slightly slower while the cache catches up (typically 1–3 s on a healthy cluster), which is also why standbys should still report ready: marking them not-ready does not help here and only widens the no-traffic window. For tight SLOs, consider letting specific read-only Runnables opt out of leader election by implementing NeedLeaderElection() and returning false, so they warm in the background, but never do this for reconcilers that mutate state (it defeats the whole point of leader election).

If your operator uses Server-Side Apply for writes, leader election is doubly important: SSA tracks field ownership by fieldManager, and two replicas applying with the same fieldManager will still produce a fight-then-converge pattern rather than the clean single-writer story SSA is designed for. Pick one strategy — leader election with a single fieldManager, or sharding with per-shard managers — never both writers at once.


Step 8: When NOT to use leader election

A few cases where single-replica is the right call:

  • Stateless operators with very fast reconciles. With a 100 ms reconcile and 10 CRs, catching up after a restart is near-instant, and a ~17 s lease failover is about as long as starting a fresh pod from scratch. Two replicas with no leader election (accepting split-brain risk) is not acceptable, but one replica with restartPolicy: Always and a fast startup is fine.
  • Dev / lab environments. Leader election just complicates make run. Disable in dev (--leader-elect=false), enable in staging/prod.
  • Operators with intentional sharding. If you partition CRs by label and run N replicas, each handling its slice, leader election is unnecessary — they aren't competing for the same CRs. See operator multi-tenancy patterns.

For everything else — managed workloads, lifecycle CRs, anything where state coherence matters — leader election is non-optional in production.


Common pitfalls

1. Two operators sharing the same LeaderElectionID

Symptom: both operators silently fight over the lease, neither makes consistent progress. Fix: include the API group in the ID (memcached-operator.cache.example.com).

2. RBAC missing for leases

Symptom: leader election immediately fails with leases.coordination.k8s.io "..." is forbidden. Fix: ensure the Role/RoleBinding for the operator's ServiceAccount includes lease permissions.

3. Forgetting LeaderElectionReleaseOnCancel

Every rolling restart causes ~15 s of "no leader". The next replica takes over after LeaseDuration expires. Fix: set LeaderElectionReleaseOnCancel: true so leaders release the lease explicitly on SIGTERM.

4. Aggressive timings on flaky clusters

You set LeaseDuration: 3 * time.Second. Every API server hiccup flips leadership. The reconcile metric is now a sawtooth. Fix: revert to defaults; if you need fast failover, fix the API server flakiness first.

5. Two-replica deploy without leader election

Symptom: both reconcilers write to the same CR, conflicts everywhere, finalizers fight, operator logs full of Conflict errors. Fix: either set replicas: 1, or enable leader election. Never two of the same controller without coordination.

Pitfall cheat sheet

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Two operators in the same namespace silently fight, neither makes progress | Pitfall 1: both use the same LeaderElectionID | Suffix the ID with the API group (memcached-operator.cache.example.com) |
| Operator crashes at startup with leases.coordination.k8s.io "..." is forbidden | Pitfall 2: RBAC for coordination.k8s.io/leases missing | Apply the leader_election_role.yaml from kubebuilder scaffold |
| Every rolling deploy has ~15 s of "no leader" | Pitfall 3: LeaderElectionReleaseOnCancel: false (or unset) | Enable LeaderElectionReleaseOnCancel: true |
| Leadership flips every few minutes on a healthy cluster | Pitfall 4: LeaseDuration set aggressively (<5 s) on a non-perfect API server | Revert to defaults (15/10/2 s); fix API server latency before tuning |
| Operator log full of 409 Conflict errors, finalizers added and removed in a loop | Pitfall 5: two replicas, no leader election | Enable leader election, or scale to replicas: 1 |
| leaseTransitions counter grows steadily even with no deploys | Slow/partitioned leader can't meet RenewDeadline | Check the leader's CPU throttling and API latency before tuning timings |

Frequently Asked Questions

1. Why does an operator need leader election?

If two replicas of the same controller run at once they both observe the same CRs and both try to reconcile them - duplicate reconciles, racing API writes, conflicting finalizers, possible split-brain on state machines. Leader election ensures exactly one replica is the active leader at any time; the others stand by and take over when the leader fails.

2. What is a Lease in Kubernetes?

A coordination.k8s.io/v1.Lease is a tiny lightweight object whose only purpose is to record "who holds this lock right now, and when does the lock expire". Leader election uses one Lease per controller-manager: each replica fights to write itself as the holder; whoever wins becomes the leader.

3. How do I enable leader election in controller-runtime?

ctrl.Options.LeaderElection: true plus a unique LeaderElectionID (a DNS-compatible name like memcached-operator.cache.example.com). The Manager creates the Lease in the operator's namespace and runs the election loop automatically. No extra code needed in the reconciler.

4. What are LeaseDuration, RenewDeadline, RetryPeriod?

Three timing knobs. LeaseDuration (default 15s) is how long the lock stays valid after the last successful renew - i.e. how long until a dead leader is forgotten. RenewDeadline (default 10s) is how long the leader has to renew before it gives up its own claim. RetryPeriod (default 2s) is how often non-leaders try to acquire the lock. Defaults are fine; tune only if you understand the failover-vs-stability trade-off.

5. What lock backend types are available?

In current releases, effectively one: leases, which is the default in controller-runtime and purpose-built for this job (it uses the coordination.k8s.io API and generates less API server noise). The older configmaps and endpoints locks are legacy and have been removed from recent client-go versions, so expect them to work only with old dependency pins. Always use leases.

6. Can the standby replicas do anything?

By default no: they run the manager but no controllers are started. Some manager setups (mostly cluster-api) run "always-on" work even on non-leaders by having specific Runnables implement NeedLeaderElection() and return false.

7. How do I get a stable identity for the lease holder?

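Inject the pod name via the Downward API: an env entry named POD_NAME with valueFrom.fieldRef.fieldPath set to metadata.name. In Go, controller-runtime defaults to hostname + UUID, and inside a pod the hostname is normally the pod name already. controller-runtime does not appear to expose a dedicated identity option, so setting it explicitly means building your own lock and passing it via Options.LeaderElectionResourceLockInterface (this is mostly cosmetic: the logs read better with a real pod name).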

8. How do I test leader-election failover?

Deploy 2-3 operator replicas, watch the Lease holder with kubectl get lease <id> -w. Kill the leader with kubectl delete pod <leader-name>. Within LeaseDuration + RetryPeriod (~17 s with defaults) one of the standbys becomes the new leader. Test that no duplicate reconciles happen by checking your reconcile metrics.

9. Why does my non-leader replica still use CPU and memory?

Standbys are not free. Each runs the controller-runtime Manager (the HTTP server for /metrics, /healthz, /readyz), the leader-election loop (a Lease Get + sleep every RetryPeriod), and any Runnables you registered that opt out of leader election (NeedLeaderElection() returning false). They do not run controllers and do not warm the informer cache, so memory is small (typically 30-80 Mi) and CPU is tiny (1-5 m per replica), but it is non-zero. Budget for it in your Deployment requests and limits.

10. Do I still need leader election if I use sharding?

No, but only if your shards never overlap. If you partition CRs by label (e.g. one replica per tenant, or one replica per region) and the partitioning is strict, each replica owns its slice and there is nothing to fight over. The moment two replicas could observe the same CR - even briefly, during a relabel or a controller restart - you need leader election back. The safer pattern is sharding + per-shard leader election: each shard gets its own Lease with a shard-specific LeaderElectionID. See operator multi-tenancy patterns for the full sharding playbook.
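A sketch of the "sharding + per-shard leader election" idea: every CR maps to exactly one shard, and every shard gets its own lease name. The hash-based assignment below is an assumption standing in for the label-based partitioning described above, and the "shard-<n>." prefix is an illustrative naming choice rather than an established convention.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a CR key ("namespace/name") to one of n shards with a
// stable hash, so every replica computes the same owner for the same CR.
func shardFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32() % n
}

// leaseIDFor derives a per-shard LeaderElectionID from a base ID.
func leaseIDFor(baseID string, shard uint32) string {
	return fmt.Sprintf("shard-%d.%s", shard, baseID)
}

func main() {
	const shards = 3
	base := "memcached-operator.cache.example.com"
	for _, key := range []string{"team-a/cache-1", "team-b/cache-1", "team-c/cache-9"} {
		s := shardFor(key, shards)
		fmt.Printf("%-16s -> shard %d, lease %s\n", key, s, leaseIDFor(base, s))
	}
}
```

Each shard's Deployment would then run two or more replicas with that shard's ID, so a lost pod fails over within the shard instead of leaving its slice unreconciled.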

Summary

Leader election is the standard pattern for high-availability operators: enable LeaderElection: true with a unique LeaderElectionID, use the leases backend, set LeaderElectionReleaseOnCancel: true for fast rolling deploys, deploy with replicas: 2 or 3, and accept the roughly 17-second worst-case failover window (LeaseDuration + RetryPeriod). The defaults work; tune the timings only when you know exactly which trade-off you are buying.

For most operators, leader election is the difference between "the operator goes down whenever a node reboots" and "the operator keeps reconciling because there's always another replica ready."



Deepak Prasad

R&D Engineer
