Debugging Kubernetes Operators: kubectl, Logs, and Webhook Failures

Tech reviewed: Deepak Prasad

When an Operator misbehaves, you rarely need to attach a debugger on the first day. Most production incidents are visible from kubectl, Events, and manager logs—plus a short checklist for webhooks, CRD conversion, and Server-Side Apply (SSA). This guide walks you through that path in a deliberate order so you can narrow the blast radius before you open an IDE.

You can lean on the kubectl, Events, webhook, and TLS sections even if you have not read the deeper reconcile articles yet. If you are still building intuition for how often Reconcile runs or what RequeueAfter means in logs, the companion article "Requeue, RequeueAfter, and error handling in controller-runtime" is useful background, but not a hard prerequisite. The examples below assume you operate a manager built with controller-runtime or Operator SDK (or anything that behaves similarly).

If the operator is installed through Operator Lifecycle Manager (OLM), check Subscription, ClusterServiceVersion, and InstallPlan status alongside the manager Deployment: install or upgrade failures there often masquerade as “the controller does nothing” long before CR webhooks matter.

[Figure: Kubernetes Operator debugging workflow, showing how an engineer moves from a failing Custom Resource through API validation, manager logs, webhooks, RBAC, workloads, and reconciliation to identify the root cause.]


Debug flow quickstart

Later sections spell out admission webhooks, CRD conversion, SSA managedFields, and a symptom → command playbook you can skim during an incident.

First signals

  • If kubectl apply fails, inspect CRD validation, admission webhooks, and API errors before looking at controller code.
  • If the CR is accepted but never becomes Ready, inspect Conditions, Events, operator logs, and generated workloads.
  • If the operator Pod crashes or never becomes Ready, inspect startup flags, leader election, webhook certificates, and health probes.
  • If children exist but continuously drift, inspect Server-Side Apply field managers and ownership boundaries.
  • Always correlate a single CR name and namespace across Events, logs, and child resources before increasing log verbosity.

Classify the failure (API vs manager vs operand)

  1. API layer — create/update/delete rejected (webhook, CRD schema, SSA conflict).
  2. Manager layer — Pod crash loop, leader loss, reconcile panics, RBAC forbidden.
  3. Operand layer — Deployment unhealthy, PVC pending, image pull errors.

Work top-down: if kubectl apply of the CR never succeeds, fixing the Deployment behind the operator will not help until the API accepts the object.
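The top-down order above can be written down as a tiny decision helper. This is only a sketch: triage_layer is a hypothetical function name, and its three yes/no arguments are answers you collect by hand from kubectl output, not something the cluster reports in this form.

```shell
# Sketch: report the first failing layer, checked top-down.
# Hypothetical interface; each argument is "yes" or "no":
#   $1 = did the API server accept the CR?
#   $2 = is the manager Pod Ready and reconciling?
#   $3 = are the child workloads healthy?
triage_layer() {
  if [ "$1" != "yes" ]; then
    echo "api"
  elif [ "$2" != "yes" ]; then
    echo "manager"
  elif [ "$3" != "yes" ]; then
    echo "operand"
  else
    echo "none"
  fi
}
```

For example, triage_layer no no no prints api: even though every layer looks broken, the API rejection is the one to fix first.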

Minimum facts to collect

Before you grep the codebase, capture:

  • Namespace and exact CustomResource name.
  • Operator Deployment name and ServiceAccount.
  • Whether the failure happens on create, update, delete, or only when status changes.
  • The timestamp of the first failure (helps correlate logs and Events).

Five commands to run first

Substitute your CR kind, namespaces, names, and operator Deployment:

bash
kubectl describe <cr-kind> <cr-name> -n <workload-ns>
kubectl get events -n <workload-ns> --sort-by=.metadata.creationTimestamp | tail -n 40
kubectl logs deploy/<operator-deployment> -n <operator-ns> --since=30m | tail -n 80
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get endpoints -n <operator-ns>

Then narrow: add --field-selector involvedObject.name=<cr-name> on Events, open the relevant ValidatingWebhookConfiguration / MutatingWebhookConfiguration, or kubectl get crd <plural>.<group> -o yaml when errors mention conversion or OpenAPI validation.


Manager process: flags, log verbosity, and correlating logs

Common logging flags and environment variables

controller-runtime projects usually log through zap, controlled with flags such as --zap-log-level (and related --zap-* options) or LOG_LEVEL / ZAP_* environment variables when the chart wires them that way. Some binaries or sidecars still use klog, where -v controls verbosity. They are different stacks—check your Deployment args: and env to see which applies. Raise verbosity only for a short debugging window: higher levels show cache resyncs, webhook registration, and reconcile entry, but they flood production if left enabled.
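To see which logging stack a manager is likely using, you can inspect the container args. The helper below is a rough sketch: detect_log_stack is a hypothetical name, and it assumes that a --zap-* flag means zap and a -v flag means klog, which will not hold for every binary or custom wrapper.

```shell
# Sketch: guess the logging stack from a space-separated args string.
# One way to obtain the args (placeholders are yours to fill in):
#   kubectl get deploy <operator-deployment> -n <operator-ns> \
#     -o jsonpath='{.spec.template.spec.containers[*].args[*]}'
detect_log_stack() {
  # Pad with spaces so "-v" is only matched as a whole flag.
  case " $1 " in
    *--zap-*) echo "zap" ;;
    *" -v="*|*" --v="*|*" -v "*) echo "klog" ;;
    *) echo "unknown" ;;
  esac
}
```

An unknown result usually means the level is set through an environment variable or a config file, so check the Deployment env next.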

Correlating namespace / name with reconcile lines

Structured logs should include the reconcile key. When you read logs, filter by the CR’s namespace and name:

bash
kubectl logs deploy/<operator-deployment> -n <operator-ns> --since=30m | grep -iE 'mynamespace|my-cr-name'

On Windows PowerShell you can use Select-String instead of grep. If your operator logs JSON, pipe through jq and select the fields your team standardized (controller, object, namespace, name).
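An alternation such as 'mynamespace|my-cr-name' matches lines that contain either value, which can still be noisy. A minimal sketch of a stricter filter follows; reconcile_lines is a hypothetical helper, and it assumes the namespace and the name appear on the same log line as plain substrings.

```shell
# Sketch: keep only log lines that mention BOTH the namespace and the CR name.
# $1 = namespace, $2 = CR name; log lines arrive on stdin.
# grep -F treats the values as fixed strings, so dots and dashes are literal.
reconcile_lines() {
  grep -F -- "$1" | grep -F -- "$2"
}
```

Typical use would be kubectl logs deploy/<operator-deployment> -n <operator-ns> --since=30m | reconcile_lines <workload-ns> <cr-name>. Because this is substring matching, a short name such as db also matches cart-db; tighten the pattern if your names overlap.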

Leader election, metrics, and webhook listen addresses

At startup, confirm the manager bound the metrics port, health probes, and webhook server (if in-process). A log line that says webhooks are starting but the Pod never becomes Ready often points to certificate or port conflicts—cross-check with health and readiness probes and the webhook section later in this guide.


Live cluster inspection with kubectl

kubectl describe on the Custom Resource

Start with kubectl describe on your CR. It shows Conditions, Events, finalizers, and the current reconciliation state exposed by the operator.

Pay attention to the relationship between the desired and observed state:

  • .metadata.generation increases when users change spec.
  • .status.observedGeneration tells you which generation the controller has processed.

If generation is ahead of observedGeneration, the controller may not be reconciling successfully, may be stuck, or may not have observed the latest update.

Not every CRD publishes status.observedGeneration; if the field is absent, rely on Conditions, Events, and manager logs keyed on the same namespace/name instead of treating missing observedGeneration as proof that nothing ran.
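The generation comparison is simple enough to script. The sketch below assumes you have already read the two numbers from the CR; generation_lag is a hypothetical helper name, and an empty second argument stands for a CRD that does not publish status.observedGeneration.

```shell
# Sketch: compare .metadata.generation with .status.observedGeneration.
# One way to read the values (an unset field prints as an empty string):
#   kubectl get <cr-kind> <cr-name> -n <workload-ns> -o jsonpath='{.metadata.generation}'
#   kubectl get <cr-kind> <cr-name> -n <workload-ns> -o jsonpath='{.status.observedGeneration}'
generation_lag() {
  # $1 = generation, $2 = observedGeneration (may be empty)
  if [ -z "$2" ]; then
    echo "unknown: status.observedGeneration not published"
  elif [ "$2" -ge "$1" ]; then
    echo "caught up at generation $1"
  else
    echo "behind by $(( $1 - $2 ))"
  fi
}
```

A result of behind by N that persists across several checks is the signal worth chasing; a brief lag right after an edit is normal.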

If Conditions are empty but you expect status updates, the controller may not have RBAC permissions on the /status subresource, or status patches may be failing. Check manager logs for is forbidden, conflict, or patch errors.

kubectl get events with useful filters

Events often contain the first visible symptom of an admission or reconcile failure, but they are retained only for a limited time (one hour by default, controlled by the kube-apiserver --event-ttl flag) and can be dropped or aggregated quickly on busy clusters. Capture them early during an incident.

List recent Events in a namespace:

bash
kubectl get events -n <workload-ns> --sort-by=.metadata.creationTimestamp

To focus on a specific object:

bash
kubectl get events -n <workload-ns> --field-selector involvedObject.name=<cr-name>

For operator-wide failures, also inspect the operator namespace because the controller itself, leader election, webhook startup, or other framework components may emit Events there.

Follow the operand path: Deployment, Pod, Service

A CR stuck in Provisioning, Pending, or another transitional phase often means the operator created the child resources but they failed to become healthy.

Inspect the generated Deployments, Pods, Services, PVCs, or Jobs. Common failures such as image pull errors, volume mount failures, scheduling problems, or quota exhaustion usually appear there—not inside Reconcile logic.

Stuck in Terminating: finalizers and cleanup

Objects remain in Terminating when a finalizer has not been removed or when cleanup logic never completes.

Inspect the deletion state:

bash
kubectl get <cr-kind> <name> -n <ns> -o yaml | grep -iE 'deletionTimestamp|finalizers'

A deletionTimestamp with a remaining finalizer means Kubernetes is waiting for the controller to complete its cleanup path.

The companion article "finalizers explained" covers two-phase deletion in more detail. Removing a finalizer manually is a break-glass action: only do it when you understand what external resources or cleanup steps you are skipping.
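The deletionTimestamp and finalizers check can be turned into a small interpretation helper. This is a sketch under assumptions: deletion_state is a hypothetical name, and the finalizer shown in the usage note is made up for illustration.

```shell
# Sketch: interpret the two fields that decide whether an object is stuck.
# One way to read them (placeholders are yours to fill in):
#   kubectl get <cr-kind> <name> -n <ns> -o jsonpath='{.metadata.deletionTimestamp}'
#   kubectl get <cr-kind> <name> -n <ns> -o jsonpath='{.metadata.finalizers[*]}'
deletion_state() {
  # $1 = deletionTimestamp (empty if not deleting)
  # $2 = space-separated finalizers (empty if none)
  if [ -z "$1" ]; then
    echo "not being deleted"
  elif [ -n "$2" ]; then
    echo "waiting on finalizers: $2"
  else
    echo "deletion in progress, no finalizers left"
  fi
}
```

With a hypothetical finalizer, deletion_state 2024-05-01T10:00:00Z example.com/cleanup prints waiting on finalizers: example.com/cleanup, which tells you whose cleanup path to look for in the manager logs.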


Admission webhooks: validating and mutating failures

How the API server reaches your webhook

For each CREATE / UPDATE / DELETE, the Kubernetes API server may call your validating or mutating webhook over HTTPS. The chain is: ValidatingWebhookConfiguration or MutatingWebhookConfiguration → clientConfig.service → Service → Endpoints → Pod listening with a certificate the API server trusts.

failurePolicy: Fail vs Ignore

  • Fail — webhook errors or timeouts reject the request; users see admission errors immediately.
  • Ignore — webhook failures are skipped; the request proceeds. That can hide misconfiguration until later—use sparingly outside development.

When debugging, check which policy is set before anything else: it tells you whether a 500 from the webhook blocks the request (Fail) or is skipped silently (Ignore).
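One detail that is easy to miss: failurePolicy only applies when the webhook call itself fails (timeout, TLS error, HTTP 5xx). An explicit deny from a healthy webhook always rejects the request. The sketch below encodes that rule; admission_outcome is a hypothetical helper, and the three call results are simplified labels rather than API values.

```shell
# Sketch: what happens to a request for a given failurePolicy and call result.
# $1 = failurePolicy ("Fail" or "Ignore")
# $2 = simplified call result: "allowed", "denied", or "error"
#      ("error" covers timeouts, TLS failures, and HTTP 5xx responses)
admission_outcome() {
  case "$2" in
    allowed) echo "request proceeds" ;;
    denied)  echo "request rejected" ;;
    error)
      if [ "$1" = "Ignore" ]; then
        echo "request proceeds (webhook skipped)"
      else
        echo "request rejected"
      fi
      ;;
    *) echo "unknown call result: $2" ;;
  esac
}
```

To list the policy for each webhook in a configuration, something like kubectl get validatingwebhookconfiguration <name> -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}' should work.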

caBundle, Service DNS, and TLS trust

The API server validates the webhook server certificate using caBundle on the webhook client config (or an equivalent reference mechanism). Common failures:

  • Empty caBundle after install.
  • Wrong CA: the certificate was rotated without updating the bundle.
  • DNS SAN mismatch: the API server connects to <service>.<namespace>.svc, so a certificate issued for any other name (for example, a different Service or namespace) fails verification.

If you use cert-manager, verify the Certificate is Ready=True and that your operator’s install step copied the CA into the webhook configuration. The walkthrough in mutating and validating admission webhooks ties these pieces together.
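When you suspect a SAN mismatch, it helps to write down the names the certificate should carry before reading the certificate itself. A minimal sketch, assuming the default cluster domain cluster.local; expected_webhook_sans is a hypothetical helper, and the Service and namespace names in the usage note are examples only.

```shell
# Sketch: print the DNS names a webhook serving certificate is expected to cover.
# $1 = Service name, $2 = Service namespace
# $3 = cluster domain (optional; cluster.local is an assumed default)
expected_webhook_sans() {
  domain="${3:-cluster.local}"
  printf '%s\n' "$1.$2.svc" "$1.$2.svc.$domain"
}

# To compare against the live certificate, one option is:
#   kubectl get secret <tls-secret> -n <operator-ns> -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | openssl x509 -noout -text | grep 'DNS:'
```

For instance, expected_webhook_sans webhook-service operator-system prints webhook-service.operator-system.svc and webhook-service.operator-system.svc.cluster.local. The first form is the one the API server uses when clientConfig.service is set.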

cert-manager resources to inspect

bash
kubectl get certificate -n <operator-ns>
kubectl describe clusterissuer <issuer-name>

Look for failed challenges, pending orders, or secrets that never populate tls.crt / tls.key.

Typical API server error strings

You may see connection refused, x509: certificate signed by unknown authority, no endpoints available for service, or timeout. Each maps to a different layer: network reachability, TLS trust, missing Endpoints, or a slow webhook handler, in that order.
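The mapping from error string to layer can be sketched as a case statement. classify_webhook_error is a hypothetical helper and the layer labels are invented for this example. Note that the order of the patterns matters: admission errors usually embed the webhook URL, which can itself contain timeout=10s, so the generic timeout pattern is checked last.

```shell
# Sketch: map an API server admission error message to the layer to inspect.
# $1 = the error text, for example copied from kubectl apply output.
classify_webhook_error() {
  case "$1" in
    *"x509:"*)                  echo "tls-trust" ;;
    *"connection refused"*)     echo "network" ;;
    *"no endpoints available"*) echo "endpoints" ;;
    # Checked last because the URL in other errors may contain "timeout=".
    *"context deadline exceeded"*|*"timeout"*|*"timed out"*) echo "slow-handler" ;;
    *) echo "unknown" ;;
  esac
}
```

A tls-trust result points at caBundle and SANs, network and endpoints point at the Service and Pod readiness, and slow-handler points at the webhook code itself.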


Conversion webhooks after CRD upgrades

Symptoms when versions disagree

After introducing a new served API version or changing the storage version, clients may see conversion failures, missing fields after round-tripping, or kubectl get / kubectl apply errors mentioning the conversion webhook.

Remember that conversion is not only an upgrade-time operation. The API server may invoke the conversion webhook whenever an object stored in one version must be presented through another served version.

These failures occur before your controller sees the object, so debugging Reconcile logs usually does not help.

Inspect spec.conversion, versions, and caBundle

Start by inspecting the CRD:

bash
kubectl get crd <plural>.<group> -o yaml

Verify:

  • spec.conversion.strategy: Webhook is configured when you expect conversion.
  • The webhook clientConfig.service points to the correct namespace, Service name, and port.
  • caBundle matches the CA that signed the conversion webhook certificate.
  • The expected versions appear under spec.versions with the correct served and storage flags.

A common mistake is updating the CRD YAML but leaving the conversion Service or certificate configuration from the previous deployment.

The full rollout and migration pattern is covered in CRD version upgrades and conversion webhooks.

Ordering mistakes and rollout sequence

A CRD that references a conversion webhook before the webhook Pod, Service, and TLS material are ready can cause cluster-wide failures for that CRD.

A safer rollout sequence is:

  1. Deploy the webhook Service and Pod.
  2. Verify the serving certificate and caBundle.
  3. Apply the CRD changes that enable conversion.
  4. Allow users and controllers to access the new API versions.
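Steps 1 and 2 of the sequence above can be gated with a small preflight before step 3. The helper below is a sketch, not a complete check: conversion_preflight is a hypothetical name, and it only confirms that the Service has ready endpoints and that a caBundle is present, not that the certificate is valid.

```shell
# Sketch: refuse to proceed unless the webhook is reachable and a CA is set.
# One way to collect the inputs (placeholders are yours to fill in):
#   kubectl get endpoints <webhook-service> -n <operator-ns> \
#     -o jsonpath='{.subsets[*].addresses[*].ip}'
#   kubectl get crd <plural>.<group> \
#     -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}'
conversion_preflight() {
  # $1 = ready endpoint IPs (may be empty), $2 = caBundle value (may be empty)
  preflight_status=0
  if [ -z "$1" ]; then
    echo "blocked: webhook Service has no ready endpoints"
    preflight_status=1
  fi
  if [ -z "$2" ]; then
    echo "blocked: caBundle is empty"
    preflight_status=1
  fi
  if [ "$preflight_status" -eq 0 ]; then
    echo "ok: safe to apply the CRD conversion change"
  fi
  return "$preflight_status"
}
```

The non-zero return code lets a rollout script stop before applying the CRD, for example conversion_preflight "$ips" "$ca" && kubectl apply -f <crd-file>.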

When the webhook is reachable but conversion is wrong

Not all conversion failures are TLS or networking problems. The webhook may respond successfully but produce invalid objects.

Check for:

  • Fields silently dropped during version translation.
  • Missing default values after conversion.
  • Conversion code that does not preserve unknown or optional fields.
  • Validation failures after converting between versions.

Test round-trip conversions in CI (v1alpha1 → v1beta1 → v1alpha1) to ensure no user data is lost.
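Round-trip tests normally live next to the conversion code, but a coarse check is possible from the shell too. The sketch below assumes you have already flattened the object before and after the round trip into path=value lines with a tool of your choice; lost_after_roundtrip is a hypothetical helper, and the field names in the usage note are made up.

```shell
# Sketch: list fields present in the original object but missing after a round trip.
# $1 = file of "path=value" lines from the original object
# $2 = file of "path=value" lines after v1alpha1 -> v1beta1 -> v1alpha1
lost_after_roundtrip() {
  before_sorted=$(mktemp) || return 1
  after_sorted=$(mktemp) || return 1
  sort "$1" > "$before_sorted"
  sort "$2" > "$after_sorted"
  # comm -23 prints lines that appear only in the first (sorted) file.
  comm -23 "$before_sorted" "$after_sorted"
  rm -f "$before_sorted" "$after_sorted"
}
```

If the original contains spec.tier=gold and the round-tripped copy does not, the helper prints that line; empty output means nothing was lost or changed. A changed value also shows up, because the old path=value line no longer matches.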


Server-Side Apply and field-manager conflicts

managedFields and conflicting managers

With Server-Side Apply (SSA), multiple actors can own different parts of the same object. A conflict happens when one manager tries to apply a field that another manager currently owns.

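Inspect the live object. kubectl 1.21 and later hide metadata.managedFields from -o yaml and -o json output unless you pass --show-managed-fields:

bash
kubectl get deploy <child> -n <ns> -o yaml --show-managed-fields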

Review metadata.managedFields and identify:

  • Which managers have modified the object.
  • Which operation (Apply vs Update) they used.
  • Which fields each manager currently owns.

Your operator should use a stable field manager name so ownership remains predictable across upgrades and restarts.

Do not immediately solve conflicts with --force-conflicts. That transfers field ownership and can silently overwrite changes made by another controller or user.
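Raw managedFields output is verbose, so a per-manager summary is often easier to read first. The sketch below groups operations by manager; summarize_managers is a hypothetical helper, and the manager names in the usage note are examples only.

```shell
# Sketch: summarize which managers touched an object and how.
# Input on stdin: one "<manager><TAB><operation>" pair per line, for example from
#   kubectl get deploy <child> -n <ns> \
#     -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\n"}{end}'
summarize_managers() {
  awk -F '\t' '
    { ops[$1] = (ops[$1] == "" ? $2 : ops[$1] "," $2) }
    END { for (m in ops) print m ": " ops[m] }
  ' | sort
}
```

A line such as my-operator: Apply,Update suggests the same manager name is used for both SSA and plain updates, which is worth checking against your ownership boundaries before reaching for --force-conflicts.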

SSA conflicts vs optimistic concurrency

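SSA conflicts and optimistic-concurrency conflicts both surface as HTTP 409 Conflict, but they have different causes and different fixes:

  • SSA conflict: another field manager owns the field you are trying to apply. The message starts with "Apply failed with".
  • Optimistic-concurrency conflict on Update: the object changed since you read it, so your resourceVersion is stale. Re-read the object and retry.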

Mixing Update, patch operations, and SSA is possible, but define clear ownership boundaries. A common pattern is using SSA for child resources owned by the operator while avoiding full object Update calls that accidentally overwrite fields managed by others.
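When a conflict shows up in manager logs, the message text tells the two kinds apart. A minimal sketch follows; classify_conflict is a hypothetical helper and its labels are invented, while the matched substrings are the ones the API server normally includes in these errors.

```shell
# Sketch: tell an SSA ownership conflict from a stale-resourceVersion conflict.
# $1 = the error message from logs or kubectl output.
classify_conflict() {
  case "$1" in
    *"Apply failed with"*"conflict"*) echo "ssa-ownership" ;;
    *"the object has been modified"*) echo "stale-resourceVersion" ;;
    *) echo "unknown" ;;
  esac
}
```

An ssa-ownership result calls for a look at managedFields, while stale-resourceVersion is usually resolved by re-reading the object and retrying the write.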

Server-Side Apply in operators covers strategies to avoid ownership thrashing.

Status vs spec subresources

spec and status represent different ownership domains. Users typically own spec, while controllers own status.

If failures only happen during status updates:

  • Verify the controller uses client.Status().Update() or client.Status().Patch() rather than modifying the full object.
  • Confirm RBAC includes permissions on the /status subresource.
  • Inspect managedFields to confirm status ownership is not unexpectedly shared.

Separate status updates also reduce accidental spec changes and avoid unnecessary reconciliation loops.


The operator playbook: symptom → cause → command

Use these tables as a first-pass checklist during incidents. They are not exhaustive, but they reflect the failures platform teams commonly see in production.

Manager not ready or crash looping

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| Pod CrashLoopBackOff immediately after install | Bad flags, missing webhook certificates, panic during startup | kubectl logs -n <op-ns> deploy/<op> --previous |
| Readiness never passes | Health probe failure, webhook server not listening, certificate problems | kubectl describe pod -n <op-ns> -l control-plane=controller-manager (replace the label selector with the labels your Helm/OLM chart uses; Kubebuilder defaults often use control-plane=controller-manager) |
| Leader election repeatedly changes | Lease RBAC, network partitions, unstable Pods | kubectl get lease -n <op-ns> and inspect manager logs around leader transitions |

CR accepted but no reconciliation happens

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| kubectl apply succeeds but status never changes | Controller not watching this GVK, namespace watch mismatch, operator version mismatch | Check manager startup logs for watched resources; verify the CR namespace matches WATCH_NAMESPACE, OWN_NAMESPACE, or the cache scope your chart documents (namespace-scoped operators silently ignore CRs outside the watched set) |
| generation increases but status.observedGeneration never catches up (when your CR exposes it) | Reconcile never runs or continuously fails before updating status | Inspect logs for the CR namespace/name and verify Events |

Reconcile errors and RBAC

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| is forbidden in logs | Missing ClusterRole permissions | kubectl auth can-i create deployment --as=system:serviceaccount:<ns>:<sa> -n <workload-ns> (swap verb/resource for the failing call) |
| Status never updates | Missing /status RBAC, status patch failure, conflict | Inspect CR Events and manager logs for Forbidden, Conflict, or patch errors; confirm status RBAC with kubectl auth can-i update <plural>.<group> --subresource=status --as=system:serviceaccount:<ns>:<sa> -n <workload-ns> |
| Same object reconciles repeatedly without spec changes | Status churn, unnecessary writes, noisy child watches | Compare generation vs resourceVersion, inspect predicates and managedFields |

API throttling and performance pressure

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| Intermittent 429 Too Many Requests | Excessive API traffic, high concurrency, or APF limits | Inspect request rate, workqueue depth, and reduce unnecessary reads/writes before increasing QPS or APF shares |

Webhook timeouts and TLS

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| x509 or caBundle errors | Certificate trust chain mismatch | Inspect ValidatingWebhookConfiguration or MutatingWebhookConfiguration, verify caBundle, Service DNS, and Certificate status |
| connection refused / no endpoints available | Service selector mismatch or webhook Pod not Ready | kubectl get endpoints -n <op-ns> <webhook-service> and inspect the Service selector |
| Timeouts under load | Slow webhook handler, API calls inside admission, CPU starvation | Profile the handler and reduce external dependencies before only increasing timeout |

CRD conversion and API version issues

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| kubectl get or apply fails with conversion errors | Conversion webhook unavailable or wrong caBundle | kubectl get crd <plural>.<group> -o yaml and inspect spec.conversion, versions, and webhook configuration |
| Fields disappear after upgrade or round-trip conversion | Conversion code does not preserve object data | Test version round-trips (v1alpha1 → v1beta1 → v1alpha1) and compare the result |

Server-Side Apply and ownership

| Symptom | Likely cause | What to run |
| --- | --- | --- |
| SSA conflicts | Two field managers own the same field | Inspect metadata.managedFields; avoid immediately using force apply |
| Unexpected field removal | Full Update or competing controllers overwrite fields | Check field managers, patch strategy, and ownership boundaries |

Systematic log strategies (without drowning in noise)

Log levels in development vs production

See Common logging flags and environment variables above for zap (--zap-log-level) versus klog (-v).

In development, higher log verbosity helps you understand cache synchronization, watches, webhook registration, and reconcile flow. In production, prefer structured info-level logs with errors always visible, and increase verbosity only for short, targeted debugging windows.

The exact flags vary by logging implementation (zap, klog, or custom wrappers), so document the supported log-level controls in your operator deployment manifests.

Structured fields that make debugging easier

Every reconcile should leave a trace that lets you follow one object through the logs. Include fields such as:

  • namespace and name of the Custom Resource.
  • generation or observedGeneration in status when your CRD publishes it.
  • A reconcile ID or request identifier if your framework provides one.
  • The controller name and a high-level phase (Reconciling, Applying, Waiting, TerminalError).

Avoid logging entire Kubernetes objects at high frequency. Full object dumps are difficult to search, increase log volume, and may accidentally expose sensitive fields from Secrets or configuration.

Optional: Kubernetes audit logs

Audit logs are usually managed by cluster administrators, but they are extremely valuable when the question is "who changed this object?" or "why did the API server reject it?"

Use audit logs to investigate:

  • Which user or ServiceAccount performed a create, update, or delete.
  • Whether an admission webhook rejected the request before it reached etcd.
  • Unexpected writes that continuously change a CR and trigger repeated reconciles.

Audit logs complement controller logs: the controller shows what it attempted to do, while audit logs show what requests reached the API server and how they were handled.


Frequently Asked Questions

1. Where do I start when my Custom Resource never reaches Ready?

Check the CR status and conditions, then kubectl describe on the CR and on the operator Deployment. Read recent Events in the namespace, then operator Pod logs filtered by the CR namespace and name. If creates or updates fail with webhook errors, inspect ValidatingWebhookConfiguration / MutatingWebhookConfiguration and the webhook Service and TLS Secret before you change reconciler code.

2. Why do I see TLS or x509 errors when applying a CR?

The API server calls admission or conversion webhooks over HTTPS and must trust the serving certificate. Empty or wrong caBundle, DNS mismatch on the certificate, or a webhook Pod that is not listening yet are the usual causes. cert-manager is a common way to keep caBundle in sync—verify the Certificate is Ready and the CA is injected into the webhook config.

3. What is the fastest way to tell RBAC from a logic bug?

RBAC denials often appear as is forbidden errors in manager logs or Events on the affected object. kubectl auth can-i helps confirm whether the operator ServiceAccount can perform the verb on the resource; for status-only failures, add --subresource=status to the can-i check. If permissions are correct but behavior is wrong, shift focus to reconcile logic, SSA field managers, or webhooks.

Bottom line: treat kubectl describe, Events, and manager logs as the first screen. If the API rejects changes, walk the webhook chain (configuration → Service → Endpoints → TLS). If versions drift, inspect CRD conversion and caBundle. If children look “almost right” but fields fight, open managedFields and revisit SSA ownership. That sequence resolves most Operator incidents without guessing.
