Operator pod not ready

Talks about: lifecycle, runbooks, and troubleshooting

Linked from the JaaSOperatorPodDown alert. Fires when at least one jaas pod has been Ready=False for the alert window (default 5m).

Symptom

ALERTS{alertname="JaaSOperatorPodDown", namespace="<jaas-ns>"}

New JsonnetSnippets stay Ready=Unknown indefinitely (no controller reconciling them).
Existing snippets keep serving stale ExternalArtifact content (the storage HTTP server may still respond; reconciliation is the part that stopped).

Cause

One of the chart’s two probes is failing:

Liveness (/live) is unconditional 200 — a failure here means the pod’s HTTP server itself isn’t responding (deadlock, OOM, panic).
Readiness (/ready) consults HealthState, which goes false during drainBeforeShutdown or before the listeners bind.
Startup (/start) returns 503 until MarkStarted() is called — bind failures (port already in use, permission denied) keep the pod stuck here forever.

Frequent causes:

Bind failure on one of the HTTP servers (jsonnet, management, storage, metrics, webhook). The pod logs a clear “listen tcp: address already in use” or similar at boot.
OOMKilled — a pathological snippet allocated a huge object; the kubelet killed the pod. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled.
Image pull failure — registry rate limit, wrong tag, missing pull secret.
TLS cert missing or unreadable when operator.webhook.enabled=true and the cert-manager Secret hasn’t materialized.
Lease contention that leaves no replica as leader (every replica reconnecting to renew, never holding the lease).

Diagnosis

# Which probes are failing? Events tell you.
kubectl --namespace <jaas-ns> describe pod --selector app.kubernetes.io/name=jaas

# Pod logs — the boot sequence prints every listener it binds.
kubectl --namespace <jaas-ns> logs --selector app.kubernetes.io/name=jaas --tail=300

# Compare against the expected listener set.
kubectl --namespace <jaas-ns> get svc --selector app.kubernetes.io/name=jaas

For OOM:

kubectl --namespace <jaas-ns> top pod --selector app.kubernetes.io/name=jaas
kubectl --namespace <jaas-ns> get pod --selector app.kubernetes.io/name=jaas --output yaml \
  | grep -A3 lastState

For lease problems (multi-replica only):

kubectl --namespace <jaas-ns> get lease <release-name>-operator --output yaml

holderIdentity flipping every renewal interval is a sign of network flake or apiserver pressure — the replicas can’t keep the lease stable.

Remediation

Bind failure. Free the colliding port (often 8080, when the controller-runtime metrics endpoint defaults conflict with the jsonnet HTTP port — confirm --metrics-bind-address is :8083).
OOMKilled. Raise resources.memory, then identify the runaway snippet (the bench in internal/operator/bench_test.go is a regression baseline; the runaway is usually obvious from jaas_snippet_rendered_bytes).
Image pull. Standard k8s drill: check secrets, registry, tag.
TLS cert. With certMode=cert-manager, confirm the Issuer / Certificate are ready. With certMode=self-signed, the operator regenerates on boot — a permission error on the cert-dir mount blocks it.
Lease flap. Try kubectl --namespace <jaas-ns> delete lease <release-name>-operator to force a fresh election. If it keeps flapping, the cluster has bigger problems than JaaS.

Prevention

Pin replicas.max: 1 and LeaderElectionReleaseOnCancel: true (chart defaults). Multi-replica is only worth it for storage-backed HA — single-replica is the simpler operational story.
Run the cleanup Job (operator.cleanupOnDelete.enabled: true, chart default) so a helm uninstall of a wedged operator unwinds the finalizers instead of leaving orphaned snippets.
Pair this alert with JaaSControllerWorkqueueDepthHigh (workqueue-saturation.md ). A pod-down event almost always coincides with a saturated queue from snippets piling up.