Operator pod not ready

Linked from the JaaSOperatorPodDown alert. Fires when at least one jaas pod has been Ready=False for the alert window (default 5m).
Symptom
ALERTS{alertname="JaaSOperatorPodDown", namespace="<jaas-ns>"}
- New JsonnetSnippets stay Ready=Unknown indefinitely (no controller reconciling them).
- Existing snippets keep serving stale ExternalArtifact content (the storage HTTP server may still respond; reconciliation is the part that stopped).
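To see which snippets are affected, the Ready condition can be listed per object. This is a minimal sketch, assuming the CRD is addressable as jsonnetsnippet through kubectl and that each object reports a standard Ready condition under .status.conditions. The unready_snippets helper name is hypothetical.

```shell
# List JsonnetSnippets whose Ready condition is anything other than True.
# Usage: unready_snippets <namespace>
# Assumption: the resource name "jsonnetsnippet" resolves in this cluster.
unready_snippets() {
  ns="$1"
  kubectl --namespace "$ns" get jsonnetsnippet \
    --output jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
    # Column 2 is empty when the object has no Ready condition yet.
    awk 'NF > 0 && $2 != "True" { print $1, ($2 == "" ? "missing" : $2) }'
}
```

Rows showing Unknown or missing while the operator pod is down match the first symptom above.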
Cause
One of the chart’s three probes is failing:
- Liveness (/live) is unconditional 200, so a failure here means the pod’s HTTP server itself isn’t responding (deadlock, OOM, panic).
- Readiness (/ready) consults HealthState, which goes false during drainBeforeShutdown or before the listeners bind.
- Startup (/start) returns 503 until MarkStarted() is called; bind failures (port already in use, permission denied) keep the pod stuck here forever.
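The endpoints listed above can also be checked by hand through a port-forward. This is a minimal sketch, assuming a port-forward is already running to whichever container port serves the probes (this runbook does not state it). The probe_status helper name and the local port in the usage comment are hypothetical.

```shell
# Print the HTTP status code returned by each probe path.
# Usage: probe_status <local-port>
# Assumes a forward is already running, for example:
#   kubectl --namespace <jaas-ns> port-forward pod/<pod-name> 8081:<probe-port>
probe_status() {
  port="$1"
  for path in /live /ready /start; do
    # --write-out prints only the status code; curl reports 000 when it
    # gets no HTTP response at all (connection refused, timeout).
    code=$(curl --silent --output /dev/null --max-time 2 \
      --write-out '%{http_code}' "http://127.0.0.1:${port}${path}") || true
    printf '%s %s\n' "$path" "$code"
  done
}
```

Going by the descriptions above, a 503 on /start points at a listener that never bound, while no response on /live suggests the HTTP server itself is wedged.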
Frequent causes:
- Bind failure on one of the HTTP servers (jsonnet, management, storage, metrics, webhook). The pod logs a clear “listen tcp: address already in use” or similar at boot.
- OOMKilled: a pathological snippet allocated a huge object; the kubelet killed the pod. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled.
- Image pull failure: registry rate limit, wrong tag, missing pull secret.
- TLS cert missing or unreadable when operator.webhook.enabled=true and the cert-manager Secret hasn’t materialized.
- Lease contention that leaves no replica as leader (every replica reconnecting to renew, never holding the lease).
Diagnosis
# Which probes are failing? Events tell you.
kubectl --namespace <jaas-ns> describe pod --selector app.kubernetes.io/name=jaas
# Pod logs — the boot sequence prints every listener it binds.
kubectl --namespace <jaas-ns> logs --selector app.kubernetes.io/name=jaas --tail=300
# Compare against the expected listener set.
kubectl --namespace <jaas-ns> get svc --selector app.kubernetes.io/name=jaas
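Boot logs can be long, so it may help to filter for the bind errors described under Cause. This is a small sketch; the bind_errors helper name is hypothetical, and the patterns are taken from the messages quoted above, so adjust them if the operator words its errors differently.

```shell
# Keep only log lines that look like listener bind failures.
# Reads log text on stdin. Usage:
#   kubectl --namespace <jaas-ns> logs --selector app.kubernetes.io/name=jaas --tail=300 | bind_errors
bind_errors() {
  grep -E 'listen tcp|address already in use|permission denied'
}
```

An empty result does not rule out a bind failure. It only means none of these three patterns appeared in the lines that were fetched.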
For OOM:
kubectl --namespace <jaas-ns> top pod --selector app.kubernetes.io/name=jaas
kubectl --namespace <jaas-ns> get pod --selector app.kubernetes.io/name=jaas --output yaml \
| grep -A3 lastState
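The grep above depends on how the YAML happens to be laid out. A jsonpath query is one alternative that reads the termination reason directly. This is a minimal sketch; oom_killed_pods is a hypothetical helper name.

```shell
# Print the names of jaas pods with a container whose last state is OOMKilled.
# Usage: oom_killed_pods <namespace>
oom_killed_pods() {
  ns="$1"
  # Emits one line per pod: "<pod> <container>=<reason> <container>=<reason> ..."
  # The reason is empty for containers that have never terminated.
  kubectl --namespace "$ns" get pod \
    --selector app.kubernetes.io/name=jaas \
    --output jsonpath='{range .items[*]}{.metadata.name}{range .status.containerStatuses[*]}{" "}{.name}{"="}{.lastState.terminated.reason}{end}{"\n"}{end}' |
    awk '/=OOMKilled/ { print $1 }'
}
```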
For lease problems (multi-replica only):
kubectl --namespace <jaas-ns> get lease <release-name>-operator --output yaml
holderIdentity flipping every renewal interval is a sign of network flake or apiserver pressure — the replicas can’t keep the lease stable.
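Flapping is easier to judge from several samples than from one snapshot. This is a minimal sketch that polls the lease and counts holder changes. The lease_flaps helper name and the default sample count and interval are assumptions, so pick an interval close to the configured renewal period.

```shell
# Sample the lease holder several times and print how often it changed.
# Usage: lease_flaps <namespace> <lease-name> [samples] [interval-seconds]
lease_flaps() {
  ns="$1"; lease="$2"; samples="${3:-6}"; interval="${4:-10}"
  prev=""; flaps=0; i=0
  while [ "$i" -lt "$samples" ]; do
    holder=$(kubectl --namespace "$ns" get lease "$lease" \
      --output jsonpath='{.spec.holderIdentity}')
    # Count a change only when the previous sample had a holder and it differs.
    if [ -n "$prev" ] && [ "$holder" != "$prev" ]; then
      flaps=$((flaps + 1))
    fi
    prev="$holder"
    i=$((i + 1))
    # Sleep between samples, but not after the last one.
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done
  echo "$flaps"
}
```

A stable leader prints 0. A count close to the number of samples matches the flapping described above.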
Remediation
- Bind failure. Free the colliding port (often 8080, when the controller-runtime metrics endpoint defaults conflict with the jsonnet HTTP port; confirm --metrics-bind-address is :8083).
- OOMKilled. Raise resources.memory, then identify the runaway snippet (the bench in internal/operator/bench_test.go is a regression baseline; the runaway is usually obvious from jaas_snippet_rendered_bytes).
- Image pull. Standard k8s drill: check secrets, registry, tag.
- TLS cert. With certMode=cert-manager, confirm the Issuer / Certificate are ready. With certMode=self-signed, the operator regenerates on boot; a permission error on the cert-dir mount blocks it.
- Lease flap. Try kubectl --namespace <jaas-ns> delete lease <release-name>-operator to force a fresh election. If it keeps flapping, the cluster has bigger problems than JaaS.
Prevention
- Pin replicas.max: 1 and LeaderElectionReleaseOnCancel: true (chart defaults). Multi-replica is only worth it for storage-backed HA; single-replica is the simpler operational story.
- Run the cleanup Job (operator.cleanupOnDelete.enabled: true, chart default) so a helm uninstall of a wedged operator unwinds the finalizers instead of leaving orphaned snippets.
- Pair this alert with JaaSControllerWorkqueueDepthHigh (workqueue-saturation.md). A pod-down event almost always coincides with a saturated queue from snippets piling up.