Watch-layer silent failure

Not tied to a per-snippet Reason. This page covers the one RBAC-denial path the reconciler cannot surface itself: when the operator’s own ClusterRole is missing a verb on a watched resource kind, controller-runtime’s informer fails to start its watch, logs warnings, and retries silently. The reconciler never sees the failure — and no snippet’s status condition will tell you about it.
If a per-snippet runbook (rbacdenied.md, sourcefetchfailed.md, sourcenotready.md) doesn’t match the symptoms, this is where to look next.
Symptom
- Snippets that worked yesterday stop receiving watch-driven re-renders. They still reconcile on edits or
spec.intervalticks, but not on upstream source changes. kubectl describe jsonnetsnippetshows healthy or stale state — neverReason=RBACDeniedor any other failure.- Operator pod is
Ready=True, all probes pass. - A
Flux GitRepository(orOCIRepository/Bucket/ExternalArtifact/JsonnetLibrary) advances itsstatus.artifactbut no JaaS reconcile fires.
Why JaaS can’t tell you directly
controller-runtime’s informer is what watches resource kinds at the apiserver. If the operator SA lacks list/watch on a kind:
- The informer’s initial LIST returns
Forbidden. - controller-runtime logs
Failed to watch *v1.X: forbidden: ...at error level. - The informer retries with exponential backoff — forever.
- The reconciler’s reconcile loop never gets events from that kind.
- The reconciler itself is unaware that the watch is non-functional. The “no events arriving” condition is indistinguishable from “no actual changes upstream.”
This is the one diagnostic surface the operator can’t unify with its other RBAC-denial paths (Fetcher / library Get / Publisher write — all per-reconcile and surfaced via Reason=RBACDenied).
Diagnosis
The smoking gun is in the operator’s logs:
kubectl --namespace <jaas-ns> logs deploy/jaas --tail=2000 \
| grep -E 'Failed to watch|"reflector.go"' \
| head -30
Expected output if the watch layer is healthy: nothing.
If broken, you’ll see lines like:
E0610 12:34:56.789 reflector.go:227 "Failed to watch" err="failed to list *v1.GitRepository: ... forbidden: ServiceAccount \"jaas\" cannot list resource \"gitrepositories\" in API group \"source.toolkit.fluxcd.io\" at the cluster scope" type="*v1.GitRepository"
The error names the SA, verb, resource, and API group — that’s the exact gap to close.
Check what the operator SA can actually do:
kubectl auth can-i list gitrepositories.source.toolkit.fluxcd.io \
--as=system:serviceaccount:<jaas-ns>:jaas
Compare against the chart-rendered ClusterRole:
kubectl get clusterrole <release>-operator --output yaml | grep -A2 source.toolkit.fluxcd.io
Remediation
The operator’s ClusterRole verbs are defined in the chart’s templates/clusterrole-operator.yaml (in the metio/helm-charts repo, under charts/jaas/). Three causes warrant separate fixes:
1. Chart upgraded with rbac.create: false
Someone disabled chart-rendered RBAC (operator.rbac.create: false) and the external RBAC source missed a verb. Either re-enable chart-rendered RBAC, or update whatever owns the ClusterRole to grant the missing verbs.
2. Manual chart edit removed verbs
A kubectl edit clusterrole <release>-operator or a hand-rolled overlay removed verbs the chart originally rendered. Restore via helm upgrade --install (idempotent for an installed chart).
3. New source kind added but chart’s drift gate didn’t catch it
The drift-gate test in the chart’s tests/clusterrole-operator_test.yaml (metio/helm-charts, under charts/jaas/) — the “ClusterRole drift gate” case — is supposed to fail at PR time if operator.FluxSourceKinds adds a kind without a matching ClusterRole entry. If you reach this runbook page because of a new kind, it means the test passed but production RBAC is still missing it — investigate why (test bypassed, chart drift, etc.). Add the verb manually as a hotfix, then file a bug against the drift gate.
After granting the verb, restart the operator pod so a fresh informer picks up the new ServiceAccount-token permissions:
kubectl --namespace <jaas-ns> rollout restart deploy/jaas
Watch-driven re-renders resume within seconds.
Why not detect this automatically
A startup probe (operator does a test LIST per kind and refuses to boot on Forbidden) was considered and rejected:
- It would block startup on transient apiserver flakes during deploys.
- The CRD-watcher pattern already handles the missing-CRD case gracefully (the
crdWatcherengages a watch dynamically when the CRD becomesEstablished=True). Layering “and also fail on Forbidden” complicates that contract. - A misconfigured cluster should surface the issue via the existing logs + the
kubectl auth can-iworkflow, which is the standard k8s troubleshooting path.
The diagnostic trail above is the supported recovery story. If a user reports hitting this in the wild and finds the log-grep step too obscure, a follow-up is a jaas_informer_watch_failures_total Prometheus counter plus a JaaSInformerWatchFailing alert — same shape as JaaSStorageSweepFailures from the storage layer. Track in open-items.md if it comes up.
Related runbooks
- rbacdenied.md
— per-reconcile RBAC denials the reconciler CAN surface (tenant SA can’t read a source CR / library, can’t write ExternalArtifact). If a snippet’s status says
Reason=RBACDenied, start there instead. - storage-recovery.md — different failure surface (storage backend rather than apiserver), same “graceful degradation, diagnosis via logs + metrics” shape.