CRD watch engagement failing

Fires when jaas_crd_watch_engagement_failures_total{gvk=...} has increased above the per-hour threshold for the alert window. JaaS lazy-watches Flux source CRDs: at boot, only the CRDs already installed get a watch; when a previously missing CRD becomes Established=True (say, because source-controller was installed after boot), the crdWatcher engages a runtime watch on it via Controller.Watch. If that engagement fails, the apiextensions informer fires no further events for a stable CRD, so the watch stays un-engaged until either something else changes the CRD object’s metadata/status or the operator restarts.
The visible symptom is that snippets with spec.sourceRef.Kind=<the affected kind> stop re-rendering on upstream source updates. There is no per-snippet status signal — they sit at their last-rendered revision, drifting from upstream.
Symptom
- JaaSCRDWatchEngagementFailing alert is firing with gvk labelling the affected kind.
- kubectl describe jsonnetsnippet on snippets referencing that GVK shows a Ready condition that hasn’t moved in hours/days.
- Upstream Flux source CRs (GitRepository, OCIRepository, Bucket, ExternalArtifact) show recent status.artifact.revision changes that the jaas snippets aren’t picking up.
Diagnosis
Step 1 — confirm the CRD is actually installed and Established
kubectl get crd <plural>.source.toolkit.fluxcd.io \
--output jsonpath='{.status.conditions[?(@.type=="Established")].status}{"\n"}'
Expect True. If the CRD is not installed or not yet Established, the watcher is correct to skip it; install the CRD or wait for it to become Established.
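When it is unclear which kind is affected, the same check can be looped over every Flux source CRD. The following is a minimal sketch, not part of JaaS: the function names are hypothetical, and the plural names are assumed from the kinds listed under Symptom, so adjust them to what your cluster actually serves.

```shell
# Sketch: print the Established status of each Flux source CRD.
# Function names are hypothetical; the plurals are assumptions derived from
# the kinds GitRepository, OCIRepository, Bucket and ExternalArtifact.

crd_established() {
  # $1 = CRD plural, e.g. gitrepositories
  local status
  status=$(kubectl get crd "$1.source.toolkit.fluxcd.io" \
    --output jsonpath='{.status.conditions[?(@.type=="Established")].status}' \
    2>/dev/null) || true
  # Empty output means the CRD is missing or has no Established condition yet.
  echo "${status:-NotInstalled}"
}

check_flux_source_crds() {
  local plural
  for plural in gitrepositories ocirepositories buckets externalartifacts; do
    printf '%s\t%s\n' "$plural" "$(crd_established "$plural")"
  done
}
```

Running check_flux_source_crds prints one tab-separated line per CRD; anything other than True on the affected kind means the watcher is right to skip it.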
Step 2 — check the operator’s RBAC on the source kind
kubectl auth can-i list <plural>.source.toolkit.fluxcd.io \
--as=system:serviceaccount:<ns>:<operator-sa>
kubectl auth can-i watch <plural>.source.toolkit.fluxcd.io \
--as=system:serviceaccount:<ns>:<operator-sa>
If either is “no”, the chart’s operator-tenants ClusterRole (or per-namespace RoleBinding when watchNamespaces is set) is missing the get/list/watch verbs on this kind. Update the chart’s FluxSourceKinds mapping or add the verb manually.
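Since the ClusterRole needs all of get, list and watch, it can be convenient to test the three verbs in one pass. This is a rough sketch under the same placeholders as the commands above; missing_source_verbs is a hypothetical helper name, and it relies only on kubectl auth can-i printing yes or no.

```shell
# Sketch: report which of get/list/watch the operator ServiceAccount lacks
# on one Flux source kind.
# $1 = resource plural, $2 = operator namespace, $3 = ServiceAccount name.
missing_source_verbs() {
  local resource="$1.source.toolkit.fluxcd.io" missing="" verb answer
  for verb in get list watch; do
    # can-i prints "yes" or "no"; any error is treated as "no".
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:$2:$3" 2>/dev/null) || true
    [ "$answer" = "yes" ] || missing="$missing $verb"
  done
  echo "missing:${missing:- none}"
}
```

For example, missing_source_verbs gitrepositories <ns> <operator-sa> prints "missing: none" when RBAC is complete, or the list of absent verbs otherwise.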
Step 3 — check controller-runtime cache state
kubectl --namespace <ns> logs <operator-pod> | grep -E 'engage|Failed to watch|cache' | tail -20
Look for cache reconnect, informer failed, or Watch failed: forbidden. A transient cache reconnect during a heavy load period can trip engagement once; the DD7 bounded-retry mechanism re-engages automatically. Sustained failures point at RBAC or a misconfigured MetricsBindAddress.
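To tell a one-off cache blip from a sustained RBAC problem, it can help to count the two families of messages separately. The sketch below assumes the log lines contain the literal phrases quoted above (forbidden, cache reconnect, informer failed); the exact wording in your operator version may differ, and summarize_engagement_logs is a hypothetical name.

```shell
# Sketch: count RBAC-style and cache-style failure lines in operator logs
# read from stdin. The matched phrases are assumptions taken from the
# messages this runbook says to look for.
summarize_engagement_logs() {
  awk '
    /forbidden/                       { rbac++ }
    /cache reconnect|informer failed/ { cache++ }
    END { printf "rbac=%d cache=%d\n", rbac, cache }
  '
}
```

Usage: kubectl --namespace <ns> logs <operator-pod> | summarize_engagement_logs. A non-zero rbac count points back at Step 2; a single cache hit with no repeats is consistent with a transient reconnect.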
Remediation
- Fix the verb / kind / RBAC issue identified above.
- Roll the operator pod to force a fresh SetupWithManager pass, which re-detects every Flux CRD and re-engages watches that succeed on first try:
kubectl --namespace <ns> rollout restart deployment <operator-deployment>
- Verify the counter stops increasing and the alert clears.
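One way to verify the counter has stopped increasing is to sample the operator’s metrics endpoint twice and compare the totals. This is a sketch only: engagement_failures is a hypothetical helper, it assumes Prometheus text exposition without trailing timestamps, and how you reach the metrics endpoint (port-forward, port number, path) depends on your deployment and is not specified here.

```shell
# Sketch: sum all jaas_crd_watch_engagement_failures_total samples from
# Prometheus text-format metrics on stdin.
# Assumes the sample value is the last field on each line (no timestamp).
engagement_failures() {
  awk '
    /^jaas_crd_watch_engagement_failures_total[{ ]/ { sum += $NF }
    END { print sum + 0 }
  '
}
```

Pipe the scraped metrics into engagement_failures before and a few minutes after the restart; equal totals indicate no new engagement failures. Note that a restarted pod resets the counter, so take both samples from the new pod.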
When the alert is noisy
If jaas_crd_watch_engagement_failures_total ticks once at boot but never again, that’s the expected DD7 bounded-retry behavior: the first attempt failed (transient race during cache start), the retry succeeded. Raise crdWatchEngagementFailuresPerHour if the boot-time blip is noisy enough to page.