Workqueue saturation

Linked from the JaaSControllerWorkqueueDepthHigh alert, which fires when the reconciler’s workqueue holds more items than the configured threshold (default 50) for the duration of the alert window. The alert is not tied to a Reason constant: workqueue depth is a controller-runtime signal, not a per-snippet status.

Symptom

ALERTS{alertname="JaaSControllerWorkqueueDepthHigh", controller="jsonnetsnippet"}

Cause

The operator is dequeuing reconciles slower than watch events from the API server enqueue them. Common causes, most frequently observed first:

  1. Slow Publisher backend. S3 throttling, a slow PVC, or a stalled object-store transaction — each reconcile blocks on the storage Put.
  2. API server pressure. The cluster’s apiserver is slow on GET / UPDATE (often during a control-plane upgrade or under heavy general load).
  3. Per-snippet rate-limiter exhaustion. A flapping snippet eats its token-bucket budget; the controller’s exponential backoff stretches the queue.
  4. A large fan-out from a single source watch. One Flux source republishes and 100 snippets reference it; every snippet’s reconcile lands in the queue at once.
  5. Webhook latency. When --enable-webhook is on, every snippet write traverses the validating webhook. A wedged webhook (cert issue, slow tenant client) holds the apiserver’s call open and indirectly enlarges the queue.
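
To get a feel for how long a backlog from any of the causes above takes to clear, a rough estimate is queue depth times mean reconcile time, divided by the number of concurrent workers. The helper below is a minimal sketch of that arithmetic. The name drain_seconds is hypothetical, and the worker count defaults to 1 because that is controller-runtime's default for MaxConcurrentReconciles; whether this operator overrides it is an assumption to verify. The estimate ignores new arrivals and rate-limiter backoff, so treat it as a lower bound.

```shell
# drain_seconds DEPTH MEAN_SECONDS [WORKERS]
# Hypothetical helper: estimates seconds needed to drain DEPTH queued items
# when each reconcile takes MEAN_SECONDS and WORKERS run concurrently.
# Assumes no new items arrive and no backoff delay applies.
drain_seconds() {
  awk -v depth="$1" -v mean="$2" -v workers="${3:-1}" 'BEGIN {
    printf "%.1f\n", depth * mean / workers
  }'
}
```

For the fan-out case in item 4, 100 snippets at an assumed 0.5 s per reconcile give `drain_seconds 100 0.5`, which prints 50.0; with four workers the same backlog comes out at 12.5.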

Diagnosis

# Per-controller queue depth — confirm which controller is saturated
kubectl --namespace <jaas-ns> port-forward svc/jaas-metrics 8083:8083 &
curl -s localhost:8083/metrics | grep -E 'workqueue_depth|workqueue_adds_total'

# Reconcile-time histogram — separates "lots of queued items" (fan-out)
# from "each reconcile is slow" (storage / apiserver).
curl -s localhost:8083/metrics | grep 'controller_runtime_reconcile_time_seconds'
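
The raw grep output can be long, so the two checks above can be condensed with a pair of small filters. This is a sketch only: the function names are hypothetical, and it assumes the endpoint serves the standard Prometheus text format with no spaces inside label values and no trailing timestamps. The mean is computed from the histogram's _sum and _count series and is cumulative since process start, so it is a coarse signal; a true p99 needs histogram_quantile in Prometheus.

```shell
# depth_over_threshold [LIMIT]
# Reads Prometheus text on stdin and prints each workqueue_depth series
# whose value exceeds LIMIT (default 50, matching the alert default).
depth_over_threshold() {
  awk -v limit="${1:-50}" '
    /^workqueue_depth[{ ]/ && $NF + 0 > limit + 0 { print $0 }
  '
}

# mean_reconcile_seconds
# Reads Prometheus text on stdin and prints "<labels> <mean seconds>" for
# each controller_runtime_reconcile_time_seconds label set (_sum / _count).
mean_reconcile_seconds() {
  awk '
    /^controller_runtime_reconcile_time_seconds_sum[{ ]/ {
      key = $1; sub(/^[^{]*/, "", key); sum[key] = $NF
    }
    /^controller_runtime_reconcile_time_seconds_count[{ ]/ {
      key = $1; sub(/^[^{]*/, "", key); count[key] = $NF
    }
    END {
      for (k in count)
        if (count[k] + 0 > 0) printf "%s %.3f\n", k, sum[k] / count[k]
    }
  '
}
```

Usage, with the port-forward from above still running:

```shell
curl -s localhost:8083/metrics | depth_over_threshold 50
curl -s localhost:8083/metrics | mean_reconcile_seconds
```

A deep queue with a low mean points toward fan-out; a modest queue with a high mean points toward storage or apiserver latency.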

Cross-reference operator logs for the slow path:

kubectl --namespace <jaas-ns> logs deploy/jaas --tail=500 \
  | grep -E 'reconcile|publisher|s3|webhook'
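
When the combined grep returns too many lines to read, counting matches per keyword shows which slow path dominates. The filter below is a sketch with a hypothetical name, count_slow_path. It matches case-insensitively, which differs from the command above so that lines containing "S3" are also counted, and it makes no assumption about the operator's log format beyond one entry per line.

```shell
# count_slow_path
# Reads log lines on stdin and prints "<keyword> <matching line count>"
# for each slow-path keyword used in the grep above.
count_slow_path() {
  logs=$(cat)
  for kw in reconcile publisher s3 webhook; do
    # grep -c prints 0 but exits non-zero when nothing matches; tolerate that.
    n=$(printf '%s\n' "$logs" | grep -c -i -- "$kw" || true)
    printf '%s %s\n' "$kw" "$n"
  done
}
```

Usage:

```shell
kubectl --namespace <jaas-ns> logs deploy/jaas --tail=500 | count_slow_path
```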

If the p99 of controller_runtime_reconcile_time_seconds is also high, this alert is a symptom of slow reconciles and JaaSReconcileLatencyHigh is the more useful page; see reconcile-latency.md.

Remediation

Prevention