Eval-concurrency saturation

Not tied to a single Reason — this page covers what to do when the global concurrent-eval cap (--max-concurrent-evals) is full and JaaS is shedding new evaluations. The cap exists because the synchronous go-jsonnet API has no context-aware cancellation: once an eval starts it runs to natural completion, so an unbounded queue lets a runaway snippet pile up goroutines that outlive every caller’s deadline.
Symptom
One or more of:
JaaSEvalSaturationis firing —jaas_eval_in_flight / jaas_eval_max_concurrenthas been above the threshold (default0.9) for the alert window.JaaSEvalRejectedis firing —rate(jaas_eval_unavailable_total[5m])has been above the threshold.- HTTP clients see
503 Service Unavailablewith body{"error": "evaluation_unavailable", "message": "concurrent-eval cap is full; retry after backoff"}. kubectl describe jsonnetsnippetshows recurringWarning EvalUnavailableevents with messagereconcile deferred for 1s by --max-concurrent-evals. Ready condition stays untouched (backpressure is not failure).jaas_eval_outstanding_timed_outis also elevated — confirms the runaway-snippet diagnosis: orphaned evals are pinning slots while their parents have already given up.
Diagnosis: why is the cap full?
The cap fills for two distinct reasons. The right remediation depends on which.
Path A — runaway snippet (goroutines outliving their ctx)
Read the leak gauge. If it’s non-zero and trending up, evaluations are starting but not finishing — almost always a single snippet whose work dwarfs --evaluation-timeout.
# Live count of evals whose parent reconcile already timed out:
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
wget -qO- http://localhost:8083/metrics | grep jaas_eval_outstanding_timed_out
To find the culprit, scan recent reconcile logs for Jsonnet evaluation timed out followed by repeated EvalUnavailable warnings on the same snippet:
kubectl --namespace <jaas-ns> logs deploy/jaas --since=15m \
| grep -E 'EvaluationTimeout|EvalUnavailable' \
| sort | uniq -c | sort -rn | head
The snippet whose name dominates that list is the culprit. Common causes:
- Deep recursion that takes seconds-to-minutes to complete naturally even after the parent deadline fires.
- Pathological library import that triggers go-jsonnet’s worst-case eval order.
- A
std.foldlover a generated array of millions of entries.
Path B — genuine load above the cap
Leak gauge is at zero (or steady, not growing), jaas_eval_in_flight is pegged near the cap, and many distinct snippets show EvalUnavailable events. The cap is sized too small for the workload.
# Distribution of which snippets are seeing rejections — a flat
# distribution across many snippets is path B; a single dominant
# snippet is path A.
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
wget -qO- http://localhost:8083/metrics \
| grep jaas_snippet_eval_unavailable_total
Remediation
Path A — runaway snippet
Suspend the offender to stop new evals while you fix the snippet:
kubectl --namespace <ns> patch jsonnetsnippet <name> --type merge \ --patch '{"spec":{"suspend":true}}'Inspect the snippet to understand the cost. Lower
--max-stackis a blunt clamp that rejects pathological recursion before it can leak. The chart’soperator.maxStackdefaults to 500; pull it down to ~200 if the snippet doesn’t legitimately need deeper recursion.Tighten
--evaluation-timeoutif the snippet’s natural completion time is the load-bearing factor. A 5s default lets a 60s pathological eval leak for nearly a minute; dropping to 1s shrinks the worst-case leak window.Re-enable after the snippet spec is fixed:
kubectl --namespace <ns> patch jsonnetsnippet <name> --type merge \ --patch '{"spec":{"suspend":false}}'
Path B — genuine load
Raise the cap if the operator has CPU headroom. The default is
max(GOMAXPROCS*4, 16); double it via the chart:helm upgrade <release> <chart> --reuse-values \ --set arguments.maxConcurrentEvals=64Each in-flight eval pins roughly one CPU when actively running, so the practical ceiling is bounded by node CPU. Past 2-3× GOMAXPROCS the gains drop sharply — more contention, same throughput.
Tune the per-snippet rate limiter if a small number of snippets dominate the request rate.
--rerender-rate+--rerender-burstcap each snippet’s reconcile frequency independent of the global eval cap.Scale horizontally if a single replica can’t keep up even at the raised cap. The chart’s
replicas.maxcontrols the HPA ceiling; combined with the storage layer’s leader election (S3 backend) you get multi-replica HA where every replica reads but only the lease-holder writes.
When NOT to raise the cap
If the leak gauge is non-zero AND growing, raising the cap lets more goroutines pile up before the next saturation event. Diagnose path A first. The cap is a backpressure boundary, not a throughput knob.
Disable the gate (not recommended)
--max-concurrent-evals=0 disables the gate entirely. The leak gauge keeps working, but rejections never fire — a single runaway snippet can OOM the pod. Use only if you’ve sized the workload precisely and want to surface saturation purely via the leak gauge.