Eval-concurrency saturation

Not tied to a single Reason — this page covers what to do when the global concurrent-eval cap (--max-concurrent-evals) is full and JaaS is shedding new evaluations. The cap exists because the synchronous go-jsonnet API has no context-aware cancellation: once an eval starts it runs to natural completion, so an unbounded queue lets a runaway snippet pile up goroutines that outlive every caller’s deadline.

Symptom

One or more of:

Diagnosis: why is the cap full?

The cap fills for two distinct reasons. The right remediation depends on which.

Path A — runaway snippet (goroutines outliving their ctx)

Read the leak gauge. If it’s non-zero and trending up, evaluations are starting but not finishing — almost always a single snippet whose work dwarfs --evaluation-timeout.

# Live count of evals whose parent reconcile already timed out:
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
  wget -qO- http://localhost:8083/metrics | grep jaas_eval_outstanding_timed_out
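A single reading does not show a trend, so it can help to compare two samples taken a minute or so apart. The sketch below assumes the gauge is exposed as an unlabelled series named exactly jaas_eval_outstanding_timed_out (the grep above only establishes that prefix). The helper names gauge_value and leak_trend are illustrative and not part of JaaS.

```shell
# Print the value of an unlabelled Prometheus series read from stdin.
# Exits non-zero if the series is absent. HELP/TYPE comment lines are
# skipped because their first field is "#".
gauge_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit }
                    END { if (!found) exit 1 }'
}

# Compare two samples of the leak gauge (earlier, later).
# Prints "leaking" if the later sample is non-zero and larger than the
# earlier one, "steady" otherwise.
leak_trend() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    r = (b + 0 > 0 && b + 0 > a + 0) ? "leaking" : "steady"
    print r
  }'
}

# Usage against a live operator (not executed here):
#   sample() {
#     kubectl --namespace <jaas-ns> exec deploy/jaas -- \
#       wget -qO- http://localhost:8083/metrics \
#       | gauge_value jaas_eval_outstanding_timed_out
#   }
#   first=$(sample); sleep 60; second=$(sample)
#   leak_trend "$first" "$second"
```

A "leaking" result points at Path A; "steady" with a saturated cap points at Path B.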

To find the culprit, scan recent reconcile logs for "Jsonnet evaluation timed out" followed by repeated EvalUnavailable warnings on the same snippet:

kubectl --namespace <jaas-ns> logs deploy/jaas --since=15m \
  | grep -E 'EvaluationTimeout|EvalUnavailable' \
  | sort | uniq -c | sort -rn | head
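If the operator's log lines carry timestamps, sort | uniq -c on whole lines treats every line as unique and each count comes out as 1. A minimal sketch that counts by snippet identifier instead is shown here. The snippet=<namespace>/<name> field is a hypothetical log format used only for illustration; replace the pattern with whatever key your JaaS logs actually emit.

```shell
# Count matching log lines per snippet, most frequent first.
# $1: extended regex that matches the snippet field in a log line.
#     The default is a HYPOTHETICAL key=value format, not a documented one.
count_by_snippet() {
  pattern=${1:-'snippet=[^ ]+'}
  grep -E 'EvaluationTimeout|EvalUnavailable' \
    | grep -oE "$pattern" \
    | sort | uniq -c | sort -rn | head
}

# Usage against a live operator (not executed here):
#   kubectl --namespace <jaas-ns> logs deploy/jaas --since=15m \
#     | count_by_snippet
```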

The snippet whose name dominates that list is the culprit. Common causes:

Path B — genuine load above the cap

The leak gauge is at zero (or steady, not growing), jaas_eval_in_flight is pegged near the cap, and many distinct snippets show EvalUnavailable events. In that case the cap is too small for the workload.

# Distribution of which snippets are seeing rejections — a flat
# distribution across many snippets is path B; a single dominant
# snippet is path A.
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
  wget -qO- http://localhost:8083/metrics \
  | grep jaas_snippet_eval_unavailable_total
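To put a number on "flat" versus "dominant", the sketch below reports the share of all rejections held by the single busiest series. It assumes one jaas_snippet_eval_unavailable_total series per snippet, with the counter value as the last field and no spaces inside label values. The label names in the usage comment are hypothetical, and what share counts as "dominant" is a judgment call this page does not define.

```shell
# Read Prometheus text from stdin and print "<percent> <series>" for the
# series with the largest share of jaas_snippet_eval_unavailable_total.
# Prints "0 none" when no rejections have been recorded.
top_rejection_share() {
  awk '/^jaas_snippet_eval_unavailable_total/ {
         value = $NF + 0
         total += value
         if (value > max) { max = value; top = $1 }
       }
       END {
         if (total == 0) { print "0 none"; exit }
         printf "%d %s\n", (100 * max) / total, top
       }'
}

# Usage against a live operator (not executed here):
#   kubectl --namespace <jaas-ns> exec deploy/jaas -- \
#     wget -qO- http://localhost:8083/metrics | top_rejection_share
# A line such as
#   90 jaas_snippet_eval_unavailable_total{name="a"}
# would suggest one snippet dominates (path A); a low percentage
# across many series suggests path B.
```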

Remediation

Path A — runaway snippet

  1. Suspend the offender to stop new evals while you fix the snippet:

    kubectl --namespace <ns> patch jsonnetsnippet <name> --type merge \
      --patch '{"spec":{"suspend":true}}'
    
  2. Inspect the snippet to understand the cost. Lowering --max-stack is a blunt clamp that rejects pathological recursion before it can leak. The chart’s operator.maxStack defaults to 500; pull it down to ~200 if the snippet doesn’t legitimately need deeper recursion.

  3. Tighten --evaluation-timeout if the snippet’s natural completion time is the load-bearing factor. A 5s default lets a 60s pathological eval leak for nearly a minute; dropping to 1s shrinks the worst-case leak window.

  4. Re-enable after the snippet spec is fixed:

    kubectl --namespace <ns> patch jsonnetsnippet <name> --type merge \
      --patch '{"spec":{"suspend":false}}'
    

Path B — genuine load

  1. Raise the cap if the operator has CPU headroom. The default is max(GOMAXPROCS*4, 16); double it via the chart (the value 64 below assumes a default of 32, i.e. GOMAXPROCS=8):

    helm upgrade <release> <chart> --reuse-values \
      --set arguments.maxConcurrentEvals=64
    

    Each in-flight eval pins roughly one CPU when actively running, so the practical ceiling is bounded by node CPU. Past 2-3× GOMAXPROCS the gains drop sharply — more contention, same throughput.

  2. Tune the per-snippet rate limiter if a small number of snippets dominate the request rate. --rerender-rate and --rerender-burst cap each snippet’s reconcile frequency independently of the global eval cap.

  3. Scale horizontally if a single replica can’t keep up even at the raised cap. The chart’s replicas.max controls the HPA ceiling; combined with the storage layer’s leader election (S3 backend) you get multi-replica HA where every replica reads but only the lease-holder writes.
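
The sizing arithmetic in step 1 can be checked before touching the chart. This sketch only evaluates the max(GOMAXPROCS*4, 16) formula quoted above and doubles the result. The function names are illustrative, and the script does not read the operator's real GOMAXPROCS; you supply it as an argument.

```shell
# Default cap as described on this page: max(GOMAXPROCS*4, 16).
# $1: the operator's GOMAXPROCS (an input you provide).
default_cap() {
  procs=$1
  cap=$((procs * 4))
  if [ "$cap" -lt 16 ]; then cap=16; fi
  echo "$cap"
}

# Twice the default, i.e. a candidate value for the raised cap.
doubled_cap() {
  echo $(( $(default_cap "$1") * 2 ))
}
```

For example, doubled_cap 8 prints 64 and doubled_cap 2 prints 32, because the floor of 16 applies on small machines.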

When NOT to raise the cap

If the leak gauge is non-zero AND growing, raising the cap lets more goroutines pile up before the next saturation event. Diagnose path A first. The cap is a backpressure boundary, not a throughput knob.

--max-concurrent-evals=0 disables the gate entirely. The leak gauge keeps working, but rejections never fire — a single runaway snippet can OOM the pod. Use only if you’ve sized the workload precisely and want to surface saturation purely via the leak gauge.