Operations

Day-two operations for a running JaaS install. Initial install and hardening decisions are in Kubernetes and Production.
Graceful shutdown and drain
When Kubernetes sends SIGTERM, JaaS executes a two-phase shutdown to avoid
dropping in-flight requests:
- The readiness probe flips to false (503 on /ready). Kubernetes endpoint controllers begin deregistering the pod from Services.
- JaaS waits for --shutdown-delay (default 5s) before closing its listeners. This window lets the endpoint propagation complete so no new traffic arrives after the server closes.
- After the delay, the servers shut down gracefully with a 30-second context.WithTimeout. The operator goroutine is also cancelled and awaited within the same 30-second window.
The distroless runtime image has no sleep binary, so the drain delay is
implemented in the binary rather than via a preStop hook. A second
SIGTERM (or SIGINT) during the drain cuts the wait short.
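The sequence above can be sketched in Go. This is a minimal illustration, not the JaaS source: the names ready, readyHandler, and waitDrain are hypothetical, the listener address is arbitrary, and the two signals are injected by hand so the example terminates on its own. It assumes only the behaviour described above (readiness flips to 503, a delay that a second signal cuts short, then a 30-second bounded shutdown).

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// ready backs the /ready handler (hypothetical name); false means draining.
var ready atomic.Bool

// readyHandler answers 200 while ready and 503 once draining has begun.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	if ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// waitDrain blocks for delay, or returns early when another signal arrives.
// It reports whether the full delay elapsed. A delay of zero or less skips
// the wait, which corresponds to --shutdown-delay 0.
func waitDrain(delay time.Duration, interrupt <-chan os.Signal) bool {
	if delay <= 0 {
		return true
	}
	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case <-timer.C:
		return true
	case <-interrupt:
		return false
	}
}

func main() {
	ready.Store(true)
	ln, err := net.Listen("tcp", "127.0.0.1:0") // arbitrary free port for the demo
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/ready", readyHandler)
	srv := &http.Server{Handler: mux}
	go func() { _ = srv.Serve(ln) }()

	sigs := make(chan os.Signal, 2)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	// Demo only: inject two signals so the example exits without a real kill.
	sigs <- syscall.SIGTERM
	sigs <- syscall.SIGTERM

	<-sigs             // first signal starts the drain
	ready.Store(false) // /ready now returns 503
	full := waitDrain(5*time.Second, sigs)
	fmt.Println("drain ran for the full delay:", full)

	// Bounded graceful shutdown, mirroring the documented 30-second timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		fmt.Println("shutdown:", err)
	}
}
```

Keeping the wait inside the process is what allows a second signal to shorten it; a preStop sleep could not be interrupted this way.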
To disable the drain (zero delay):
--shutdown-delay 0
The chart value is arguments.shutdownDelay.
Leader election during rolling updates
In operator mode, leader election is on by default (--leader-election,
operator.leaderElection.enabled: true). The chart sets
LeaderElectionReleaseOnCancel: true, so when the old pod receives SIGTERM it
releases the lease immediately instead of waiting out the 15-second
LeaseDuration. The new pod picks up the lease within milliseconds.
Snippets that were Ready=True before the restart stay in that condition via
cached state. A new pod that takes over as leader reconciles them on the next
watch event. If snippets remain degraded for more than a few seconds after a
restart, check the operator-watch-silent
runbook — it diagnoses the case where the operator’s own ClusterRole is missing
a verb so controller-runtime’s informer silently fails to start.
To force-restart the operator (e.g. after an upgrade):
kubectl rollout restart deployment/jaas --namespace jaas-system
Artifact retention and storage GC
Three independent mechanisms govern how long artifacts stay on disk (or in S3). Full storage backend configuration is in Storage and HA.
GC grace window (--artifact-gc-grace, default 5m)
When a snippet is re-rendered, the superseded revision drops out of the keep-set
but remains fetchable for --artifact-gc-grace after supersession. This closes
the pin→fetch race in which a Flux consumer reads status.artifact.url a moment
before the operator GC-prunes the old tarball. The window survives operator
restarts — supersession time is derived from on-disk storage metadata, not from
in-memory state.
Set 0 to restore eager pruning (one revision at a time, matching stock Flux
source-controller semantics). Tune lower when storage capacity is tight and all
consumers are in-cluster.
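As a rough illustration of the grace-window rule, the Go sketch below decides whether a superseded revision is still fetchable. The function name withinGrace and the example timestamps are hypothetical, and treating the boundary as exclusive (a revision exactly as old as the grace window is pruned) is an assumption; the real operator derives the supersession time from storage metadata as described above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace mirrors the documented --artifact-gc-grace default of 5m.
const defaultGrace = 5 * time.Minute

// withinGrace reports whether a superseded revision should still be served.
// age is the time elapsed since supersession. A grace of zero or less means
// eager pruning: superseded revisions are never kept.
// Assumption: the boundary is exclusive (age == grace is prunable).
func withinGrace(age, grace time.Duration) bool {
	if grace <= 0 {
		return false
	}
	return age < grace
}

func main() {
	// Hypothetical supersession time; in practice this comes from storage metadata.
	supersededAt := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	for _, now := range []time.Time{
		supersededAt.Add(2 * time.Minute),
		supersededAt.Add(6 * time.Minute),
	} {
		age := now.Sub(supersededAt)
		fmt.Printf("age=%s fetchable=%t\n", age, withinGrace(age, defaultGrace))
	}
}
```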
History retention (spec.history, default 1, max 50)
Per-snippet deliberate retention for rollback and blue-green flows. A downstream
consumer can pin to a specific sha256 digest indefinitely as long as that
revision is within the history keep-set. This is separate from the GC grace
window — it is explicit operator intent, not a race-protection mechanism.
Orphaned .tmp sweep (--storage-sweep-interval, --storage-sweep-max-tmp-age)
A Put that dies between writing the tempfile and the atomic rename leaves a
<rev>.tar.gz.tmp residue. The sweep goroutine runs on a ticker (default every
10m) and removes .tmp files older than --storage-sweep-max-tmp-age
(default 30m). The age floor ensures live writers are never raced.
Set --storage-sweep-interval 0 to disable the sweep entirely. The
jaas_storage_sweep_failures_total Prometheus counter signals failing sweep
passes.
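One sweep pass could look roughly like the following Go sketch. It is not the JaaS implementation: shouldRemove and sweep are hypothetical names, and using file modification time as the age of a tempfile is an assumption. The demo creates two tempfiles in a scratch directory, backdates one, and sweeps with the documented 30m age floor.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// defaultMaxTmpAge mirrors the documented --storage-sweep-max-tmp-age default.
const defaultMaxTmpAge = 30 * time.Minute

// shouldRemove applies the age floor: only .tmp files strictly older than
// maxAge are eligible, so a writer that is still active is left alone.
func shouldRemove(name string, age, maxAge time.Duration) bool {
	return strings.HasSuffix(name, ".tmp") && age > maxAge
}

// sweep walks root once and deletes eligible tempfiles.
// Assumption: the file's modification time stands in for its age.
func sweep(root string, maxAge time.Duration, now time.Time) (int, error) {
	removed := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if !shouldRemove(d.Name(), now.Sub(info.ModTime()), maxAge) {
			return nil
		}
		if err := os.Remove(path); err != nil {
			return err
		}
		removed++
		return nil
	})
	return removed, err
}

func main() {
	dir, err := os.MkdirTemp("", "sweep-demo")
	if err != nil {
		fmt.Println("mkdir:", err)
		return
	}
	defer os.RemoveAll(dir)

	stale := filepath.Join(dir, "rev-a.tar.gz.tmp")
	fresh := filepath.Join(dir, "rev-b.tar.gz.tmp")
	for _, p := range []string{stale, fresh} {
		if err := os.WriteFile(p, nil, 0o600); err != nil {
			fmt.Println("write:", err)
			return
		}
	}
	now := time.Now()
	old := now.Add(-time.Hour) // backdate one file past the age floor
	if err := os.Chtimes(stale, old, old); err != nil {
		fmt.Println("chtimes:", err)
		return
	}

	removed, err := sweep(dir, defaultMaxTmpAge, now)
	fmt.Println("removed:", removed, "err:", err)
}
```

In a long-running process a pass like this would sit behind a time.Ticker set to the sweep interval, with a failed pass incrementing the failure counter.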
Finalizer teardown and the WithdrawForced safety valve
Every JsonnetSnippet holds a finalizer (jaas.metio.wtf/finalizer) that
blocks Kubernetes garbage collection until the operator successfully calls
Publisher.Withdraw to remove the artifact from storage. If the backend is
permanently unavailable (S3 down, RBAC revoked, bucket deleted), the finalizer
would otherwise hold the snippet — and by extension its namespace — in
Terminating forever.
--max-withdraw-wait (default 1h) bounds how long the finalizer can hold.
Once the deadline passes, the operator:
- Emits a Warning WithdrawForced Kubernetes Event on the snippet.
- Drops the finalizer so the snippet can be garbage-collected.
The trade-off is a possible orphan tarball in storage. Recover it using the storage-recovery runbook.
Adjust the bound with the chart value operator.maxWithdrawWait. Lower it in
environments where namespace teardown latency is critical; raise it (or remove
the concern by fixing the backend) in environments where artifact-safety is
paramount.
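The decision the finalizer logic has to make on each reconcile of a deleted snippet can be sketched as a small Go function. Everything here is illustrative: the action type, decide, and errBackendDown are hypothetical names, and measuring the wait from the start of deletion with an inclusive deadline is an assumption rather than documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultMaxWithdrawWait mirrors the documented --max-withdraw-wait default.
const defaultMaxWithdrawWait = time.Hour

// action is what the reconciler does with the finalizer (hypothetical type).
type action int

const (
	actionRelease action = iota // withdraw succeeded: drop the finalizer
	actionRetry                 // withdraw failed: keep the finalizer and requeue
	actionForce                 // deadline passed: emit WithdrawForced, drop the finalizer
)

func (a action) String() string {
	return [...]string{"release", "retry", "force"}[a]
}

// errBackendDown stands in for any persistent storage failure.
var errBackendDown = errors.New("storage backend unavailable")

// decide picks the next step from the withdraw result and the time already
// spent waiting. Assumption: waited is measured from when deletion began,
// and the deadline is inclusive.
func decide(withdrawErr error, waited, maxWait time.Duration) action {
	switch {
	case withdrawErr == nil:
		return actionRelease
	case waited >= maxWait:
		return actionForce
	default:
		return actionRetry
	}
}

func main() {
	for _, waited := range []time.Duration{time.Minute, 59 * time.Minute, 61 * time.Minute} {
		fmt.Printf("waited=%s -> %s\n", waited, decide(errBackendDown, waited, defaultMaxWithdrawWait))
	}
	fmt.Printf("withdraw ok -> %s\n", decide(nil, 2*time.Hour, defaultMaxWithdrawWait))
}
```

Checking the withdraw result before the deadline means a backend that recovers late still gets a clean withdraw and leaves no orphan tarball.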
Upgrades
Calendar-based releases ship every Monday. The chart version and the binary version advance together.
helm upgrade --install jaas oci://ghcr.io/metio/helm-charts/jaas \
--namespace jaas-system \
--values my-values.yaml \
--wait --timeout 5m
The chart ships CRDs under templates/ so helm upgrade --install applies schema
changes automatically.
Before each upgrade, read MIGRATIONS.md:
- Releases that change spec.selector.matchLabels on the Deployment require a manual kubectl delete deployment/jaas first; that field is immutable and helm upgrade --install will fail otherwise.
- The pre-delete cleanup Job (operator.cleanupOnDelete.enabled: true, the default) runs on helm uninstall and drops every snippet’s finalizer so ExternalArtifact resources are unwound before the operator pod is removed. If the cleanup Job hangs, check operator.cleanupOnDelete.kubectlTimeout (default 2m) and the backend health.
Monitoring operational health
Key signals to watch:
- jaas_storage_sweep_failures_total: non-zero means the sweep goroutine is erroring; investigate storage backend health.
- jaas_snippet_reconcile_total{status!="Synced"}: an elevated rate means snippets are failing to render; cross-reference with the reason label and the relevant runbook.
- JaaSControllerWorkqueueDepthHigh PrometheusRule alert: the workqueue is backing up; the operator cannot keep up with the reconcile rate.
- /ready probe on the management port (default 8081): 503 after startup means the manager has not yet been elected or its cache has not synced.
All metrics are documented in Observability. All shipped alerts link to Runbooks.
Next steps
- Configuration reference: the full flag list with defaults and chart value equivalents.
- Runbooks: incident response procedures keyed to each Ready condition Reason.