Production

The chart’s defaults are safe for an initial install but not optimised for sustained production workloads. Work through these decisions before exposing JaaS to real traffic. Each links to the detailed guide.
1. Pick a storage backend
The single largest decision. Artifacts must survive pod restarts and, for HA, be readable by every replica simultaneously.
| Backend | Persistence | Multi-replica HA |
|---|---|---|
| local + emptyDir (chart default) | No | No |
| local + RWO PVC | Yes | No (single replica only) |
| local + RWX PVC | Yes | Yes (requires an RWX-capable storage class) |
| s3 | Yes | Yes (leader writes, all replicas read) |
For cloud installs, s3 (AWS S3, MinIO, Ceph RGW, GCS S3-compat API) is the
recommended backend. Pair it with leader election (on by default) so only the
lease-holder writes. For on-prem, a PVC with the access mode your storage class
supports is the practical path.
Full configuration options and artifact retention are covered in Storage and HA,
including the garbage-collection grace period (--artifact-gc-grace) that keeps a
just-superseded revision fetchable for a short window, so a consumer that read
status.artifact moments before pruning doesn’t get a 404 on the revision it pinned.
Minimal S3 values (IRSA on EKS):
operator:
  enabled: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/jaas-operator
  storage:
    backend: s3
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-jaas-artifacts
      prefix: prod
      region: eu-west-1
      useSSL: true
      # Leave accessKey/secretKey empty; the IAM role comes from the SA annotation.
2. Size CPU and memory
The chart defaults (64 MiB memory, 32m CPU) are fine for a quickstart but will OOM under sustained snippet rendering. go-jsonnet has no mid-evaluation cancellation, so an in-flight evaluation effectively cannot be stopped once it has started. CPU and memory limits must therefore accommodate the worst-case concurrent eval load.
Set --max-artifact-bytes to cap the rendered output size per snippet so a
runaway template can’t allocate unbounded memory before the timeout fires.
See Evaluation and security for the concurrent-eval cap, timeout defaults, and how to tune them.
resources:
  memory: 256Mi
  cpu: 100m
operator:
  storage:
    maxArtifactBytes: 16777216 # 16 MiB; fails with ReasonArtifactTooLarge
3. Enable observability
The chart ships a metrics endpoint (on by default at port 8083), an opt-in
ServiceMonitor, and an opt-in PrometheusRule with a starter alert set. Turn
them on and wire the Prometheus selector labels before deploying:
operator:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      labels:
        release: kube-prom # match your Prometheus's serviceMonitorSelector
    prometheusRule:
      enabled: true
      labels:
        release: kube-prom # match your Prometheus's ruleSelector
      extraAlertLabels:
        team: platform # Alertmanager routing label
serviceMonitor.labels, prometheusRule.labels, and
prometheusRule.extraAlertLabels are three distinct label knobs:
serviceMonitor.labels and prometheusRule.labels control which Prometheus
instance picks up each CRD object; extraAlertLabels adds routing labels to
individual alerts (for Alertmanager), not to the rule object.
The shipped alert set and all custom JaaS metrics are documented in Observability.
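For orientation, the selector labels above have to match what the Prometheus custom resource selects. The following is a minimal sketch of that side of the wiring, assuming a Prometheus Operator install; the object name and namespace are hypothetical, and the `release: kube-prom` label simply mirrors the example values.

```yaml
# Sketch of a Prometheus Operator CR whose selectors match the labels
# set via serviceMonitor.labels and prometheusRule.labels.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prom        # hypothetical name
  namespace: monitoring  # hypothetical namespace
spec:
  # Picks up ServiceMonitor objects carrying this label.
  serviceMonitorSelector:
    matchLabels:
      release: kube-prom
  # Picks up PrometheusRule objects carrying this label.
  ruleSelector:
    matchLabels:
      release: kube-prom
  # An empty selector matches all namespaces; when these fields are unset,
  # only the Prometheus object's own namespace is searched.
  serviceMonitorNamespaceSelector: {}
  ruleNamespaceSelector: {}
```

If the ServiceMonitor or PrometheusRule never shows up in Prometheus, comparing these selectors against the labels on the rendered objects is usually the first thing to check.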
4. Enable the admission webhook
The webhook rejects spec invariant violations — ext-var key collisions, library
alias shadowing, import cycles — at kubectl apply time instead of at
reconcile time. Pick a cert mode:
# Option A: cert-manager (recommended when cert-manager is installed)
operator:
  webhook:
    enabled: true
    certMode: cert-manager
    certManager:
      enabled: true
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod
# Option B: self-signed (no cert-manager required)
operator:
  webhook:
    enabled: true
    certMode: self-signed
The default failurePolicy: Fail blocks every JsonnetSnippet create/update
cluster-wide when the webhook is unavailable. During a rolling update the window
is typically under five seconds (leader election releases the lease on
SIGTERM). If your GitOps tooling cannot tolerate that, scope the webhook via
operator.webhook.namespaceSelector or operator.webhook.objectSelector, or
switch to failurePolicy: Ignore and rely on the reconciler-side fallback.
Full cert provisioning and failurePolicy trade-offs are covered in Admission webhook.
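As an illustration of the scoping option, the sketch below limits the webhook to namespaces that opt in through a label. It assumes the chart passes operator.webhook.namespaceSelector through as a standard Kubernetes label selector; the label key and value are hypothetical.

```yaml
# Sketch only: scope the admission webhook to opted-in namespaces so that a
# webhook outage cannot block JsonnetSnippet writes anywhere else.
operator:
  webhook:
    enabled: true
    namespaceSelector:
      matchLabels:
        jaas.example.com/validate: "true" # hypothetical opt-in label
```

Namespaces without the label would then bypass admission-time validation, so their snippets would only be checked at reconcile time.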
5. Lock down tenant RBAC
Every JsonnetSnippet runs impersonated as its spec.serviceAccountName (or
the --default-service-account fallback). The operator’s own ServiceAccount
only needs serviceaccounts/token: create — every other API call (library
reads, source fetches, ExternalArtifact writes) is done under the tenant SA’s
RBAC, so a compromised snippet can only reach what its SA is allowed to.
Minimum per-tenant Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: <tenant-namespace>
  name: jaas-tenant
rules:
  - apiGroups: [source.toolkit.fluxcd.io]
    resources: [externalartifacts]
    verbs: [get, create, update, patch]
  - apiGroups: [jaas.metio.wtf]
    resources: [jsonnetlibraries]
    verbs: [get, list]
  - apiGroups: [source.toolkit.fluxcd.io]
    resources: [gitrepositories, ocirepositories, buckets, externalartifacts]
    verbs: [get]
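A Role grants nothing until it is bound to a subject. The following is a minimal sketch of the tenant ServiceAccount and the RoleBinding that attaches the jaas-tenant Role to it; the ServiceAccount name is hypothetical and would be whatever the tenant's snippets reference in spec.serviceAccountName.

```yaml
# Sketch only: tenant ServiceAccount plus the binding to the Role above.
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: <tenant-namespace>
  name: jaas-tenant # hypothetical; referenced from spec.serviceAccountName
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: <tenant-namespace>
  name: jaas-tenant
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: jaas-tenant
subjects:
  - kind: ServiceAccount
    name: jaas-tenant
    namespace: <tenant-namespace>
```

Keeping one ServiceAccount per tenant namespace means that a snippet's reach stays limited to that namespace's Role.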
When operator.watchNamespaces is set, the chart automatically switches from a
ClusterRoleBinding to per-namespace RoleBindings. Full RBAC layout and
NetworkPolicy notes for spec.sourceRef fetches are in Tenancy and RBAC.
6. Plan for upgrades and disaster recovery
Calendar-based releases run every Monday. Upgrade the chart with helm upgrade --install; the chart ships CRDs under templates/, so schema changes apply
automatically. Read MIGRATIONS.md before each upgrade: releases that change
immutable spec.selector.matchLabels fields require a manual
kubectl delete deploy/jaas first.
Three runbooks to bookmark before go-live:
- storage-recovery — PVC loss, S3 outages, disk-full, downstream 404s.
- rbacdenied — tenant SA missing a verb, ExternalArtifact write forbidden, Flux source CRD not installed.
- operator-watch-silent — the one failure mode JaaS cannot surface in snippet status (operator’s own ClusterRole missing a verb so controller-runtime’s informer silently fails).
A complete production values.yaml
replicas:
  min: 2
  max: 5
resources:
  memory: 256Mi
  cpu: 100m
image:
  pullPolicy: IfNotPresent
namespace:
  create: true
  podSecurity:
    enforce: restricted
operator:
  enabled: true
  defaultServiceAccount: "" # force per-snippet spec.serviceAccountName
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/jaas-operator # TODO
  storage:
    backend: s3
    s3:
      endpoint: s3.amazonaws.com # TODO
      bucket: my-jaas-artifacts # TODO
      prefix: prod
      region: eu-west-1 # TODO
    maxArtifactBytes: 16777216
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      labels:
        release: kube-prom # TODO
    prometheusRule:
      enabled: true
      labels:
        release: kube-prom # TODO
      extraAlertLabels:
        team: platform # TODO
  webhook:
    enabled: true
    certMode: cert-manager
    certManager:
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod # TODO
  leaderElection:
    enabled: true
  cleanupOnDelete:
    enabled: true
Save the values above as production-values.yaml, then apply with:
helm upgrade --install jaas oci://ghcr.io/metio/helm-charts/jaas \
  --namespace jaas-system --create-namespace \
  --values production-values.yaml \
  --wait --timeout 5m
Next steps
- Operations — day-two tasks: rolling restarts, storage sweeping, finalizer teardown.
- Configuration reference — every flag and default.
- Runbooks — incident response.