Storage backend recovery

Not tied to a single Reason — this page covers what to do when the artifact store itself is degraded (PVC lost, S3 endpoint unavailable, the storage HTTP server is down). Downstream Flux consumers (kustomize-controller, helm-controller, grafana-operator) dereference ExternalArtifact.status.artifact.url to fetch tarballs; when that URL stops returning bytes, dependent resources stall.
Symptom
One or more of:
- Downstream Flux consumers report 404 Not Found or connection refused against the JaaS storage URL.
- kubectl get externalartifact --all-namespaces shows resources whose URL is unreachable from the consumer pods.
- The operator pod is healthy (Ready=True on snippets), but the storage Service is unresponsive.
- helm upgrade of the chart from persistence.enabled: false to true (or vice versa) caused a gap.
Triage: which backend are you running?
kubectl --namespace <jaas-ns> get deploy jaas \
--output jsonpath='{.spec.template.spec.containers[0].args}' \
| tr ',' '\n' | grep -E 'storage-backend|storage-path|s3-endpoint'
- --storage-backend=local → filesystem behind --storage-path. Either an emptyDir (chart default) or a PVC.
- --storage-backend=s3 → an external S3-compatible bucket; the in-pod storage HTTP server is a thin streaming proxy over minio-go.
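For scripted triage, the backend can be pulled out of the same args output. This is a minimal sketch, assuming the flag is passed as a single --storage-backend=<value> argument; the function name storage_backend_from_args is hypothetical and not part of JaaS.

```shell
# Hypothetical helper: reads the container args (the JSON array text printed
# by the jsonpath query above) on stdin and prints the configured backend.
# Assumes the flag is written as --storage-backend=<value>.
storage_backend_from_args() {
  backend=$(tr ',' '\n' \
    | sed -n 's/.*--storage-backend=\([A-Za-z0-9]*\).*/\1/p' \
    | head -n 1)
  # Print "unknown" when the flag is absent instead of guessing a default.
  printf '%s\n' "${backend:-unknown}"
}

# Usage (requires cluster access):
#   kubectl --namespace <jaas-ns> get deploy jaas \
#     --output jsonpath='{.spec.template.spec.containers[0].args}' \
#     | storage_backend_from_args
```

The "unknown" result is deliberate: this page does not state which backend applies when the flag is omitted, so the sketch does not assume one.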
Filesystem backend
PVC lost or replaced
Symptom: every ExternalArtifact URL returns 404 even though the snippet’s Ready=True. The Publisher writes idempotently on every reconcile, so making the operator re-render every snippet is the fix:
# Roll the operator — the cache is rebuilt from the apiserver and every
# snippet is reconciled. Each reconcile re-runs the Publisher, which
# writes the tarball back to disk. With a clean PVC, the gap closes in
# one reconcile loop.
kubectl --namespace <jaas-ns> rollout restart deploy/jaas
If reconciles do not produce tarballs again, force a reconcile per snippet:
kubectl annotate --all-namespaces jsonnetsnippet --all \
jaas.metio.wtf/reconcile-at=$(date -u +%FT%TZ) --overwrite
The window between PVC loss and the first re-render is the only outage downstream consumers see. With replicas.max: 1 (chart default) that window is bounded by the rollout time; with multi-replica HA + RWX PVC, the lease-holder writes immediately and the gap is sub-second.
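To see when the gap has actually closed, one option is to poll a known tarball URL until it returns bytes again. The sketch below is a generic retry loop; the name retry_until and the attempt and delay values are illustrative assumptions, not part of JaaS.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}

# Usage (with a port-forward to svc/jaas-storage already running):
#   retry_until 30 2 curl -fsS -o /dev/null \
#     "http://localhost:<port>/<namespace>/<snippet>/<rev>.tar.gz"
```

A non-zero exit after the last attempt is the signal to fall back to the per-snippet annotate step.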
emptyDir reset (pod restart)
persistence.enabled: false is fine for low-stakes deployments but every pod restart re-renders every snippet. The “fix” is to enable persistence:
helm --namespace <jaas-ns> upgrade jaas oci://ghcr.io/metio/helm-charts/jaas \
--reuse-values \
--set operator.storage.persistence.enabled=true \
--set operator.storage.persistence.size=10Gi
After the upgrade, follow the PVC-lost steps above to repopulate the new volume.
Storage HTTP server unreachable but operator healthy
Diagnose:
kubectl --namespace <jaas-ns> port-forward svc/jaas-storage <port>:8082 &
curl -fsSL http://localhost:<port>/<namespace>/<snippet>/<rev>.tar.gz | wc -c
If port-forward works but in-cluster fetches fail, look at NetworkPolicy:
kubectl get networkpolicy --all-namespaces | grep -i jaas
The chart’s optional NetworkPolicy locks the storage port to a single source-controller selector. If your Flux install lives elsewhere or carries different labels, the NetworkPolicy will silently drop the traffic. Either widen networkPolicy.fromSourceControllerSelector or disable the NetworkPolicy on this chart and rely on a cluster-wide policy.
S3 backend
Endpoint unreachable / 5xx from the provider
The pod-side storage HTTP server is a proxy. When the upstream S3 endpoint is down, the proxy returns 502/504 and downstream Flux consumers retry with backoff. Operator pod health is unaffected.
Diagnose:
# Pull a recent tarball directly to confirm it's the upstream
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
wget -O- http://localhost:8082/<namespace>/<snippet>/<rev>.tar.gz | wc -c
If the in-pod fetch fails too, check the operator logs for minio-go errors. Auth problems (expired session token, rotated access key) show up here distinctly from network problems.
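A rough first pass over the logs can separate the two failure classes. The sketch below is an assumption-heavy triage aid: the patterns are standard S3 error codes and common Go network error strings, not a documented JaaS log format, and the function name classify_s3_errors is hypothetical.

```shell
# Hypothetical triage helper: reads operator log lines on stdin and prints
# "auth", "network", or "unknown". Patterns are assumptions (standard S3
# error codes and common Go network error strings), not a JaaS log contract.
classify_s3_errors() {
  logs=$(cat)
  if printf '%s\n' "$logs" \
      | grep -Eq 'AccessDenied|InvalidAccessKeyId|SignatureDoesNotMatch|ExpiredToken'; then
    echo auth
  elif printf '%s\n' "$logs" \
      | grep -Eiq 'connection refused|no such host|i/o timeout|TLS handshake timeout'; then
    echo network
  else
    echo unknown
  fi
}

# Usage (requires cluster access):
#   kubectl --namespace <jaas-ns> logs deploy/jaas --tail=200 | classify_s3_errors
```

Auth patterns are checked first because a rotated key is usually the cheaper thing to rule out; "unknown" means the logs need to be read by hand.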
Bucket gone or wrong prefix
If the bucket was emptied or the --s3-prefix changed, the proxy returns 404 even though the snippet is Ready. Re-render every snippet to repopulate:
kubectl annotate --all-namespaces jsonnetsnippet --all \
jaas.metio.wtf/reconcile-at=$(date -u +%FT%TZ) --overwrite
The Publisher writes idempotently — running this against a working bucket is safe.
Credentials rotated
With static credentials (--s3-access-key / --s3-secret-key / inline chart values), a rotation requires a Deployment restart for minio-go to pick up the new values. With IAM/IRSA, the discovery chain re-reads at request time — the operator picks the new identity up automatically.
Force the new keys to take effect:
kubectl --namespace <jaas-ns> rollout restart deploy/jaas
Disk full (ENOSPC)
When the volume backing --storage-path fills up, Store.Put returns the kernel’s ENOSPC verbatim → Publisher.Publish wraps it → no specific sentinel matches → classified as transient ReasonSourceFetchFailed → controller-runtime backoff retries forever at the ~16 min cap. The operator pod stays healthy; every snippet using local storage starts looping.
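The ~16 min figure is consistent with the default per-item exponential rate limiter in client-go's workqueue (5 ms base delay, doubled per consecutive failure, capped at 1000 s). Assuming the operator runs with those defaults, which this page does not confirm, the retry delay can be worked out as follows; backoff_ms is a hypothetical name.

```shell
# Worked arithmetic, assuming the client-go default limiter values:
# base delay 5 ms, doubling per consecutive failure, cap 1000 s.
# $1 = number of consecutive failures (1 = first failure).
backoff_ms() {
  n=$1
  delay=5          # base delay in milliseconds (assumed default)
  max=1000000      # 1000 s cap in milliseconds (assumed default)
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt "$max" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$max" ]; then
    delay=$max
  fi
  echo "$delay"
}

# Usage:
#   backoff_ms 1    # first failure
#   backoff_ms 19   # first failure count at which the cap applies
```

Under these assumptions the delay reaches the cap on the 19th consecutive failure, because 5 ms × 2^18 exceeds 1000 s while 5 ms × 2^17 does not. From then on each failing snippet retries once per 1000 s (about 16.7 min).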
Symptom
- Multiple snippets simultaneously flip to Ready=False with messages mentioning no space left on device.
- JaaSControllerWorkqueueDepthHigh alert fires (the backoff queue saturates).
- kubectl --namespace <jaas-ns> exec deploy/jaas -- df -h /var/lib/jaas/artifacts shows the volume at 100%.
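For a scripted version of the df check, the usage column can be parsed from POSIX-format output. This is a sketch under assumptions: the helper names are hypothetical, the 90% threshold is an arbitrary example, and it presumes the container image ships a df that accepts -P.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the capacity
# column of the first filesystem row as a bare integer (e.g. 100).
volume_use_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeeds when usage is at or above the threshold
# given as $1 (defaults to 90, an arbitrary example value).
volume_nearly_full() {
  used=$(volume_use_percent)
  [ -n "$used" ] && [ "$used" -ge "${1:-90}" ]
}

# Usage (requires cluster access):
#   kubectl --namespace <jaas-ns> exec deploy/jaas -- \
#     df -P /var/lib/jaas/artifacts | volume_nearly_full 90 && echo "volume nearly full"
```

The -P flag keeps each filesystem on one line, which makes the column position reliable even when the device name is long.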
Recovery
1. Free space. Either resize the PVC (if operator.storage.persistence.enabled: true), increase operator.storage.sizeLimit (emptyDir), or prune retained revisions:
# Lower spec.history on noisy snippets so the next reconcile prunes
# older revisions. The Publisher's Prune step removes everything
# outside the keep-set, freeing space proportional to the change.
kubectl patch jsonnetsnippet <name> --type=merge --patch '{"spec":{"history":1}}'
For an immediate flush, force-prune by removing the artifact directory of a snippet you’re certain doesn’t need its history:
kubectl --namespace <jaas-ns> exec deploy/jaas -- \
rm -rf /var/lib/jaas/artifacts/<namespace>/<expendable-snippet>
2. Drive reconciliation. The backoff cap is ~16 min, so failing snippets retry within that window automatically. To force immediate re-render:
# Annotate every snippet that flipped to Ready=False:
kubectl get jsonnetsnippets --all-namespaces \
--output jsonpath='{range .items[?(@.status.conditions[?(@.type=="Ready")].status=="False")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
| xargs -n1 -I {} sh -c 'ns=$(echo "{}" | cut -d/ -f1); n=$(echo "{}" | cut -d/ -f2); \
kubectl --namespace "$ns" annotate jsonnetsnippet "$n" \
jaas.metio.wtf/reconcile-at=$(date -u +%FT%TZ) --overwrite'
3. Prevention. Set operator.storage.maxArtifactBytes to cap individual snippet renders before they hit the disk. Use JaaSSnippetArtifactGrowing (opt-in PrometheusRule alert) to catch creep before saturation.
OOM during render (kubelet kills the operator pod mid-publish)
A snippet that evaluates into a multi-MB JSON tree can push the operator past its resources.memory limit. The kubelet SIGKILLs the pod. Effects:
- The pod restarts cleanly. Probes flip back to Ready within seconds.
- The leader-election lease is dropped on the killed pod’s process death; the next replica picks it up immediately (or, on single-replica installs, the new pod re-acquires).
- Mid-write .tmp files in the local store are orphaned because the Store.Put was interrupted between Create and the atomic Rename. The background Sweep (default every 10 min, configurable via operator.storage.sweep.interval) cleans them up once their ModTime falls outside the maxTmpAge window.
- The snippet that triggered the OOM keeps failing in the same way on the next reconcile because its rendered output is what blew memory in the first place. The pod loop-restarts until either the snippet is fixed or its memory cost falls below the limit.
Symptom
- Operator pod restarts every few minutes; kubectl describe pod shows Last State: Terminated, Reason: OOMKilled.
- JaaSOperatorPodDown alert fires (the restart window is shorter than the recovery, so probes flap).
- One specific snippet correlates with each restart.
Diagnose
The killing snippet is whichever the operator was reconciling when memory peaked. Easiest way to identify:
kubectl --namespace <jaas-ns> logs deploy/jaas --previous --tail=200 \
| grep -B2 -i 'reconcil\|publish' | tail -30
The last Reconcile log line before the kill names the snippet. Confirm via jaas_snippet_rendered_bytes if the operator’s metrics endpoint was reachable before the kill (the histogram captures bytes per Synced reconcile; a runaway snippet stands out).
Remediate
1. Cap the runaway snippet. Set operator.storage.maxArtifactBytes cluster-wide to refuse renders past a threshold (e.g., 16777216 for 16 MiB). The Publisher fails the snippet with ReasonArtifactTooLarge instead of attempting the write. The operator pod stops OOM-restarting.
2. Raise operator memory. The chart’s resources.memory default is conservative (64Mi); a cluster with large rendered artifacts may need 256Mi or more. Update via:
helm --namespace <jaas-ns> upgrade jaas oci://ghcr.io/metio/helm-charts/jaas \
--reuse-values --set resources.memory=256Mi
3. Wait for Sweep. The orphan .tar.gz.tmp files clear automatically once the surrounding issue is fixed and maxTmpAge (default 30 min) elapses. jaas_storage_sweep_failures_total flags any persistent issue.
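The byte value for the artifact cap is plain MiB arithmetic (16 × 1024 × 1024 = 16777216). A small sketch for deriving other thresholds; mib_to_bytes is a hypothetical helper name.

```shell
# Hypothetical helper: convert a size in MiB to bytes for use as the
# operator.storage.maxArtifactBytes value.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Usage (requires cluster access):
#   helm --namespace <jaas-ns> upgrade jaas oci://ghcr.io/metio/helm-charts/jaas \
#     --reuse-values \
#     --set operator.storage.maxArtifactBytes="$(mib_to_bytes 16)"
```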
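For S3 backends, OOM during a multipart upload leaves an incomplete upload at the S3 endpoint. Many providers expire these automatically; on AWS S3, expiry depends on a bucket lifecycle rule (AbortIncompleteMultipartUpload, commonly set to 7 days) rather than a built-in default. No JaaS-side action is needed.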
Multi-replica considerations
With leader election on (the chart default when operator mode is enabled), only the lease-holder writes to storage. A storage-incident on the lease-holder is the worst case: the standby reads but cannot fill the gap until the lease transfers. To force a handover during a storage incident on one replica:
kubectl --namespace <jaas-ns> delete lease <release-name>-operator
The next replica acquires the lease within LeaseDuration (15s default), and its Publisher writes against its own (presumably healthy) view of the backend.
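To confirm the handover happened, the lease's holder can be compared before and after the delete. A minimal sketch, assuming the lease is a standard coordination.k8s.io Lease (which exposes spec.holderIdentity); the helper names are hypothetical.

```shell
# Hypothetical helper: prints the current holder of the leader-election lease.
# $1 = namespace, $2 = lease name (<release-name>-operator per this page).
lease_holder() {
  kubectl --namespace "$1" get lease "$2" \
    --output jsonpath='{.spec.holderIdentity}'
}

# Hypothetical helper: succeeds once the holder is set and differs from $3,
# the identity recorded before the lease was deleted.
lease_moved() {
  current=$(lease_holder "$1" "$2") || return 1
  [ -n "$current" ] && [ "$current" != "$3" ]
}

# Usage (requires cluster access):
#   before=$(lease_holder <jaas-ns> <release-name>-operator)
#   kubectl --namespace <jaas-ns> delete lease <release-name>-operator
#   sleep 15   # LeaseDuration default noted above
#   lease_moved <jaas-ns> <release-name>-operator "$before" && echo "lease moved"
```

On a single-replica install the restarted pod may or may not report a different identity, so this check is mainly useful with two or more replicas.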
Prevention
- Use persistence.enabled: true in production. Default-off is for quick demos.
- Run the chart’s opt-in PrometheusRule (operator.metrics.prometheusRule.enabled: true); JaaSSnippetArtifactGrowing catches runaway tarballs before they fill the PVC.
- Set operator.storage.maxArtifactBytes to cap pathological snippets at admission time, not after they’ve written to disk.
- For S3, configure a bucket lifecycle policy that does not delete tarballs the operator still considers live. The Publisher’s Prune only deletes revisions the snippet’s status.history no longer references.
JaaSStorageSweepFailures alert
Linked from the alert by name. The sweep is a background GC that removes orphaned .tar.gz.tmp residue left by Puts whose process died after the tmpfile landed but before the rename. The reconcile hot path is unaffected — Put still works — but stale .tmp files accumulate until the underlying issue is fixed.
Symptom: jaas_storage_sweep_failures_total increases over time; JaaSStorageSweepFailures alert fires after >3 failures/hour (configurable).
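To sanity-check the counter against the >3 failures/hour threshold without Prometheus, two samples of jaas_storage_sweep_failures_total taken a known interval apart can be converted to an hourly rate. This is illustrative arithmetic only: the function name is hypothetical, and the alert's actual expression is not shown on this page.

```shell
# Hypothetical helper: approximate failures per hour from two counter samples.
# $1 = earlier counter value, $2 = later counter value,
# $3 = seconds elapsed between the two samples.
sweep_failures_per_hour() {
  prev=$1; curr=$2; elapsed=$3
  [ "$elapsed" -gt 0 ] || return 1
  if [ "$curr" -lt "$prev" ]; then
    # The counter went backwards, so the pod restarted in between;
    # count only what accumulated since the reset.
    delta=$curr
  else
    delta=$(( curr - prev ))
  fi
  echo $(( delta * 3600 / elapsed ))
}

# Usage:
#   sweep_failures_per_hour 10 12 1800
```

Integer division rounds down, so treat a result equal to the threshold as worth a closer look.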
Diagnose:
# Operator logs carry the underlying sweep error:
kubectl --namespace <jaas-ns> logs deploy/jaas --tail=200 | grep "Storage sweep failed"
# For local backend: check the volume's free space + permissions.
kubectl --namespace <jaas-ns> exec deploy/jaas -- df -h /var/lib/jaas/artifacts
kubectl --namespace <jaas-ns> exec deploy/jaas -- ls -la /var/lib/jaas/artifacts
# For S3 backend: sweep is a no-op (Put is atomic, no .tmp residue),
# so this alert firing on S3 is a wiring bug. Confirm backend:
kubectl --namespace <jaas-ns> get deploy jaas --output jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep storage-backend
Remediate:
- Disk full → increase the PVC size or shrink operator.storage.maxArtifactBytes.
- Permission errors → ensure the operator’s securityContext.runAsUser matches the PVC’s filesystem ownership; reset with chown on a one-shot Job.
- Sustained S3 listing throttling → unlikely on the local backend; for S3 this alert shouldn’t fire at all because Sweep is a no-op there.
Manual cleanup once the underlying issue is fixed:
kubectl --namespace <jaas-ns> exec deploy/jaas -- find /var/lib/jaas/artifacts -name '*.tar.gz.tmp' -mmin +30 -delete
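Before deleting, it can help to list what the cleanup would remove. The sketch below swaps -delete for -print in the same find expression; preview_stale_tmp is a hypothetical helper that runs wherever the volume is mounted, and it assumes a find that supports -mmin (as the command above already does).

```shell
# Hypothetical helper: list orphaned tmp files older than a given age
# without deleting them.
# $1 = artifact directory, $2 = minimum age in minutes (defaults to 30).
preview_stale_tmp() {
  find "$1" -name '*.tar.gz.tmp' -mmin "+${2:-30}" -print
}

# Usage (in-pod equivalent, requires cluster access):
#   kubectl --namespace <jaas-ns> exec deploy/jaas -- \
#     find /var/lib/jaas/artifacts -name '*.tar.gz.tmp' -mmin +30 -print
```

If the list contains only .tar.gz.tmp files older than the expected window, rerun with -delete.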
WithdrawForced event on snippet deletion
If a snippet stuck in Terminating carries a Warning WithdrawForced Kubernetes Event, the operator has already done what it could — the finalizer was dropped after --max-withdraw-wait (default 1h) of failing Withdraws against the backend, and the snippet itself is GC’d. The tarball it owned is now orphaned in storage. To clean up:
# Read the elapsed time + last backend error from the event message:
kubectl describe jsonnetsnippet <name>
# Locate the orphan in the configured backend:
# local: <--storage-path>/<namespace>/<name>/<rev>.tar.gz
# s3: <--s3-prefix>/<namespace>/<name>/<rev>.tar.gz
# Remove it once the backend is reachable again.
A force-drop should be rare — it means the backend was broken for the full wait window. Investigate why before lowering maxWithdrawWait: aggressive timeouts make transient apiserver/S3 incidents cause orphans you’d otherwise have recovered from naturally.
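The orphan's location follows the same layout on both backends, so it can be assembled from the storage root (or S3 prefix), namespace, snippet name, and revision. A minimal sketch; orphan_location is a hypothetical helper, and the revision still has to be read from the event or the backend listing.

```shell
# Hypothetical helper: build the orphaned tarball's path or object key.
# $1 = value of --storage-path (local) or --s3-prefix (s3), may be empty
# $2 = namespace, $3 = snippet name, $4 = revision
orphan_location() {
  root=${1%/}                     # drop one trailing slash, if any
  rel="$2/$3/$4.tar.gz"
  if [ -n "$root" ]; then
    printf '%s/%s\n' "$root" "$rel"
  else
    printf '%s\n' "$rel"          # empty prefix: key starts at the namespace
  fi
}

# Usage:
#   orphan_location /var/lib/jaas/artifacts <namespace> <name> <rev>
```

The output is only a path or key; removing the object remains a manual step against whichever backend is configured.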
Related runbooks
- artifacttoolarge.md — one snippet’s output exceeds the cap (different symptom: snippet Ready=False, not “URL unreachable”)
- sourcefetchfailed.md — JaaS consuming an upstream artifact, not its own storage