Alerting

Talks about: alerts, observability, operator, and prometheus

JaaS turns a sustained problem into a notification two ways: a Prometheus PrometheusRule that pages on its metrics , and Kubernetes Events that Flux’s notification-controller can route to chat or e-mail.

The binary

The operator emits a standard Kubernetes Event on every Ready-condition transition — Normal for Synced, Warning for every other reason. The reason string fills both the event reason and action. These Events need no flag to enable; they are written whenever the operator reconciles.

The operator also threads runbook links into its own status automatically: every actionable Ready-condition Message gains a (runbook: https://jaas.projects.metio.wtf/runbooks/<reason>/) suffix, so kubectl describe jsonnetsnippet points straight at the matching page. Healthy or intentional states (Synced, Suspended, Pending) get no suffix.

Routing Events through Flux

Routing the Events is Flux’s notification-controller: target an Alert CR at kind: JsonnetSnippet and JaaS needs no Provider/Alert plumbing of its own.

apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: jaas-snippets
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: warn   # 'info' to include success events
  eventSources:
    - kind: JsonnetSnippet
      name: '*'

Wire whatever Provider you already use for Flux source CRs; see the Flux notification-controller documentation for provider configuration.

The alert catalog

The chart ships a starter alert set on the custom metrics plus a handful of controller-runtime signals. Each alert carries its remediation page as a runbook_url annotation so Alertmanager renders a direct link:

Alert	Severity	Fires when	Threshold knobs (default)	Runbook
`JaaSSnippetReconcileErrorsHigh`	warning	A snippet keeps flipping to Ready=False (excluding `Synced`/`Suspended`/`Pending`).	`reconcileErrorRate` (0.1/s), `reconcileErrorDuration` (10m)	per-reason page under `/runbooks/`
`JaaSSnippetArtifactGrowing`	warning	p99 `jaas_snippet_rendered_bytes` exceeds the size ceiling.	`artifactSizeBytes` (16 MiB), `artifactSizeDuration` (30m)	artifacttoolarge
`JaaSControllerWorkqueueDepthHigh`	warning	The `jsonnetsnippet` workqueue can’t drain.	`workqueueDepth` (50), `workqueueDuration` (15m)	workqueue-saturation
`JaaSReconcileLatencyHigh`	warning	p99 reconcile time crosses the ceiling.	`reconcileLatencySeconds` (30), `reconcileLatencyDuration` (15m)	reconcile-latency
`JaaSOperatorPodDown`	critical	A jaas pod stays NotReady.	`podDownDuration` (5m)	operator-pod-down
`JaaSStorageSweepFailures`	warning	Background sweeps fail per hour above the floor.	`sweepFailuresPerHour` (3), `sweepFailuresDuration` (30m)	storage-recovery
`JaaSWebhookCertRenewalFailing`	critical	Self-signed cert renewal fails per hour above the floor.	`webhookCertRenewalFailuresPerHour` (1), `webhookCertRenewalFailuresDuration` (30m)	webhook-cert-renewal
`JaaSTenantTokenMintFailing`	warning	Token mints fail for a `(namespace, serviceAccount)` pair.	`tenantTokenMintFailureRate` (0.01/s), `tenantTokenMintFailureDuration` (10m)	rbacdenied
`JaaSForceDropsAccumulating`	warning	Snippet finalizers are force-dropped per hour above the floor.	`forceDropsPerHour` (0), `forceDropsDuration` (5m)	storage-recovery
`JaaSCRDWatchEngagementFailing`	warning	A Flux source watch won’t engage for a GVK.	`crdWatchEngagementFailuresPerHour` (1), `crdWatchEngagementFailuresDuration` (30m)	crd-watch-engagement
`JaaSEvalSaturation`	warning	In-flight evals exceed the saturation ratio of the cap (guarded on the cap being non-zero).	`evalSaturationRatio` (0.9), `evalSaturationDuration` (10m)	eval-saturation
`JaaSEvalRejected`	warning	The semaphore turns evals away per second above the floor.	`evalRejectedRate` (0.05/s), `evalRejectedDuration` (10m)	eval-saturation
`JaaSEvalLeakedGoroutines`	warning	Orphan eval goroutines persist above the floor — a runaway snippet.	`evalLeakedFloor` (0), `evalLeakedDuration` (5m)	eval-saturation

JaaSSnippetReconcileErrorsHigh templates its runbook URL on the failing reason, so it lands on the matching per-reason page under /runbooks/ . Each Ready-condition reason and each alert maps to a remediation page there.

The Helm chart

The PrometheusRule is opt-in under operator.metrics.prometheusRule and needs the Prometheus Operator’s monitoring.coreos.com/v1 API in the cluster:

operator:
  enabled: true
  metrics:
    enabled: true
    prometheusRule:
      enabled: true
      interval: 30s
      # Labels your Prometheus instance selects PrometheusRules on.
      labels:
        release: kube-prometheus
      # Merged onto every rendered alert — route all jaas alerts
      # through one Alertmanager receiver.
      extraAlertLabels:
        team: platform
      # Annotation key the runbook URL lands under (Prometheus-operator
      # convention is runbook_url).
      runbookAnnotationKey: runbook_url

Every threshold is a knob under operator.metrics.prometheusRule.thresholds, so the noise floor is tunable without copy-pasting rule bodies. To silence a built-in alert, raise its threshold to an impossibly high value — there is no per-alert disable toggle, and the threshold pattern keeps “this alert is intentionally inert” visible in the chart values. Cluster-specific rules append under a separate group via operator.metrics.prometheusRule.extraRules.