Service level objectives


The JaaS operator tracks two service-level objectives. Each is an objective on a service-level indicator (SLI) computed from the metrics and measured over a rolling window. The published dashboard renders them, and the Helm chart can alert on them.

SLO 1 — reconcile availability

Objective: ≥ 99% of syncing reconciles reach Ready=True, over a 28-day window.

The SLI counts only reconciles that were actually trying to sync. The intentional Pending and Suspended states are excluded from both the numerator and the denominator, so they count as neither successes nor failures:

sum(rate(jaas_snippet_reconcile_total{status="True"}[28d]))
/
(
  sum(rate(jaas_snippet_reconcile_total{status="True"}[28d]))
  + sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[28d]))
)

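The error budget is the 1% of reconciles allowed to fail over the window. The expression below gives the remaining budget, normalised so that 1 is full and 0 is exhausted. For example, an availability of 0.995 leaves (0.995 - 0.99) / 0.01 = 0.5, half the budget; a negative result means the budget is overspent.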

(<availability> - 0.99) / (1 - 0.99)

SLO 2 — reconcile latency

Objective: the JsonnetSnippet controller’s p95 reconcile duration stays below 30s over the window.

histogram_quantile(0.95, sum by (le) (
  rate(controller_runtime_reconcile_time_seconds_bucket{controller="jsonnetsnippet"}[28d])
))

See the SLOs on the dashboard

The published dashboard opens with an SLO band: current availability against its objective, error budget remaining, p95 latency against its objective, and an availability-versus-objective trend. The operator-internals panels below explain any movement.

The objectives and window are top-level arguments, so you set them per environment when you render the dashboard through a JsonnetSnippet:

spec:
  tlas:
    datasource: ["prometheus"]   # your Prometheus datasource UID
    window: ["28d"]              # SLO window
    availabilityTarget: ["0.99"] # 99%
    latencyTarget: ["30"]        # seconds

A short window is fine for a demo; a real 28d SLI needs at least that much Prometheus retention. For long windows, precompute the SLI with a recording rule and point window at the recorded series instead of a raw rate(...[28d]).
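One way to precompute the SLI is sketched below as a standalone Prometheus rule file. It is a sketch under assumptions, not the chart's shipped configuration: the group name and the three recorded series names are hypothetical, and the 5m inner range and 1m evaluation interval are illustrative choices. The ratio of summed short-window rates approximates the raw 28d ratio rather than reproducing it exactly.

```yaml
groups:
  - name: jaas-slo-recording            # hypothetical group name
    interval: 1m                        # assumed evaluation interval
    rules:
      # Short-window rate of reconciles that reached Ready=True.
      - record: jaas:reconcile_success:rate5m      # hypothetical series name
        expr: sum(rate(jaas_snippet_reconcile_total{status="True"}[5m]))

      # Short-window rate of genuine failures (Suspended/Pending excluded).
      # "or vector(0)" keeps the series present when nothing is failing,
      # so the ratio below does not come back empty.
      - record: jaas:reconcile_failure:rate5m      # hypothetical series name
        expr: |
          sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[5m]))
          or vector(0)

      # 28-day availability built from the two cheap recorded series
      # instead of a raw rate(...[28d]) over every underlying series.
      - record: jaas:reconcile_availability:ratio_28d   # hypothetical series name
        expr: |
          sum_over_time(jaas:reconcile_success:rate5m[28d])
          /
          (
            sum_over_time(jaas:reconcile_success:rate5m[28d])
            + sum_over_time(jaas:reconcile_failure:rate5m[28d])
          )
```

The recorded ratio only becomes meaningful once the two inner series have accumulated history, so a freshly deployed rule under-represents the window until 28 days of samples exist.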

Alert on the budget

The shipped alerts already page on the causes of SLO loss (reconcile errors, latency, eval saturation). To alert on the objective itself, add an availability-SLO rule through the chart’s extraRules passthrough — here it fires when recent availability drops below 99%:

operator:
  metrics:
    prometheusRule:
      enabled: true
      extraRules:
        - alert: JaaSReconcileAvailabilityBelowSLO
          expr: |
            (
              sum(rate(jaas_snippet_reconcile_total{status="True"}[1h]))
              /
              (
                sum(rate(jaas_snippet_reconcile_total{status="True"}[1h]))
                + sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[1h]))
              )
            ) < 0.99
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: JaaS reconcile availability is below its 99% objective

The alert measures a short recent window (1h) so it fires while the budget is actively burning, whereas the dashboard’s window argument covers the full SLO window. See Alerting for the rest of the catalog.
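A stricter variant alerts on burn rate, the error ratio divided by the 1% budget, rather than on the raw threshold. The sketch below is an assumption-laden example and not a shipped rule: the alert name, the critical severity, the for duration, and the 14.4 multiplier are illustrative. 14.4 is the conventional fast-burn factor for a 30-day window; over a 28-day window (672 hours) it spends about 14.4 / 672 ≈ 2.1% of the budget per hour. The 5m clause makes the alert clear soon after the failures stop.

```yaml
operator:
  metrics:
    prometheusRule:
      enabled: true
      extraRules:
        # Hypothetical alert name; fires when the failure ratio exceeds
        # 14.4x the 1% budget over both the last hour and the last 5 minutes.
        - alert: JaaSReconcileBudgetFastBurn
          expr: |
            (
              sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[1h]))
              /
              (
                sum(rate(jaas_snippet_reconcile_total{status="True"}[1h]))
                + sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[1h]))
              )
            ) > (14.4 * 0.01)
            and
            (
              sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[5m]))
              /
              (
                sum(rate(jaas_snippet_reconcile_total{status="True"}[5m]))
                + sum(rate(jaas_snippet_reconcile_total{status="False",reason!~"Suspended|Pending"}[5m]))
              )
            ) > (14.4 * 0.01)
          for: 2m                       # assumed debounce
          labels:
            severity: critical          # assumed severity
          annotations:
            summary: JaaS reconcile error budget is burning at more than 14.4x the sustainable rate
```

If the objective changes, the 0.01 factor has to change with it, since it is 1 minus the availability target.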