SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Do not use this skill when
- •The task is unrelated to slo implementation
- •You need a different domain or tool outside this scope
Instructions
- •Clarify goals, constraints, and required inputs.
- •Apply relevant best practices and validate outcomes.
- •Provide actionable steps and verification.
- •If detailed examples are required, open
resources/implementation-playbook.md.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
Use this skill when
- •Define service reliability targets
- •Measure user-perceived reliability
- •Implement error budgets
- •Create SLO-based alerts
- •Track reliability goals
SLI/SLO/SLA Hierarchy
code
SLA (Service Level Agreement) ↓ Contract with customers SLO (Service Level Objective) ↓ Internal reliability target SLI (Service Level Indicator) ↓ Actual measurement
Defining SLIs
Common SLI Types
1. Availability SLI
promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
2. Latency SLI
promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
3. Durability SLI
code
# Successful writes / Total writes sum(storage_writes_successful_total) / sum(storage_writes_total)
Reference: See references/slo-definitions.md
Setting SLO Targets
Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
Choose Appropriate SLOs
Consider:
- •User expectations
- •Business requirements
- •Current performance
- •Cost of reliability
- •Competitor benchmarks
Example SLOs:
yaml
slos:
- name: api_availability
target: 99.9
window: 28d
sli: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
- name: api_latency_p95
target: 99
window: 28d
sli: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
Error Budget Calculation
Error Budget Formula
code
Error Budget = 1 - SLO Target
Example:
- •SLO: 99.9% availability
- •Error Budget: 0.1% = 43.2 minutes/month
- •Current Error: 0.05% = 21.6 minutes/month
- •Remaining Budget: 50%
Error Budget Policy
yaml
error_budget_policy:
- remaining_budget: 100%
action: Normal development velocity
- remaining_budget: 50%
action: Consider postponing risky changes
- remaining_budget: 10%
action: Freeze non-critical changes
- remaining_budget: 0%
action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
SLO Implementation
Prometheus Recording Rules
yaml
# SLI Recording Rules
groups:
- name: sli_rules
interval: 30s
rules:
# Availability SLI
- record: sli:http_availability:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency SLI (requests < 500ms)
- record: sli:http_latency:ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
- name: slo_rules
interval: 5m
rules:
# SLO compliance (1 = meeting SLO, 0 = violating)
- record: slo:http_availability:compliance
expr: sli:http_availability:ratio >= bool 0.999
- record: slo:http_latency:compliance
expr: sli:http_latency:ratio >= bool 0.99
# Error budget remaining (percentage)
- record: slo:http_availability:error_budget_remaining
expr: |
(sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
# Error budget burn rate
- record: slo:http_availability:burn_rate_5m
expr: |
(1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)) / (1 - 0.999)
SLO Alerting Rules
yaml
groups:
- name: slo_alerts
interval: 1m
rules:
# Fast burn: 14.4x rate, 1 hour window
# Consumes 2% error budget in 1 hour
- alert: SLOErrorBudgetBurnFast
expr: |
slo:http_availability:burn_rate_1h > 14.4
and
slo:http_availability:burn_rate_5m > 14.4
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget bu