SLO Implementation
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Do not use this skill when
- The task is unrelated to SLO implementation
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
Use this skill when
- Defining service reliability targets
- Measuring user-perceived reliability
- Implementing error budgets
- Creating SLO-based alerts
- Tracking reliability goals
SLI/SLO/SLA Hierarchy
code
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
Defining SLIs
Common SLI Types
1. Availability SLI
promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
2. Latency SLI
promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
3. Durability SLI
code
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
Reference: See references/slo-definitions.md
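All three SLIs above share one shape: good events divided by total events. A minimal Python sketch of that ratio follows. The function name `ratio_sli`, the event counts, and the choice to report 1.0 when there is no traffic are assumptions made for illustration, not part of the original definitions.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return good/total as a fraction between 0 and 1.

    With zero traffic the ratio is undefined. This sketch reports 1.0
    (nothing failed), which is an assumption rather than a rule.
    """
    if total_events <= 0:
        return 1.0
    return good_events / total_events


# Hypothetical counts collected over a 28-day window.
availability = ratio_sli(good_events=999_500, total_events=1_000_000)
latency = ratio_sli(good_events=992_000, total_events=1_000_000)
durability = ratio_sli(good_events=50_000, total_events=50_000)

print(f"availability SLI: {availability:.4%}")
print(f"latency SLI:      {latency:.4%}")
print(f"durability SLI:   {durability:.4%}")
```

Expressing every SLI as a good/total ratio keeps the later error budget arithmetic identical across availability, latency, and durability.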
Setting SLO Targets
Availability SLO Examples
| SLO % | Allowed downtime per 30-day month | Allowed downtime per year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
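The figures in the table can be reproduced by multiplying the period length by `1 - SLO`. The sketch below assumes a 30-day month and a 365-day year, which is what the table's numbers imply. The function name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Longest full outage that still meets the SLO over the period."""
    error_budget = 1 - slo_percent / 100
    return period * error_budget


MONTH = timedelta(days=30)   # assumption: a month is 30 days
YEAR = timedelta(days=365)   # assumption: non-leap year

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(slo, MONTH)
    per_year = allowed_downtime(slo, YEAR)
    print(
        f"{slo}%: {per_month.total_seconds() / 60:.2f} min/month, "
        f"{per_year.total_seconds() / 3600:.2f} h/year"
    )
```

With a 28-day window, as used by the SLO definitions in this document, the monthly figures shrink proportionally (for example, 99.9% allows about 40.3 minutes).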
Choose Appropriate SLOs
Consider:
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
Example SLOs:
yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  # 99% of requests complete within 500 ms (a p99 target, not p95)
  - name: api_latency_p99
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
Error Budget Calculation
Error Budget Formula
code
Error Budget = 1 - SLO Target
Example:
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes of downtime per 30-day month
- Current Error: 0.05% = 21.6 minutes of that month already spent
- Remaining Budget: 50% (21.6 of 43.2 minutes)
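The worked example above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the SLO and error ratio are expressed as fractions (0.999 and 0.0005) rather than percentages.

```python
def error_budget(slo_target: float) -> float:
    """Error Budget = 1 - SLO Target, with the target given as a fraction."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1 - slo_target


def budget_remaining(slo_target: float, observed_error_ratio: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target)
    return (budget - observed_error_ratio) / budget


remaining = budget_remaining(slo_target=0.999, observed_error_ratio=0.0005)
print(f"remaining budget: {remaining:.0%}")
```

A 100% target is rejected here because it leaves no budget to divide by, which matches the advice not to aim for 100%.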
Error Budget Policy
yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
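One way to turn the policy above into an automated check is a threshold lookup. The policy does not say how values between thresholds are handled, so this sketch assumes an action applies once the remaining budget is at or below its threshold and above the next lower one. The names `POLICY` and `policy_action` are hypothetical.

```python
# (remaining budget threshold in percent, action), copied from the policy above.
POLICY = [
    (100.0, "Normal development velocity"),
    (50.0, "Consider postponing risky changes"),
    (10.0, "Freeze non-critical changes"),
    (0.0, "Feature freeze, focus on reliability"),
]


def policy_action(remaining_percent: float) -> str:
    """Return the action for the lowest threshold that is >= the remaining budget."""
    for threshold, action in sorted(POLICY):  # ascending: 0, 10, 50, 100
        if remaining_percent <= threshold:
            return action
    # Above 100% should not happen; fall back to the most permissive action.
    return POLICY[0][1]


for remaining in (65, 50, 5, -3):
    print(f"{remaining:>4}% remaining -> {policy_action(remaining)}")
```

Under this reading, 65% remaining still allows normal velocity, and any overspent (negative) budget maps to the feature freeze.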
SLO Implementation
Prometheus Recording Rules
yaml
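# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))
      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999
      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99
      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100
      # Error budget burn rate, one rule per window referenced by the alerts
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)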
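The burn rate expression divides the error ratio observed in a window by the error budget. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it faster. The Python sketch below mirrors that arithmetic for checking numbers by hand. The function name and the request counts are hypothetical, and returning 0.0 for a window with no traffic is an assumption.

```python
def burn_rate(good: float, total: float, slo_target: float = 0.999) -> float:
    """Error ratio in the window divided by the error budget (1 - SLO)."""
    if total <= 0:
        return 0.0  # assumption: no traffic burns no budget
    error_ratio = 1 - good / total
    return error_ratio / (1 - slo_target)


# Hypothetical 5-minute window: 10,000 requests, 144 of them failed.
print(f"burn rate: {burn_rate(good=9_856, total=10_000):.1f}x")

# Failing exactly at the budgeted rate gives a burn rate of about 1.
print(f"burn rate: {burn_rate(good=9_990, total=10_000):.1f}x")
```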
SLO Alerting Rules
yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"
      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
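The "2% in 1 hour" and "5% in 6 hours" figures in the rule comments follow from `burn rate × alert window / SLO period`. The sketch below reproduces them, assuming a 30-day SLO period, which is the period those percentages imply. The function name `budget_consumed` is hypothetical.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    period_days: float = 30) -> float:
    """Fraction of the error budget spent by a constant burn over the window."""
    return burn_rate * window_hours / (period_days * 24)


print(f"fast burn: {budget_consumed(14.4, 1):.1%} of the budget in 1 hour")
print(f"slow burn: {budget_consumed(6, 6):.1%} of the budget in 6 hours")

# With the 28-day window used elsewhere in this document the share is slightly higher.
print(f"fast burn, 28d period: {budget_consumed(14.4, 1, period_days=28):.2%}")
```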
SLO Dashboard
Grafana Dashboard Structure:
code
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
Example Queries:
promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999)
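The "days until exhausted" query multiplies the remaining budget fraction by the window length and divides by the average burn rate over that window. A Python sketch of the same arithmetic, useful for sanity-checking dashboard values, follows. The function name and inputs are hypothetical. The guards for a zero error ratio and an already exhausted budget are additions that the PromQL expression does not have.

```python
def days_until_exhausted(remaining_percent: float, sli_ratio: float,
                         slo_target: float = 0.999,
                         window_days: float = 28) -> float:
    """Days left before the budget runs out at the window's average burn rate."""
    if remaining_percent <= 0:
        return 0.0  # budget already spent (the PromQL query would go negative)
    error_ratio = 1 - sli_ratio
    if error_ratio <= 0:
        return float("inf")  # no errors observed, so the budget never runs out
    # Same operand order as the query: remaining * window / error_ratio * budget
    return (remaining_percent / 100) * window_days / error_ratio * (1 - slo_target)


# Hypothetical dashboard reading: 65% budget left, SLI at 99.965%.
print(f"{days_until_exhausted(65, 0.99965):.1f} days")
```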
Multi-Window Burn Rate Alerts
yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
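The rule above fires only when a long window and its paired short window both exceed the threshold. The long window shows the burn is significant and the short window shows it is still happening. The sketch below evaluates the same boolean logic in Python against hypothetical burn-rate readings. `WINDOW_PAIRS` and `should_alert` are illustrative names.

```python
from typing import Mapping

# (long window, short window, burn-rate threshold), matching the rule above.
WINDOW_PAIRS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """True if any pair has both its long and short window above the threshold."""
    return any(
        burn_rates[long_w] > threshold and burn_rates[short_w] > threshold
        for long_w, short_w, threshold in WINDOW_PAIRS
    )


# Hypothetical readings, keyed by window.
brief_spike = {"5m": 20.0, "30m": 7.0, "1h": 3.0, "6h": 1.0}
sustained = {"5m": 16.0, "30m": 15.0, "1h": 15.0, "6h": 4.0}
recovered = {"5m": 0.5, "30m": 2.0, "1h": 15.0, "6h": 7.0}

print("brief spike:", should_alert(brief_spike))
print("sustained:  ", should_alert(sustained))
print("recovered:  ", should_alert(recovered))
```

The brief spike and the recovered incident stay silent because one window in each pair is below the threshold, which is how the pairing reduces false positives.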
SLO Review Process
Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
Best Practices
- Start with user-facing services
- Use multiple SLIs (availability, latency, etc.)
- Set achievable SLOs (don't aim for 100%)
- Implement multi-window alerts to reduce noise
- Track error budget consistently
- Review SLOs regularly
- Document SLO decisions
- Align with business goals
- Automate SLO reporting
- Use SLOs for prioritization
Reference Files
- assets/slo-template.md - SLO definition template
- references/slo-definitions.md - SLO definition patterns
- references/error-budget.md - Error budget calculations
Related Skills
- prometheus-configuration - For metric collection
- grafana-dashboards - For SLO visualization