AgentSkillsCN

slo-implementation

定义并实施服务级别指标(SLI)与服务级别目标(SLO),结合错误预算与告警机制。在设定可靠性目标、践行 SRE 实践或进行度量时,可加以应用。

SKILL.md
--- frontmatter
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or meas
category: Document Processing
source: antigravity
tags: [api, ai, template, document, rag]
url: https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/slo-implementation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Do not use this skill when

  • The task is unrelated to slo implementation
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

Use this skill when

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

code
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

2. Latency SLI

promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

code
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO %Downtime/MonthDowntime/Year
99%7.2 hours3.65 days
99.9%43.2 minutes8.76 hours
99.95%21.6 minutes4.38 hours
99.99%4.32 minutes52.56 minutes

Choose Appropriate SLOs

Consider:

  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks

Example SLOs:

yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

code
Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability
  • Error Budget: 0.1% = 43.2 minutes/month
  • Current Error: 0.05% = 21.6 minutes/month
  • Remaining Budget: 50%

Error Budget Policy

yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

SLO Alerting Rules

yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget bu