Site Reliability Engineering (SRE)

Goal

Treat operations as a software problem. Quantify reliability so we know exactly when to freeze deployments (reliability at risk) and when to push fast (error budget available).

When to Use

•When defining "Is it stable enough?" criteria.
•After a production outage (Post-Mortem).
•When planning on-call rotations.

Instructions

1. Define SLIs (Service Level Indicators)

What is "good"?

•Availability: Successful requests / Total requests.
•Latency: Requests faster than 200ms / Total requests.
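
A minimal sketch of how the two ratios above could be computed from request records. The `Request` type, its field names, and the choice to count any non-5xx status as "successful" are assumptions for illustration; use whatever your service actually records and whatever your team defines as a good request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    latency_ms: float  # measured request duration in milliseconds (hypothetical field)


def availability_sli(requests):
    """Successful requests / total requests (assumes non-5xx means success)."""
    if not requests:
        return None  # no traffic: the SLI is undefined rather than 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Requests faster than the threshold / total requests."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 340.0),  # successful but slow
    Request(503, 95.0),   # fast but failed
    Request(404, 80.0),   # client error, counted as served successfully here
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Both SLIs are "good events / total events" ratios, so they can be compared directly against a percentage target.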

2. Set SLOs (Service Level Objectives)

What is the target? (100% is impossible).

•Target: "99.9% of requests in 30 days are successful."
•Window: Rolling 28 or 30 days.
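
As a rough illustration, a request-based SLO can be checked over a rolling window by summing daily good/total counts. The function name `slo_met` and the `daily_counts` mapping (date to a `(good, total)` pair) are hypothetical; in practice these numbers would come from your metrics system.

```python
from datetime import date, timedelta


def slo_met(daily_counts, today, target=0.999, window_days=30):
    """Return True if the good/total ratio over the rolling window meets the target.

    daily_counts maps a date to a (good, total) tuple (assumed input shape).
    Returns None when the window contains no traffic.
    """
    start = today - timedelta(days=window_days - 1)
    good = total = 0
    for day, (g, t) in daily_counts.items():
        if start <= day <= today:  # ignore data outside the window
            good += g
            total += t
    if total == 0:
        return None
    return good / total >= target


today = date(2024, 1, 30)
# 30 days of 10,000 requests with 5 failures each: 99.95% success.
counts = {today - timedelta(days=i): (9_995, 10_000) for i in range(30)}
print(slo_met(counts, today))  # True

# One bad day with 200 failures drags the window to 99.885%.
counts[today] = (9_800, 10_000)
print(slo_met(counts, today))  # False
```

Because the window rolls, the bad day stops counting against the SLO once it is more than `window_days` old.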

3. Manage Error Budgets

(100% - SLO) = Error Budget.

•With a 99.9% SLO the budget is 0.1%: about 43 minutes of full downtime in a 30-day window (0.001 × 43,200 minutes), or 1 failed request in every 1,000.
•Rule: If budget is exhausted -> Code Freeze. Only reliability fixes allowed.
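
The budget arithmetic and the freeze rule can be sketched as follows. The function names and the example request counts are assumptions for illustration, not part of any particular tool.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by a time-based reading of the SLO."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, good, total):
    """Fraction of the request-based error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1 (100% leaves no budget)")
    allowed_failures = (1 - slo) * total
    if allowed_failures == 0:
        raise ValueError("no traffic in the window, so the budget is undefined")
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


def deploys_allowed(slo, good, total):
    """Apply the rule above: freeze feature deploys once the budget is used up."""
    return budget_remaining(slo, good, total) > 0


print(f"{error_budget_minutes(0.999):.1f} minutes")               # 43.2 minutes
print(f"{budget_remaining(0.999, 999_400, 1_000_000):.0%} left")  # 40% left
print(deploys_allowed(0.999, 999_400, 1_000_000))                 # True
print(deploys_allowed(0.999, 998_500, 1_000_000))                 # False
```

In the last call, 1,500 failures against an allowance of 1,000 means the budget is overspent, so only reliability fixes should ship.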

4. Incident Management

When things break:

•Detect: Alert fires.
•Respond: Acknowledge, triage, stabilize (mitigate impact).
•Analyze: Root cause analysis (5 Whys).
•Learn: Create action items to prevent recurrence.

Constraints

✅ Do

•DO: Blameless Post-Mortems. Focus on process failure, not human error.
•DO: Automate runbooks. If you run a command twice, script it.
•DO: Measure what matters to the user (e.g. client-side latency), not just server-side metrics.

❌ Don't

•DON'T: Alert on things nobody can act on immediately. Every alert should be actionable.
•DON'T: Page the whole team. Page the on-call engineer.
•DON'T: Optimize reliability past the SLO (diminishing returns).

Output Format

•SLOs.md: Definitions of SLIs and targets.
•post-mortems/YYYY-MM-DD-incident.md: Incident review records.

Dependencies

•devops/implementing-observability/SKILL.md