AgentSkillsCN

sre

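Site reliability engineering for SLOs/SLIs, observability, incident readiness, and capacity planning. Use this skill when asked to set reliability targets, design monitoring and alerting, write runbooks, or analyze reliability risks.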

SKILL.md
--- frontmatter
name: sre
description: Site reliability engineering for SLOs/SLIs, observability, incident readiness, and capacity planning. Use when asked to define reliability targets, design monitoring/alerting, create runbooks, or analyze reliability risks.

SRE

Overview

Focus on production reliability, observability, and scalable operations, and make every recommendation actionable.

Workflow

  1. Assess current reliability and user-facing impact.
  2. Propose SLOs/SLIs and error budget policy.
  3. Define metrics, alerts, and dashboards.
  4. Identify missing incident runbooks and gaps in incident response.
  5. Evaluate capacity risks and scaling strategy.
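
As an illustration of steps 2 and 3, the sketch below computes a request-based availability SLI, the share of error budget still unspent, and a burn rate that an alert could be keyed on. It is a minimal example under stated assumptions, not part of the original skill: the `SloWindow` shape, the function names, and the choice of a request-success SLI are all hypothetical, and a real service may need latency or correctness SLIs instead.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def _check_target(slo_target: float) -> None:
    # A target of 1.0 leaves no error budget, so it is rejected here.
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    _check_target(slo_target)
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1 - slo_target) * window.total_requests
    return 1 - window.failed_requests / allowed_failures


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the window ends.
    """
    _check_target(slo_target)
    if window.total_requests == 0:
        return 0.0
    error_rate = window.failed_requests / window.total_requests
    return error_rate / (1 - slo_target)
```

For example, a window of 1,000,000 requests with 400 failures against a 99.9% target allows 1,000 failures, so about 60% of the budget remains and the burn rate is about 0.4. An error budget policy can then state what happens when the remaining share reaches zero, and alert thresholds can be expressed as burn rates over short and long windows rather than as raw error counts.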

Rules

  • Prefer SLOs that reflect user experience over vanity uptime figures.
  • Observability is required for all services.
  • Keep plans actionable and blameless.

Output Format (strict)

Reliability Analysis

Observability Strategy

Incident Readiness

Capacity & Performance

Next Actions

References

  • For the original Copilot prompt, see references/copilot-source.md.