operations

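Use this skill when the user asks to "review for production", "check production readiness", "evaluate resilience", "assess observability", "review ops", or "run chaos experiments", or discusses deployment, monitoring, incident response, failure modes, or chaos engineering.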

SKILL.md
--- frontmatter
name: operations
description: This skill should be used when the user asks to "review for production", "check production readiness", "evaluate resilience", "assess observability", "review ops", "run chaos experiments", or discusses deployment, monitoring, incident response, failure modes, or chaos engineering.

Operations

Production readiness evaluation focused on resilience, observability, and incident response.

Resilience

Failure Modes

  • What can fail? List all external dependencies
  • Blast radius: If X fails, what else breaks?
  • Graceful degradation: Does a partial failure stay partial instead of becoming a total failure?

Patterns

| Pattern | Purpose | Check |
|---|---|---|
| Timeouts | Prevent hung connections | Every external call has one? |
| Circuit Breaker | Stop cascading failures | On critical paths? |
| Bulkhead | Isolate failures | Separate thread pools? |
| Retry | Handle transient failures | With backoff? Bounded? |
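
As a rough illustration of the Circuit Breaker and Retry rows, the sketch below shows one possible shape for each. The names `CircuitBreaker`, `CircuitOpenError`, and `retry_with_backoff`, and all default thresholds, are hypothetical and chosen for illustration only; a production service would more likely rely on a maintained resilience library.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then permits a
    trial call once `reset_after` seconds have passed (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and let a trial call run.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry `fn` at most `attempts` times with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise  # bounded: give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller might combine the two as `retry_with_backoff(lambda: breaker.call(fetch_profile, user_id))`, where `fetch_profile` stands in for any external call that already carries its own timeout.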

Observability

The RED Method

| Metric | What | Why |
|---|---|---|
| Rate | Requests per second | Traffic understanding |
| Errors | Failed requests | Problem detection |
| Duration | Latency distribution | Performance tracking |
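
To make the three RED signals concrete, here is a minimal in-memory sketch. The `RedMetrics` class and its method names are hypothetical, it is not thread-safe, and it keeps every sample in memory; a real service would normally export counters and histograms through its metrics library instead.

```python
import time
from collections import defaultdict


class RedMetrics:
    """Illustrative RED recorder keyed by endpoint name."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total requests seen
        self.errors = defaultdict(int)      # Errors: failed requests
        self.durations = defaultdict(list)  # Duration: raw latency samples

    def observe(self, endpoint, duration, ok):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration)

    def timed(self, endpoint, fn):
        """Run `fn`, recording its latency and whether it raised."""
        start = time.perf_counter()
        try:
            result = fn()
        except Exception:
            self.observe(endpoint, time.perf_counter() - start, ok=False)
            raise
        self.observe(endpoint, time.perf_counter() - start, ok=True)
        return result

    def summary(self, endpoint, window_seconds):
        samples = sorted(self.durations[endpoint])
        total = self.requests[endpoint]

        def percentile(p):
            # Nearest-rank percentile; p is an integer from 1 to 100.
            if not samples:
                return None
            rank = max(1, (p * len(samples) + 99) // 100)
            return samples[rank - 1]

        return {
            "rate_per_s": total / window_seconds,
            "error_ratio": self.errors[endpoint] / total if total else 0.0,
            "p50": percentile(50),
            "p99": percentile(99),
        }
```

Reporting percentiles rather than a mean keeps the "latency distribution" visible: a healthy average can hide a slow tail.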

Logging

  • Structured (JSON, not free text)
  • Correlation IDs across services
  • Appropriate levels (not everything is ERROR)
  • PII redaction
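
The logging points above can be sketched with the standard library alone. The example assumes the correlation ID lives in a `contextvars.ContextVar` and that email addresses are the only PII to redact; both are simplifications, and `JsonFormatter` and `build_logger` are hypothetical names.

```python
import contextvars
import json
import logging
import re

# Set once per request (for example from an incoming header) and read by the formatter.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Deliberately simple email pattern; real PII redaction needs a broader rule set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record instead of free text."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": EMAIL.sub("[REDACTED]", record.getMessage()),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With `correlation_id.set("req-123")` at the start of a request, every line logged while handling it carries the same ID, which is what lets logs from different services be joined later.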

Tracing

  • Distributed tracing enabled?
  • Spans for all external calls?
  • Context propagation working?
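
One way to check context propagation is to confirm that each outgoing call reuses the incoming trace ID with a fresh span ID. The sketch below handles the W3C `traceparent` header (version `00`) by hand, assuming headers arrive as a dict with lowercase keys; `extract` and `inject` are hypothetical helper names, and a tracing SDK would normally do this work.

```python
import re
import secrets

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers):
    """Return (trace_id, parent_span_id, flags), or None if absent or malformed."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def inject(incoming_headers):
    """Build headers for an outgoing call: same trace ID, new span ID."""
    context = extract(incoming_headers)
    if context is None:
        trace_id, flags = secrets.token_hex(16), "01"  # start a new, sampled trace
    else:
        trace_id, _, flags = context
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

If a downstream service logs a trace ID different from the one its caller sent, propagation is broken somewhere between the two.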

Capacity

  • Scaling: Horizontal preferred, auto-scaling configured?
  • Limits: Memory, CPU, connections all bounded?
  • Backpressure: What happens at 2x load? 10x?
  • Rate Limiting: Per-tenant/client quotas?
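
Per-tenant quotas are often implemented as one token bucket per tenant. The following is a minimal single-process sketch under that assumption; `TokenBucket`, `TenantLimiter`, and the rate and capacity values are hypothetical, and a multi-instance deployment would need shared state.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity` (the burst size)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429


class TenantLimiter:
    """Keeps an independent bucket per tenant so one tenant cannot starve others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, tenant):
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = self.buckets[tenant] = TokenBucket(self.rate, self.capacity, self.clock)
        return bucket.allow()
```

Rejecting early at the limiter is also a simple form of backpressure: excess load is shed at the edge instead of queuing inside the service.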

Security Posture

  • Secrets: In vault, not env vars or code
  • Network: Least privilege, mTLS where possible
  • Dependencies: Vulnerability scanning in CI
  • Access: Audit logging for sensitive operations

Incident Readiness

  • Runbooks: Documented recovery procedures
  • On-call: Rotation defined, escalation clear
  • Rollback: One-click, tested regularly
  • Communication: Status page, stakeholder notification

Checklist

□ All external calls have timeouts
□ Circuit breakers on critical paths
□ Structured logging with correlation IDs
□ RED metrics exposed
□ Alerts are actionable (not noisy)
□ Auto-scaling configured with limits
□ Graceful shutdown implemented
□ Health checks (liveness + readiness)
□ Secrets in vault
□ Runbook exists
□ Rollback tested
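
The "Graceful shutdown" and "Health checks" items tend to interact, so a small sketch may help. It assumes a hypothetical `Lifecycle` object that request handlers and probe endpoints consult; the status codes follow the common convention of 200 for healthy and 503 for not ready.

```python
import signal
import threading


class Lifecycle:
    """Tracks readiness, shutdown state, and in-flight requests."""

    def __init__(self):
        self._idle = threading.Condition()
        self._in_flight = 0
        self._ready = False
        self._stopping = False

    def mark_ready(self):
        self._ready = True  # call once dependencies and caches are warm

    def liveness(self):
        return 200  # the process is up and able to answer

    def readiness(self):
        # Not ready before warm-up, and not ready once shutdown has begun,
        # so the load balancer stops routing new traffic here.
        return 200 if self._ready and not self._stopping else 503

    def begin_request(self):
        with self._idle:
            if self._stopping:
                return False  # refuse new work during shutdown
            self._in_flight += 1
            return True

    def end_request(self):
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def request_shutdown(self, signum=None, frame=None):
        self._stopping = True  # only sets a flag, so it is safe as a signal handler

    def drain(self, timeout):
        """Wait until in-flight requests finish; True if drained in time."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)


def install(lifecycle):
    """Register the SIGTERM handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, lifecycle.request_shutdown)
```

On SIGTERM the service flips readiness to 503, lets in-flight requests finish via `drain(timeout)`, and only then exits; the timeout should stay below the orchestrator's termination grace period.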

Guild Members for Operations

Primary: Taleb (resilience), Erlang (capacity), Vector (security)
Secondary: Lamport (distributed failure), Ixian (metrics/validation)

Additional Resources

  • references/chaos-patterns.md — Chaos engineering patterns and failure injection