incident-response

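Build an incident response pipeline: triage, investigate, track, and review incidents through structured workflows, and dispatch an SRE engineer for analysis.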

SKILL.md
--- frontmatter
name: incident-response
description: Incident response pipeline — triage, investigate, track, and review incidents using structured workflows. Dispatches sre-engineer for analysis.
argument-hint: "<severity> <description>"

Incident Response

You are the engineering-manager. Coordinate incident response through structured triage, investigation, and post-incident review.

Trigger

  • /incident <severity> <description> (e.g., /incident P0 API returning 503 errors)
  • GitHub issue with label incident

Severity levels: P0 (critical), P1 (major), P2 (minor)

Step 1: Parse Input

Extract severity and description from $ARGUMENTS:

bash
# First word is the severity; everything after it is the description (empty if none was given).
read -r SEVERITY DESCRIPTION <<< "$ARGUMENTS"

Validate severity is one of: P0, P1, P2. If invalid, print:

code
Error: Invalid severity. Use P0 (critical), P1 (major), or P2 (minor).
Example: /incident P0 API returning 503 errors
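
The validation rule above could be sketched as a small shell function. The name `validate_severity` is hypothetical (it is not part of `lib/incident.sh`), and writing the error to stderr and treating severities as case-sensitive are both assumptions:

```shell
# Hypothetical helper: succeed only when the severity is P0, P1, or P2.
validate_severity() {
  case "$1" in
    P0|P1|P2)
      return 0
      ;;
    *)
      # Same message as the error text above, sent to stderr (an assumption).
      echo "Error: Invalid severity. Use P0 (critical), P1 (major), or P2 (minor)." >&2
      echo "Example: /incident P0 API returning 503 errors" >&2
      return 1
      ;;
  esac
}

# Usage:
#   validate_severity "$SEVERITY" || exit 1
```

A `case` statement keeps the accepted values in one place, so adding a severity level later means touching a single pattern.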

Step 2: Create Incident Record

bash
source lib/incident.sh
source lib/notify.sh

INCIDENT_ID=$(incident_create "$SEVERITY" "$DESCRIPTION" "$DESCRIPTION")
echo "Created incident: $INCIDENT_ID"

Step 3: Create Tracking Issue

bash
gh issue create \
  --title "INCIDENT ${INCIDENT_ID}: [${SEVERITY}] ${DESCRIPTION}" \
  --label "incident,${SEVERITY}" \
  --body "$(cat <<EOF
## Incident Report

**ID:** ${INCIDENT_ID}
**Severity:** ${SEVERITY}
**Status:** Open
**Reported:** $(date -u +"%Y-%m-%dT%H:%M:%SZ")

## Description

${DESCRIPTION}

## Impact Assessment

_Pending triage by SRE engineer._

## Timeline

| Time | Event |
|------|-------|
| $(date -u +"%H:%M UTC") | Incident reported |

## Investigation

_SRE engineer dispatched for triage._

## Resolution

_Pending._

---
*Created by Cardo incident response pipeline*
EOF
)"
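
The summary in Step 7 refers to an `ISSUE_NUMBER`, which the command above does not capture. On success, `gh issue create` prints the URL of the new issue to stdout, so one possible approach is to keep that output and take the number from its last path segment. The helper name `issue_number_from_url` is hypothetical:

```shell
# Hypothetical helper: extract the trailing issue number from an issue URL,
# such as the one `gh issue create` prints on success.
issue_number_from_url() {
  # Strip everything up to and including the last "/".
  printf '%s\n' "${1##*/}"
}

# Usage, assuming ISSUE_URL holds the stdout of the `gh issue create` call above:
#   ISSUE_NUMBER=$(issue_number_from_url "$ISSUE_URL")
```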

Step 4: Dispatch SRE for Triage

Spawn the sre-engineer agent for triage:

You are the sre-engineer. Triage this incident:

ID: ${INCIDENT_ID}
Severity: ${SEVERITY}
Description: ${DESCRIPTION}

Follow your triage workflow:

  1. Assess user impact
  2. Validate severity classification
  3. Identify blast radius
  4. Determine if rollback is needed (decision within 15 minutes for P0/P1)
  5. Report findings

Update the incident record:

bash
source lib/incident.sh
incident_update "${INCIDENT_ID}" "investigating" "SRE triage: <your findings>"

Step 5: Coordinate Investigation

Based on SRE triage findings:

| Severity | Actions |
|----------|---------|
| P0 | Notify all stakeholders, coordinate immediate response, consider rollback |
| P1 | Notify team, schedule investigation, track SLO impact |
| P2 | Log for sprint review, assign to appropriate engineer |

Update status as investigation progresses:

bash
source lib/incident.sh
incident_update "${INCIDENT_ID}" "investigating" "Investigation notes here"
incident_update "${INCIDENT_ID}" "mitigated" "Mitigation applied: <details>"
incident_update "${INCIDENT_ID}" "resolved" "Root cause fixed in PR #NNN"

Step 6: Post-Incident Review

For P0 and P1 incidents, generate a post-incident review:

bash
source lib/incident.sh
POSTMORTEM_FILE=$(incident_postmortem "${INCIDENT_ID}")
echo "Post-incident review template: $POSTMORTEM_FILE"

Dispatch sre-engineer to fill in the review:

Complete the post-incident review at ${POSTMORTEM_FILE}. Follow the blameless review template in your agent definition. Focus on systemic causes, not individual actions.
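
Since the review applies only to P0 and P1 incidents, the postmortem step can be guarded by a severity check. This is a minimal sketch; `needs_postmortem` is a hypothetical name and not something `lib/incident.sh` is known to provide:

```shell
# Hypothetical helper: succeed only for severities that require a
# post-incident review (P0 and P1).
needs_postmortem() {
  case "$1" in
    P0|P1) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage, assuming lib/incident.sh has been sourced as in the step above:
#   if needs_postmortem "$SEVERITY"; then
#     POSTMORTEM_FILE=$(incident_postmortem "$INCIDENT_ID")
#   fi
```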

Step 7: Summary

code
Incident Response: ${INCIDENT_ID}
===================================
Severity:    ${SEVERITY}
Status:      ${STATUS}
Issue:       #${ISSUE_NUMBER}
Timeline:    ${EVENT_COUNT} events logged

${if RESOLVED:}
Duration:    ${DURATION}
Postmortem:  ${POSTMORTEM_FILE}
${endif}

Next steps:
- Review timeline: cardo incident timeline ${INCIDENT_ID}
- Update status: cardo incident update ${INCIDENT_ID} <status> [notes]
- Generate postmortem: cardo incident postmortem ${INCIDENT_ID}
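
The `${if RESOLVED:}` markers in the template are pseudo-syntax rather than shell. One way to render the header and the conditional block is sketched below. The function name `print_summary`, its argument order, and the use of the status string `resolved` as the condition are assumptions, and the "Next steps" lines are omitted for brevity:

```shell
# Hypothetical renderer for the summary header. Arguments, in order:
#   incident id, severity, status, issue number, event count,
#   duration (optional), postmortem file (optional)
print_summary() {
  local incident_id="$1"
  local severity="$2"
  local status="$3"
  local issue_number="$4"
  local event_count="$5"
  local duration="${6:-}"
  local postmortem_file="${7:-}"

  echo "Incident Response: ${incident_id}"
  echo "==================================="
  echo "Severity:    ${severity}"
  echo "Status:      ${status}"
  echo "Issue:       #${issue_number}"
  echo "Timeline:    ${event_count} events logged"

  # Duration and postmortem are shown only once the incident is resolved.
  if [ "$status" = "resolved" ]; then
    echo
    echo "Duration:    ${duration}"
    echo "Postmortem:  ${postmortem_file}"
  fi
}
```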

Edge Cases

  • Duplicate incident: Search existing open incidents before creating a new one. If a similar incident exists, ask the user whether to merge into it or create a new record.
  • Severity escalation: If SRE triage recommends a higher severity, update the record and re-notify.
  • No SRE agent available: Proceed with manual triage steps and flag for human follow-up.
  • Resolved before triage: If the issue self-heals, still complete the record for historical tracking.
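
For the duplicate check, one option is to search the open tracking issues that carry the `incident` label. The sketch below assumes an authenticated GitHub CLI and that every tracking issue uses the label from Step 3. The function name `find_open_incidents` is hypothetical, and what counts as "similar" is left to the search keywords the caller passes in:

```shell
# Hypothetical helper: list open issues labelled "incident" that match the
# given search keywords, one per line as "#<number> <title>".
find_open_incidents() {
  gh issue list \
    --label incident \
    --state open \
    --search "$1" \
    --json number,title \
    --jq '.[] | "#\(.number) \(.title)"'
}

# Usage:
#   matches=$(find_open_incidents "503 errors")
#   if [ -n "$matches" ]; then
#     echo "Possible duplicates:"
#     echo "$matches"
#   fi
```

Empty output means no open incident matched, so the pipeline can proceed to create a new record without prompting.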