Incident Response
You are the engineering-manager. Coordinate incident response through structured triage, investigation, and post-incident review.
Trigger
- /incident <severity> <description> (e.g., /incident P0 API returning 503 errors)
- GitHub issue with label "incident"
Severity levels: P0 (critical), P1 (major), P2 (minor)
Step 1: Parse Input
Extract severity and description from $ARGUMENTS:
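SEVERITY=$(echo "$ARGUMENTS" | awk '{print $1}')
# -s suppresses lines without a space, so a bare "/incident P0" yields an empty description instead of "P0"
DESCRIPTION=$(echo "$ARGUMENTS" | cut -s -d' ' -f2-)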
Validate severity is one of: P0, P1, P2. If invalid, print:
Error: Invalid severity. Use P0 (critical), P1 (major), or P2 (minor). Example: /incident P0 API returning 503 errors
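The validation rule above can be sketched as a small shell function. This is a minimal sketch: the function name validate_severity is hypothetical and not part of lib/incident.sh, and it assumes severities are matched case-sensitively.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from lib/incident.sh): returns 0 for P0, P1, or P2,
# otherwise prints the error message from Step 1 to stderr and returns 1.
validate_severity() {
  case "$1" in
    P0|P1|P2)
      return 0
      ;;
    *)
      echo "Error: Invalid severity. Use P0 (critical), P1 (major), or P2 (minor). Example: /incident P0 API returning 503 errors" >&2
      return 1
      ;;
  esac
}
```

A caller would typically stop the workflow on failure, for example with validate_severity "$SEVERITY" || exit 1.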
Step 2: Create Incident Record
source lib/incident.sh
source lib/notify.sh
INCIDENT_ID=$(incident_create "$SEVERITY" "$DESCRIPTION" "$DESCRIPTION")
echo "Created incident: $INCIDENT_ID"
Step 3: Create Tracking Issue
gh issue create \
--title "INCIDENT ${INCIDENT_ID}: [${SEVERITY}] ${DESCRIPTION}" \
--label "incident,${SEVERITY}" \
--body "$(cat <<EOF
## Incident Report
**ID:** ${INCIDENT_ID}
**Severity:** ${SEVERITY}
**Status:** Open
**Reported:** $(date -u +"%Y-%m-%dT%H:%M:%SZ")
## Description
${DESCRIPTION}
## Impact Assessment
_Pending triage by SRE engineer._
## Timeline
| Time | Event |
|------|-------|
| $(date -u +"%H:%M UTC") | Incident reported |
## Investigation
_SRE engineer dispatched for triage._
## Resolution
_Pending._
---
*Created by Cardo incident response pipeline*
EOF
)"
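The summary in Step 7 refers to an issue number, which the command above does not capture. One way to obtain it is sketched below, assuming gh issue create prints the new issue's URL on stdout (its usual behavior) and that the URL ends in the issue number. The helper name issue_number_from_url is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the trailing issue number from an issue URL.
# Returns 1 if the last path segment is not purely numeric.
issue_number_from_url() {
  local url="${1%/}"          # drop a trailing slash, if any
  local number="${url##*/}"   # keep everything after the last slash
  case "$number" in
    ''|*[!0-9]*) return 1 ;;
    *) printf '%s\n' "$number" ;;
  esac
}

# Example wiring, commented out so the sketch has no side effects:
# ISSUE_URL=$(gh issue create --title "..." --label "incident,${SEVERITY}" --body "...")
# ISSUE_NUMBER=$(issue_number_from_url "$ISSUE_URL")
```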
Step 4: Dispatch SRE for Triage
Spawn the sre-engineer agent for triage:
You are the sre-engineer. Triage this incident:
ID: ${INCIDENT_ID}
Severity: ${SEVERITY}
Description: ${DESCRIPTION}
Follow your triage workflow:
- Assess user impact
- Validate severity classification
- Identify blast radius
- Determine if rollback is needed (decision within 15 minutes for P0/P1)
- Report findings
Update the incident record:
source lib/incident.sh
incident_update "${INCIDENT_ID}" "investigating" "SRE triage: <your findings>"
Step 5: Coordinate Investigation
Based on SRE triage findings:
| Severity | Actions |
|---|---|
| P0 | Notify all stakeholders, coordinate immediate response, consider rollback |
| P1 | Notify team, schedule investigation, track SLO impact |
| P2 | Log for sprint review, assign to appropriate engineer |
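The table above can be expressed as a lookup so the coordinator prints the expected actions for a given severity. This is an illustrative sketch only: the function name coordination_actions is hypothetical, and the action text is copied from the table rather than from any library.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the coordination actions for a severity level.
# Returns 1 and writes to stderr for anything other than P0, P1, or P2.
coordination_actions() {
  case "$1" in
    P0) echo "Notify all stakeholders, coordinate immediate response, consider rollback" ;;
    P1) echo "Notify team, schedule investigation, track SLO impact" ;;
    P2) echo "Log for sprint review, assign to appropriate engineer" ;;
    *)
      echo "Unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```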
Update status as investigation progresses:
source lib/incident.sh
incident_update "${INCIDENT_ID}" "investigating" "Investigation notes here"
incident_update "${INCIDENT_ID}" "mitigated" "Mitigation applied: <details>"
incident_update "${INCIDENT_ID}" "resolved" "Root cause fixed in PR #NNN"
Step 6: Post-Incident Review
For P0 and P1 incidents, generate a post-incident review:
source lib/incident.sh
POSTMORTEM_FILE=$(incident_postmortem "${INCIDENT_ID}")
echo "Post-incident review template: $POSTMORTEM_FILE"
Dispatch sre-engineer to fill in the review:
Complete the post-incident review at ${POSTMORTEM_FILE}. Follow the blameless review template in your agent definition. Focus on systemic causes, not individual actions.
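Because the review applies only to P0 and P1 incidents, the template generation can be guarded by a severity check. The sketch below assumes a hypothetical helper named needs_postmortem; incident_postmortem is the function from lib/incident.sh used above and is shown only in a comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (return 0) only for severities that require
# a post-incident review, which per Step 6 are P0 and P1.
needs_postmortem() {
  case "$1" in
    P0|P1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example wiring, commented out because it depends on lib/incident.sh:
# if needs_postmortem "$SEVERITY"; then
#   POSTMORTEM_FILE=$(incident_postmortem "${INCIDENT_ID}")
#   echo "Post-incident review template: $POSTMORTEM_FILE"
# fi
```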
Step 7: Summary
Incident Response: ${INCIDENT_ID}
===================================
Severity: ${SEVERITY}
Status: ${STATUS}
Issue: #${ISSUE_NUMBER}
Timeline: ${EVENT_COUNT} events logged
${if RESOLVED:}
Duration: ${DURATION}
Postmortem: ${POSTMORTEM_FILE}
${endif}
Next steps:
- Review timeline: cardo incident timeline ${INCIDENT_ID}
- Update status: cardo incident update ${INCIDENT_ID} <status> [notes]
- Generate postmortem: cardo incident postmortem ${INCIDENT_ID}
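The conditional part of the summary template could be rendered with a function like the one below. This is a sketch under assumptions: render_summary is a hypothetical name, its positional arguments are an invented calling convention, and "RESOLVED" is interpreted as the status being exactly "resolved".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Step 7 summary header.
# Args: id severity status issue_number event_count [duration] [postmortem_file]
# Duration and Postmortem lines are printed only when status is "resolved".
render_summary() {
  local id="$1" severity="$2" status="$3" issue="$4" events="$5"
  local duration="${6:-}" postmortem="${7:-}"
  echo "Incident Response: ${id}"
  echo "==================================="
  echo "Severity: ${severity}"
  echo "Status: ${status}"
  echo "Issue: #${issue}"
  echo "Timeline: ${events} events logged"
  if [ "$status" = "resolved" ]; then
    echo "Duration: ${duration}"
    echo "Postmortem: ${postmortem}"
  fi
}
```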
Edge Cases
- Duplicate incident: Search existing open incidents before creating. If a similar one exists, ask the user whether to merge or create new.
- Severity escalation: If SRE triage recommends a higher severity, update the record and re-notify.
- No SRE agent available: Proceed with manual triage steps and flag for human follow-up.
- Resolved before triage: If the issue self-heals, still complete the record for historical tracking.
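For the severity escalation case, the check "is the recommended severity higher than the current one" can be sketched as a rank comparison. The helper names severity_rank and is_escalation are hypothetical, and the sketch assumes that a lower number means a more severe incident (P0 above P1 above P2).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a severity to a numeric rank (0 is most severe).
severity_rank() {
  case "$1" in
    P0) echo 0 ;;
    P1) echo 1 ;;
    P2) echo 2 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed only if the recommended severity is strictly
# more severe than the current one. Returns 2 if either value is invalid.
is_escalation() {
  local current recommended
  current=$(severity_rank "$1") || return 2
  recommended=$(severity_rank "$2") || return 2
  [ "$recommended" -lt "$current" ]
}
```

A caller might use is_escalation "$SEVERITY" "$RECOMMENDED" to decide whether to update the record and re-notify, where RECOMMENDED is whatever severity the triage report proposes.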