---
name: postmortem-author
description: 'Generate Sunkworks-style post-mortem reports with timeline reconstruction, failure pattern recognition, honest technical assessments, and recovery playbooks. Trigger with /postmortem'
allowed-tools: ['read_file', 'create_file', 'replace_string_in_file', 'run_in_terminal', 'grep_search', 'semantic_search', 'get_terminal_output']
tags: ['sunkworks', 'live-stream', 'failure-tolerant']
---

# Post-Mortem Author

**"Every failure is a future blog post."**

I generate Sunkworks-style post-mortem reports that embrace the honest, educational nature of live troubleshooting. These reports acknowledge what went wrong, document the iterative debugging process, and create actionable recovery playbooks.

**Philosophy**: Unlike corporate blameless post-mortems that often sanitize reality, Sunkworks post-mortems celebrate the messy truth of debugging live systems. We document the wrong turns, the red herrings, and the "oh, it was THAT the whole time" moments.

## Slash Command

### `/postmortem`
Generates a post-mortem report:
1. Collect Kubernetes events from the incident window
2. Analyze Terraform state history for infrastructure changes
3. Identify failure patterns from previous incidents
4. Generate structured report with timeline and recovery steps

**Usage**: Type `/postmortem` after an incident to generate the report.

**Arguments**: 
- `/postmortem <hours-ago>` - Analyze last N hours (default: 4)
- `/postmortem <start-time> <end-time>` - Specific time window
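Both argument forms can be reduced to a single UTC window before any data is collected. The sketch below is one hypothetical way to do that; `resolve_window` is an illustrative name rather than part of the skill's script, and it assumes GNU `date` (the `-d` flag).

```shell
# Hypothetical helper: turn /postmortem arguments into "START END" (UTC, ISO 8601).
# Assumes GNU date (-d); the function name is illustrative only.
resolve_window() {
  # No arguments: fall back to the documented default of 4 hours
  [ "$#" -eq 0 ] && set -- 4

  if [ "$#" -eq 1 ]; then
    # Single argument: hours ago, digits only
    case "$1" in
      ''|*[!0-9]*) echo "hours-ago must be a non-negative integer" >&2; return 1 ;;
    esac
    printf '%s %s\n' \
      "$(date -u -d "$1 hours ago" +%Y-%m-%dT%H:%M:%SZ)" \
      "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  elif [ "$#" -eq 2 ]; then
    # Two arguments: explicit start and end, compared as epoch seconds
    local start_s end_s
    start_s="$(date -u -d "$1" +%s)" || return 1
    end_s="$(date -u -d "$2" +%s)" || return 1
    if [ "$start_s" -gt "$end_s" ]; then
      echo "start-time is after end-time" >&2
      return 1
    fi
    printf '%s %s\n' \
      "$(date -u -d "@$start_s" +%Y-%m-%dT%H:%M:%SZ)" \
      "$(date -u -d "@$end_s" +%Y-%m-%dT%H:%M:%SZ)"
  else
    echo "usage: /postmortem [hours-ago | start-time end-time]" >&2
    return 1
  fi
}
```

Calling `resolve_window 6` would print a window ending now, while `resolve_window 2024-01-01T00:00:00Z 2024-01-01T04:00:00Z` echoes the normalised pair back.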

**Script Verification**: Before executing, verify the script integrity:

```bash
sha256sum .github/skills/postmortem-author/scripts/collect-timeline.sh
# Expected: 9fa5f345e459dd8c05337d0ff2d026f2d3e588c3169c42a01c7ffc0a5c1b0528
```

Execute timeline collection:

```bash
bash .github/skills/postmortem-author/scripts/collect-timeline.sh
```
## When I Activate

- `/postmortem` (slash command)
- "Generate post-mortem"
- "Write incident report"
- "What went wrong"
- "Document the failure"
- "Create recovery playbook"
- "Episode notes"
- "Analyze the outage"
## Expected Failure Modes

### Data Collection Failures

| Failure Mode | Symptoms | Workaround |
|--------------|----------|------------|
| Events pruned | Kubernetes events older than 1hr missing | Check etcd, use Prometheus metrics |
| Terraform state locked | Cannot read state history | Use `terraform force-unlock` or read state file directly |
| Logs rotated | Container logs unavailable | Check persistent log storage, Loki if available |
| Time sync drift | Event timestamps don't correlate | Document clock skew, use relative ordering |

### Analysis Failures

| Failure Mode | Symptoms | Impact / Mitigation |
|--------------|----------|---------------------|
| Incomplete timeline | Gaps in event sequence | May miss root cause |
| False pattern match | Similarity to previous incident misleads | Add "false alarm" to pattern database |
| Missing context | Key decisions undocumented during incident | Interview participants, check stream VOD |
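Gaps in the event sequence can be flagged mechanically once timestamps have been collected. The sketch below assumes one ISO 8601 UTC timestamp per line on stdin, already sorted, and GNU `date`; `find_gaps` and its threshold argument are hypothetical.

```shell
# Hypothetical gap finder: report consecutive timestamps further apart than
# max_gap seconds. Input: sorted ISO 8601 UTC timestamps, one per line.
find_gaps() {
  local max_gap="$1" prev="" prev_ts="" ts epoch
  while IFS= read -r ts; do
    epoch="$(date -u -d "$ts" +%s)" || return 1
    if [ -n "$prev" ] && [ $((epoch - prev)) -gt "$max_gap" ]; then
      printf '%s -> %s (%ss)\n' "$prev_ts" "$ts" "$((epoch - prev))"
    fi
    prev="$epoch"
    prev_ts="$ts"
  done
}
```

A reported gap does not prove that events were pruned; it only marks a stretch of the timeline worth cross-checking against metrics or the stream VOD.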

## Post-Mortem Template

### Standard Sunkworks Format

```markdown
# Post-Mortem: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: [CRITICAL|HIGH|MEDIUM|LOW]
**Episode**: Sunkworks #NN (if applicable)

## TL;DR
One paragraph summary. What broke, how we fixed it, what we learned.

## Timeline

| Time (UTC) | Event | Source |
|------------|-------|--------|
| HH:MM | First symptom observed | Prometheus alert |
| HH:MM | Investigation started | Stream timestamp |
| HH:MM | Root cause identified | kubectl describe |
| HH:MM | Fix applied | git commit SHA |
| HH:MM | Service restored | Flux reconciliation |

## What Went Wrong

### Root Cause
Technical explanation of the failure. Be specific.

### Contributing Factors
- Factor 1: Why this made things worse
- Factor 2: Why detection was delayed
- Factor 3: Why recovery took longer than expected

### Red Herrings
Things we investigated that weren't the problem:
- Thing 1: Why we thought it was this, why it wasn't
- Thing 2: The suspicious timing that was coincidental

## The Debugging Journey

### Attempt 1: [What We Tried]
**Hypothesis**: What we thought was wrong
**Action**: What we did
**Result**: What happened
**Time Spent**: X minutes

### Attempt 2: [What We Tried Next]
**Hypothesis**: Refined theory
**Action**: Different approach
**Result**: Getting warmer / still wrong
**Time Spent**: X minutes

### The Breakthrough
What finally led us to the answer. Often: "Then we noticed..."

## Recovery Steps

1. Step one (include exact commands)
2. Step two
3. Step three
4. Verification that it worked

## Failure Pattern Analysis

### Have We Seen This Before?
- Episode N: Similar symptoms, different cause
- Episode M: Same root cause, different symptoms

### Pattern Category
[NEW PATTERN | KNOWN VARIANT | RECURRENCE]

### Pattern Triggers
What conditions led to this failure:
- Trigger 1
- Trigger 2

## Prevention & Detection

### Immediate Actions (This Week)
- [ ] Action 1: Owner: @name
- [ ] Action 2: Owner: @name

### Long-term Improvements
- [ ] Improvement 1: Target date
- [ ] Improvement 2: Target date

### New Alerts/Monitors
- Alert 1: What it detects
- Alert 2: Earlier warning for this failure mode

## Metrics

- **Time to Detect (TTD)**: X minutes from failure to first alert
- **Time to Understand (TTU)**: X minutes from alert to root cause
- **Time to Recover (TTR)**: X minutes from root cause to resolution
- **Total Downtime**: X hours Y minutes

## The Honest Assessment

### What We Did Well
- Thing 1
- Thing 2

### What We Could Have Done Better
- Thing 1: How it would have helped
- Thing 2: Why we didn't do it

### The Lesson
One key takeaway from this incident.

## Recovery Playbook

For future occurrences of this failure pattern:

\`\`\`bash
# Step 1: Verify the failure mode
command to check

# Step 2: Apply the fix
command to fix

# Step 3: Verify recovery
command to verify
\`\`\`

Estimated recovery time: X minutes (if you follow this playbook)
```
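The template's Metrics section defines TTD, TTU and TTR as differences between four points on the timeline, so they can be derived rather than estimated. A minimal sketch, assuming GNU `date` and four UTC timestamps taken from the timeline; `incident_metrics` is a hypothetical helper name.

```shell
# Hypothetical helper: derive the template's metrics from four UTC timestamps.
# Arguments: failure start, first alert, root cause identified, service restored.
incident_metrics() {
  local failure alert cause restored total
  failure="$(date -u -d "$1" +%s)" || return 1
  alert="$(date -u -d "$2" +%s)" || return 1
  cause="$(date -u -d "$3" +%s)" || return 1
  restored="$(date -u -d "$4" +%s)" || return 1

  # Whole minutes; seconds are truncated by integer division
  total=$(( (restored - failure) / 60 ))
  printf 'TTD: %d minutes\n' $(( (alert - failure) / 60 ))
  printf 'TTU: %d minutes\n' $(( (cause - alert) / 60 ))
  printf 'TTR: %d minutes\n' $(( (restored - cause) / 60 ))
  printf 'Total Downtime: %d hours %d minutes\n' $(( total / 60 )) $(( total % 60 ))
}
```

For example, `incident_metrics 2024-01-01T10:00:00Z 2024-01-01T10:12:00Z 2024-01-01T11:02:00Z 2024-01-01T11:30:00Z` reports a TTD of 12 minutes, a TTU of 50 minutes, a TTR of 28 minutes, and 1 hour 30 minutes of total downtime.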

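## Core Capabilities

### 1. Timeline Reconstruction

```bash
# Collect Kubernetes events from the last 4 hours.
# The default table prints LAST SEEN as a relative age, so select the absolute
# lastTimestamp as the first column before filtering (date -d needs GNU date).
# Events without a lastTimestamp print as <none> and are skipped here.
cutoff="$(date -u -d '4 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
kubectl get events -A --sort-by=.lastTimestamp --no-headers \
  -o custom-columns='TIME:.lastTimestamp,NAMESPACE:.metadata.namespace,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message' |
  awk -v cutoff="$cutoff" '$1 != "<none>" && $1 >= cutoff'

# Get Flux reconciliation history (last transition of the Ready condition)
kubectl get kustomization -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastAttemptedRevision}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'

# Check HelmRelease history (last transition of the Ready condition)
kubectl get helmrelease -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastAttemptedRevision}{"\t"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}'
```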
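Once events from the different sources are collected, they still have to be merged into the Timeline table used by the post-mortem template. The sketch below assumes the events have been normalised to tab-separated `timestamp`, `event`, `source` lines; that intermediate format and the `timeline_table` name are assumptions for illustration.

```shell
# Hypothetical formatter: read "timestamp<TAB>event<TAB>source" lines on stdin
# and print them as rows of the template's Timeline table, oldest first.
timeline_table() {
  printf '| Time (UTC) | Event | Source |\n'
  printf '|------------|-------|--------|\n'
  # ISO 8601 UTC timestamps sort chronologically as plain strings
  sort -t "$(printf '\t')" -k1,1 |
    awk -F '\t' 'NF >= 3 { printf "| %s | %s | %s |\n", substr($1, 12, 5), $2, $3 }'
}
```

Only the `HH:MM` portion of each timestamp is printed, matching the template; an incident that spans midnight would need the date column kept as well.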

### 2. Terraform State Analysis

```bash
# List resources tracked in the current state
# (state history is only available from a versioned backend such as S3/GCS)
terraform state list

# Show the currently recorded attributes of a specific root-module resource
terraform show -json | jq '.values.root_module.resources[] | select(.address == "<resource>")'

# Compare states (if using versioned backend)
terraform state pull > current-state.json
# Download the previous version from the backend as previous-state.json, then compare
diff <(jq -S . previous-state.json) <(jq -S . current-state.json)
```

### 3. Failure Pattern Recognition

Query the pattern database for similar incidents:

```bash
# Search for similar symptoms in previous post-mortems
grep -r "etcd" docs/postmortems/*.md
grep -r "quorum" docs/postmortems/*.md

# Find incidents with same affected components
grep -l "helm-controller" docs/postmortems/*.md
```

#### Known Sunkworks Failure Patterns

| Pattern ID | Name | Key Symptoms | Episodes |
|------------|------|--------------|----------|
| PM-001 | Tailscale Extension Hang | Node NotReady, network init timeout | #3, #7 |
| PM-002 | etcd Disk Pressure | Leader election failures, slow API | #5 |
| PM-003 | HelmRelease Timeout | Chart fetch succeeds, install hangs | #2, #4 |
| PM-004 | Synology NFS Stale | Pods stuck ContainerCreating on mount | #6 |
| PM-005 | DNS Resolution Loop | CoreDNS->PiHole->CoreDNS cycle | #8 |
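A first-pass lookup against the known patterns can be as simple as a case-insensitive substring match on the symptom text. In the sketch below the rows are copied from the table in this section; the `match_pattern` name and the pipe-separated layout are hypothetical.

```shell
# Hypothetical lookup: match free-text symptoms against the known patterns.
# Prints matching rows; exits non-zero when nothing matches (a candidate NEW PATTERN).
match_pattern() {
  local symptom="$1"
  grep -i -F -- "$symptom" <<'EOF'
PM-001|Tailscale Extension Hang|Node NotReady, network init timeout
PM-002|etcd Disk Pressure|Leader election failures, slow API
PM-003|HelmRelease Timeout|Chart fetch succeeds, install hangs
PM-004|Synology NFS Stale|Pods stuck ContainerCreating on mount
PM-005|DNS Resolution Loop|CoreDNS->PiHole->CoreDNS cycle
EOF
}
```

For instance, `match_pattern "leader election"` prints the PM-002 row. A match is only a lead: as the Analysis Failures table notes, similarity to a previous incident can mislead.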

### 4. "What Went Wrong" Generator

Structured analysis prompts:

```markdown
## Analysis Framework

### The Failure
What specifically stopped working? Be precise:
- Service X returned errors
- Pods could not schedule
- Network connectivity lost between A and B

### The Trigger
What change or event preceded the failure?
- Deployment of version X
- Certificate expiration
- Upstream dependency failure
- "Nothing changed" (always investigate this claim)

### The Amplifier
What made this worse than it could have been?
- Retry storms
- Cascading failures
- Missing alerts
- Documentation gaps

### The Discovery
How was the problem found?
- User report
- Monitoring alert
- Manual observation
- Flux reconciliation failure
```

## Recovery Playbooks

### Generate Playbook From Incident

After resolving an incident, extract the recovery steps:

```bash
#!/bin/bash
# generate-playbook.sh
# Prints a recovery playbook skeleton for the incident title passed as $1.

title="${1:?usage: generate-playbook.sh <incident-title>}"

echo "# Recovery Playbook: $title"
echo ""
echo "## Symptoms"
echo "- Symptom 1"
echo "- Symptom 2"
echo ""
echo "## Prerequisites"
echo "- Access to cluster"
echo "- kubectl configured"
echo ""
echo "## Steps"
echo ""
echo "### Step 1: Verify Failure Mode"
echo "\`\`\`bash"
echo "# Command to confirm this is the right playbook"
echo "\`\`\`"
echo ""
echo "### Step 2: Apply Fix"
echo "\`\`\`bash"
echo "# Commands from the successful recovery"
echo "\`\`\`"
echo ""
echo "### Step 3: Verify"
echo "\`\`\`bash"
echo "# Commands to confirm recovery"
echo "\`\`\`"
echo ""
echo "## Estimated Time: X minutes"
echo ""
echo "## Related Incidents"
echo "- [Incident Title](./incident-file.md)"
```

## Integration Points

- **SOS Emergency**: Use after `/sos` recovery to document what happened
- **Flux Operator**: Pull GitOps event history for timeline
- **Prometheus Observer**: Include alert firing/resolution times

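## Episode Time Estimates

### Post-Mortem Writing Time (by complexity)

| Incident Type | Data Collection | Analysis | Writing | Total |
|---------------|-----------------|----------|---------|-------|
| Simple (one component) | 10 min | 15 min | 20 min | 45 min |
| Medium (multi-component) | 20 min | 30 min | 30 min | 1 hr 20 min |
| Complex (cascading failure) | 30 min | 45 min | 45 min | 2 hr |
| Epic (everything broke) | 45 min | 1 hr | 1 hr | 2 hr 45 min |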

## Sunkworks Episode Notes

"The audience learns more from our failures than our successes. Document both."

### Capturing Live Debug Context

During stream incidents, capture:

- Stream timestamp when symptom first noticed
- Chat suggestions that were tried
- The moment of realization (for the highlight reel)
- What the "fix" actually was vs. what chat thought it was
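Capturing this context is easier if notes are timestamped the moment they are typed. The helper below is a hypothetical sketch: the `note` function, the `INCIDENT_LOG` variable, and the tab-separated file layout are all assumptions, and it records wall-clock UTC rather than the stream offset, which still has to be noted by hand.

```shell
# Hypothetical note-taker for live debugging; names and file layout are assumptions.
# Appends "UTC time<TAB>kind<TAB>text" to the file named by INCIDENT_LOG
# (default: incident-notes.tsv in the current directory).
note() {
  local kind="$1"
  shift
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" \
    >> "${INCIDENT_LOG:-incident-notes.tsv}"
}
```

During a stream this might look like `note symptom "API server timing out"` followed later by `note chat "try restarting CoreDNS"` and `note fix "it was DNS the whole time"`.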