---
name: postmortem-author
description: 'Generate Sunkworks-style post-mortem reports with timeline reconstruction, failure pattern recognition, honest technical assessments, and recovery playbooks. Trigger with /postmortem'
allowed-tools: ['read_file', 'create_file', 'replace_string_in_file', 'run_in_terminal', 'grep_search', 'semantic_search', 'get_terminal_output']
tags: ['sunkworks', 'live-stream', 'failure-tolerant']
---

# Post-Mortem Author

**"Every failure is a future blog post."**

I generate Sunkworks-style post-mortem reports that embrace the honest, educational nature of live troubleshooting. These reports acknowledge what went wrong, document the iterative debugging process, and create actionable recovery playbooks.

**Philosophy**: Unlike corporate blameless post-mortems that often sanitize reality, Sunkworks post-mortems celebrate the messy truth of debugging live systems. We document the wrong turns, the red herrings, and the "oh, it was THAT the whole time" moments.

## Slash Command

### `/postmortem`

Generates a post-mortem report:

1. Collect Kubernetes events from the incident window
2. Analyze Terraform state history for infrastructure changes
3. Identify failure patterns from previous incidents
4. Generate structured report with timeline and recovery steps

**Usage**: Type `/postmortem` after an incident to generate the report.

**Arguments**:

- `/postmortem <hours-ago>` - Analyze last N hours (default: 4)
- `/postmortem <start-time> <end-time>` - Specific time window

**Script Verification**: Before executing, verify the script integrity:

```bash
sha256sum .github/skills/postmortem-author/scripts/collect-timeline.sh
# Expected: 9fa5f345e459dd8c05337d0ff2d026f2d3e588c3169c42a01c7ffc0a5c1b0528
```
Execute timeline collection:
bash
bash .github/skills/postmortem-author/scripts/collect-timeline.sh
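The verify-then-execute sequence can be folded into one guarded step. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is available; `verify_and_run` is a hypothetical helper name and not part of the skill itself.

```bash
# Sketch: refuse to run a script unless its SHA-256 matches a pinned value.
# verify_and_run is a hypothetical helper, not something the skill ships.
verify_and_run() {
  local script="$1" expected="$2" actual
  shift 2
  # First field of sha256sum output is the hex digest
  actual="$(sha256sum "$script" | awk '{print $1}')"
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $script: got '$actual'" >&2
    return 1
  fi
  # Remaining arguments are passed through to the script
  bash "$script" "$@"
}
```

It would be called with the script path and the expected hash documented above, for example `verify_and_run .github/skills/postmortem-author/scripts/collect-timeline.sh 9fa5f345e459dd8c05337d0ff2d026f2d3e588c3169c42a01c7ffc0a5c1b0528`.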
When I Activate
- `/postmortem` (slash command)
- "Generate post-mortem"
- "Write incident report"
- "What went wrong"
- "Document the failure"
- "Create recovery playbook"
- "Episode notes"
- "Analyze the outage"
Expected Failure Modes
Data Collection Failures
| Failure Mode | Symptoms | Workaround |
|---|---|---|
| Events pruned | Kubernetes events older than 1 hr (the API server's default `--event-ttl`) are missing | Check etcd, use Prometheus metrics |
| Terraform state locked | Cannot read state history | Use `terraform force-unlock <LOCK_ID>` or read the state file directly |
| Logs rotated | Container logs unavailable | Check persistent log storage, Loki if available |
| Time sync drift | Event timestamps don't correlate | Document clock skew, use relative ordering |
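For the time sync drift row, documenting clock skew comes down to recording a signed offset between two clocks' readings of the same moment. A minimal sketch, assuming GNU `date` (for `-d`) and ISO-8601 UTC timestamps; `clock_skew_seconds` is a hypothetical helper name.

```bash
# Sketch: signed offset in seconds between two timestamps (first minus second).
# Assumes GNU date; clock_skew_seconds is a hypothetical helper.
clock_skew_seconds() {
  local a b
  a="$(date -u -d "$1" +%s)" || return 1
  b="$(date -u -d "$2" +%s)" || return 1
  echo $(( a - b ))
}
```

A positive result means the first clock runs ahead of the second. Recording that number in the post-mortem lets readers shift one source's timestamps before correlating events.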
Analysis Failures
| Failure Mode | Symptoms | Impact / Mitigation |
|---|---|---|
| Incomplete timeline | Gaps in event sequence | May miss root cause |
| False pattern match | Similarity to previous incident misleads | Add "false alarm" to pattern database |
| Missing context | Key decisions undocumented during incident | Interview participants, check stream VOD |
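Gaps in the event sequence are easier to spot mechanically than by eye. The sketch below flags any jump larger than a threshold in a sorted list of timestamps. It assumes GNU `date`, one ISO-8601 UTC timestamp per line on stdin, and a hypothetical helper name `find_timeline_gaps`; the 600-second default is an arbitrary illustration.

```bash
# Sketch: report gaps longer than a threshold (seconds) between consecutive
# timestamps read from stdin. Input must already be sorted oldest-first.
find_timeline_gaps() {
  local threshold="${1:-600}" prev="" prev_s=0 ts s
  while IFS= read -r ts; do
    [ -n "$ts" ] || continue
    s="$(date -u -d "$ts" +%s)" || return 1
    if [ -n "$prev" ] && [ $(( s - prev_s )) -gt "$threshold" ]; then
      echo "gap of $(( s - prev_s ))s between $prev and $ts"
    fi
    prev="$ts"
    prev_s="$s"
  done
}
```

Each reported gap is a prompt to look at another source (logs, stream VOD, chat) for what happened in that window.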
Post-Mortem Template
Standard Sunkworks Format
markdown
# Post-Mortem: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: [CRITICAL|HIGH|MEDIUM|LOW]
**Episode**: Sunkworks #NN (if applicable)

## TL;DR

One paragraph summary. What broke, how we fixed it, what we learned.

## Timeline

| Time (UTC) | Event | Source |
|------------|-------|--------|
| HH:MM | First symptom observed | Prometheus alert |
| HH:MM | Investigation started | Stream timestamp |
| HH:MM | Root cause identified | kubectl describe |
| HH:MM | Fix applied | git commit SHA |
| HH:MM | Service restored | Flux reconciliation |

## What Went Wrong

### Root Cause

Technical explanation of the failure. Be specific.

### Contributing Factors

- Factor 1: Why this made things worse
- Factor 2: Why detection was delayed
- Factor 3: Why recovery took longer than expected

### Red Herrings

Things we investigated that weren't the problem:

- Thing 1: Why we thought it was this, why it wasn't
- Thing 2: The suspicious timing that was coincidental

## The Debugging Journey

### Attempt 1: [What We Tried]

**Hypothesis**: What we thought was wrong
**Action**: What we did
**Result**: What happened
**Time Spent**: X minutes

### Attempt 2: [What We Tried Next]

**Hypothesis**: Refined theory
**Action**: Different approach
**Result**: Getting warmer / still wrong
**Time Spent**: X minutes

### The Breakthrough

What finally led us to the answer. Often: "Then we noticed..."

## Recovery Steps

1. Step one (include exact commands)
2. Step two
3. Step three
4. Verification that it worked

## Failure Pattern Analysis

### Have We Seen This Before?

- Episode N: Similar symptoms, different cause
- Episode M: Same root cause, different symptoms

### Pattern Category

[NEW PATTERN | KNOWN VARIANT | RECURRENCE]

### Pattern Triggers

What conditions led to this failure:

- Trigger 1
- Trigger 2

## Prevention & Detection

### Immediate Actions (This Week)

- [ ] Action 1: Owner: @name
- [ ] Action 2: Owner: @name

### Long-term Improvements

- [ ] Improvement 1: Target date
- [ ] Improvement 2: Target date

### New Alerts/Monitors

- Alert 1: What it detects
- Alert 2: Earlier warning for this failure mode

## Metrics

- **Time to Detect (TTD)**: X minutes from failure to first alert
- **Time to Understand (TTU)**: X minutes from alert to root cause
- **Time to Recover (TTR)**: X minutes from root cause to resolution
- **Total Downtime**: X hours Y minutes

## The Honest Assessment

### What We Did Well

- Thing 1
- Thing 2

### What We Could Have Done Better

- Thing 1: How it would have helped
- Thing 2: Why we didn't do it

### The Lesson

One key takeaway from this incident.

## Recovery Playbook

For future occurrences of this failure pattern:

\`\`\`bash
# Step 1: Verify the failure mode
command to check
# Step 2: Apply the fix
command to fix
# Step 3: Verify recovery
command to verify
\`\`\`

Estimated recovery time: X minutes (if you follow this playbook)
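The Metrics section of the template follows directly from four timeline timestamps. As a rough sketch, assuming GNU `date` and ISO-8601 UTC inputs, a hypothetical `incident_metrics` helper could fill in the numbers using the definitions given in the template (failure to first alert, alert to root cause, root cause to resolution):

```bash
# Sketch: derive TTD/TTU/TTR and total downtime from four timestamps.
# Arguments: failure start, first alert, root cause identified, service restored.
incident_metrics() {
  local failure alert cause restored total
  failure="$(date -u -d "$1" +%s)" || return 1
  alert="$(date -u -d "$2" +%s)" || return 1
  cause="$(date -u -d "$3" +%s)" || return 1
  restored="$(date -u -d "$4" +%s)" || return 1
  echo "TTD: $(( (alert - failure) / 60 )) minutes"
  echo "TTU: $(( (cause - alert) / 60 )) minutes"
  echo "TTR: $(( (restored - cause) / 60 )) minutes"
  total=$(( (restored - failure) / 60 ))
  echo "Total Downtime: $(( total / 60 )) hours $(( total % 60 )) minutes"
}
```

Integer division truncates partial minutes, which is usually precise enough for a post-mortem.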
Core Capabilities
1. Timeline Reconstruction
bash
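# Collect Kubernetes events from the last 4 hours (requires GNU date).
# The default table prints a relative age, so emit .lastTimestamp explicitly
# and compare it as an RFC 3339 string. Events with no lastTimestamp print
# "<none>" and are kept.
cutoff="$(date -u -d '4 hours ago' +%Y-%m-%dT%H:%M:%SZ)"
kubectl get events -A --sort-by=.lastTimestamp --no-headers \
  -o custom-columns='TIME:.lastTimestamp,NS:.metadata.namespace,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message' | \
  awk -v cutoff="$cutoff" '$1 >= cutoff'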
# Get Flux reconciliation history
kubectl get kustomization -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastAttemptedRevision}{"\t"}{.status.conditions[-1].lastTransitionTime}{"\n"}{end}'
# Check HelmRelease history
kubectl get helmrelease -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastAttemptedRevision}{"\t"}{.status.conditions[-1].lastTransitionTime}{"\n"}{end}'
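Once events are collected from several sources, they still need to be merged into one ordered timeline. The sketch below assumes each source has been saved as a tab-separated file of `timestamp<TAB>source<TAB>event` lines with ISO-8601 UTC timestamps; that file layout and the `merge_timeline` name are illustrative assumptions, not something the skill defines.

```bash
# Sketch: merge per-source TSV files into rows for the post-mortem
# Timeline table (| Time (UTC) | Event | Source |).
merge_timeline() {
  local tab
  tab="$(printf '\t')"
  # ISO-8601 UTC timestamps sort chronologically as plain strings
  cat "$@" | LC_ALL=C sort -s -t "$tab" -k1,1 |
    awk -F '\t' '{ printf "| %s | %s | %s |\n", substr($1, 12, 5), $3, $2 }'
}
```

`substr($1, 12, 5)` keeps only the `HH:MM` part, matching the template's Time column.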
2. Terraform State Analysis
bash
# List resources tracked in the current Terraform state
terraform state list
# Show the current attributes of a specific resource
terraform show -json | jq '.values.root_module.resources[] | select(.address == "<resource>")'
# Compare states (if using a versioned backend such as S3/GCS)
terraform state pull > current-state.json
# Download the previous version from the backend as previous-state.json, then compare
diff <(jq -S . previous-state.json) <(jq -S . current-state.json)
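A full JSON diff of two states can be noisy. For a timeline, it is often enough to know which resource addresses appeared or disappeared. The following sketch assumes both files come from `terraform state pull` (state format version 4, with a top-level `resources` array) and that `jq` is installed; `state_addresses` and `state_diff` are hypothetical helper names.

```bash
# Sketch: list resource addresses found in a pulled state file.
# Assumes the v4 state layout: .resources[] with mode/type/name and optional module.
state_addresses() {
  jq -r '.resources[]
         | [ (.module // empty),
             (if .mode == "data" then "data" else empty end),
             .type, .name ]
         | join(".")' "$1" | sort -u
}

# Print "- address" for resources only in the previous state and
# "+ address" for resources only in the current state.
state_diff() {
  local prev_list curr_list
  prev_list="$(mktemp)"
  curr_list="$(mktemp)"
  state_addresses "$1" > "$prev_list"
  state_addresses "$2" > "$curr_list"
  comm -23 "$prev_list" "$curr_list" | sed 's/^/- /'
  comm -13 "$prev_list" "$curr_list" | sed 's/^/+ /'
  rm -f "$prev_list" "$curr_list"
}
```

Usage would be `state_diff previous-state.json current-state.json`. Attribute-level changes to resources present in both states are not reported, so the sorted `diff` remains the fallback for those.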
3. Failure Pattern Recognition
Query the pattern database for similar incidents:
bash
# Search for similar symptoms in previous post-mortems
grep -r "etcd" docs/postmortems/*.md
grep -r "quorum" docs/postmortems/*.md
# Find incidents with same affected components
grep -l "helm-controller" docs/postmortems/*.md
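When an incident has several symptoms, ranking previous post-mortems by how many of those keywords they mention can surface the closest match first. This is a minimal sketch under the assumption that post-mortems live as `.md` files in one directory, as in the commands above; `rank_postmortems` is a hypothetical helper name.

```bash
# Sketch: rank post-mortem files by how many of the given keywords they contain.
# Usage: rank_postmortems <dir> <keyword>...
rank_postmortems() {
  local dir="$1" tab file kw hits
  shift
  tab="$(printf '\t')"
  for file in "$dir"/*.md; do
    [ -f "$file" ] || continue
    hits=0
    for kw in "$@"; do
      # Case-insensitive fixed-string match; count each keyword at most once
      if grep -qiF -- "$kw" "$file"; then
        hits=$(( hits + 1 ))
      fi
    done
    if [ "$hits" -gt 0 ]; then
      printf '%d\t%s\n' "$hits" "$file"
    fi
  done | sort -t "$tab" -k1,1nr -k2,2
}
```

For example, `rank_postmortems docs/postmortems etcd quorum` prints matching files with the highest keyword count first. A high score is only a hint: the "false pattern match" failure mode above still applies.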
Known Sunkworks Failure Patterns
| Pattern ID | Name | Key Symptoms | Episodes |
|---|---|---|---|
| PM-001 | Tailscale Extension Hang | Node NotReady, network init timeout | #3, #7 |
| PM-002 | etcd Disk Pressure | Leader election failures, slow API | #5 |
| PM-003 | HelmRelease Timeout | Chart fetch succeeds, install hangs | #2, #4 |
| PM-004 | Synology NFS Stale | Pods stuck ContainerCreating on mount | #6 |
| PM-005 | DNS Resolution Loop | CoreDNS->PiHole->CoreDNS cycle | #8 |
4. "What Went Wrong" Generator
Structured analysis prompts:
markdown
## Analysis Framework

### The Failure

What specifically stopped working? Be precise:

- Service X returned errors
- Pods could not schedule
- Network connectivity lost between A and B

### The Trigger

What change or event preceded the failure?

- Deployment of version X
- Certificate expiration
- Upstream dependency failure
- "Nothing changed" (always investigate this claim)

### The Amplifier

What made this worse than it could have been?

- Retry storms
- Cascading failures
- Missing alerts
- Documentation gaps

### The Discovery

How was the problem found?

- User report
- Monitoring alert
- Manual observation
- Flux reconciliation failure
Recovery Playbooks
Generate Playbook From Incident
After resolving an incident, extract the recovery steps:
bash
#!/bin/bash
# generate-playbook.sh
echo "# Recovery Playbook: $1"
echo ""
echo "## Symptoms"
echo "- Symptom 1"
echo "- Symptom 2"
echo ""
echo "## Prerequisites"
echo "- Access to cluster"
echo "- kubectl configured"
echo ""
echo "## Steps"
echo ""
echo "### Step 1: Verify Failure Mode"
echo "\`\`\`bash"
echo "# Command to confirm this is the right playbook"
echo "\`\`\`"
echo ""
echo "### Step 2: Apply Fix"
echo "\`\`\`bash"
echo "# Commands from the successful recovery"
echo "\`\`\`"
echo ""
echo "### Step 3: Verify"
echo "\`\`\`bash"
echo "# Commands to confirm recovery"
echo "\`\`\`"
echo ""
echo "## Estimated Time: X minutes"
echo ""
echo "## Related Incidents"
echo "- [Incident Title](./incident-file.md)"
Integration Points
- SOS Emergency: Use after `/sos` recovery to document what happened
- Flux Operator: Pull GitOps event history for timeline
- Prometheus Observer: Include alert firing/resolution times
Episode Time Estimates
Post-Mortem Writing Time (by complexity)
| Incident Type | Data Collection | Analysis | Writing | Total |
|---|---|---|---|---|
| Simple (one component) | 10 min | 15 min | 20 min | 45 min |
| Medium (multi-component) | 20 min | 30 min | 30 min | 1 hr 20 min |
| Complex (cascading failure) | 30 min | 45 min | 45 min | 2 hr |
| Epic (everything broke) | 45 min | 1 hr | 1 hr | 2 hr 45 min |
Sunkworks Episode Notes
"The audience learns more from our failures than our successes. Document both."
Capturing Live Debug Context
During stream incidents, capture:
- Stream timestamp when symptom first noticed
- Chat suggestions that were tried
- The moment of realization (for the highlight reel)
- What the "fix" actually was vs. what chat thought it was
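Stream timestamps are offsets into the broadcast, while the post-mortem timeline is in UTC, so the two need converting before they can be correlated. A small sketch, assuming GNU `date`, a known stream start time in ISO-8601 UTC, and an `HH:MM:SS` offset; `stream_offset_to_utc` is a hypothetical helper name.

```bash
# Sketch: convert a stream offset (HH:MM:SS) into a UTC timestamp,
# given the stream's start time.
stream_offset_to_utc() {
  local start_s offset_s
  start_s="$(date -u -d "$1" +%s)" || return 1
  # awk treats zero-padded fields like "08" as decimal, unlike shell arithmetic
  offset_s="$(printf '%s\n' "$2" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')"
  date -u -d "@$(( start_s + offset_s ))" +%Y-%m-%dT%H:%M:%SZ
}
```

For a stream that started at 18:00:00 UTC, `stream_offset_to_utc 2024-01-01T18:00:00Z 01:08:09` yields the UTC time to enter in the Timeline table.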