Juicebox Incident Runbook
Overview
Standardized incident response procedures for Juicebox integration issues.
Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 | Critical | < 15 min | Complete outage, data loss |
| P2 | High | < 1 hour | Major feature broken, degraded performance |
| P3 | Medium | < 4 hours | Minor feature issue, workaround exists |
| P4 | Low | < 24 hours | Cosmetic, non-blocking |
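For alerting or paging scripts, the response-time column above can be encoded as a lookup. This is a minimal sketch; `response_minutes` is a hypothetical helper name, and the minute values are simply the upper bounds from the table.

```bash
# response_minutes SEVERITY
# Prints the response-time target in minutes for a severity level,
# using the upper bounds from the severity table.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P2` prints `60`, and an unrecognized level returns a non-zero exit code so callers can fail fast.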
Quick Diagnostics
Step 1: Immediate Assessment
```bash
#!/bin/bash
# quick-diag.sh - Run immediately when an incident is detected

echo "=== Juicebox Quick Diagnostics ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check the Juicebox status page
echo ""
echo "=== Juicebox Status ==="
curl -s https://status.juicebox.ai/api/status | jq '.status'

# Check our API health
echo ""
echo "=== Our API Health ==="
curl -s http://localhost:8080/health/ready | jq '.'

# Check recent error logs
echo ""
echo "=== Recent Errors (last 5 min) ==="
kubectl logs -l app=juicebox-integration --since=5m | grep -i error | tail -20

# Check metrics.
# -G with --data-urlencode sends the query as an encoded GET parameter, so the
# braces and brackets are not interpreted as curl URL globs or shell patterns.
echo ""
echo "=== Error Rate ==="
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(juicebox_requests_total{status="error"}[5m])' \
  | jq '.data.result[0].value[1]'
```
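The error-rate value printed by the diagnostics is a decimal number, which plain shell arithmetic cannot compare. The sketch below shows one way to test it against a threshold with `awk`. The helper name `rate_exceeds` and the `0.05` threshold in the usage comment are placeholders, not values from this runbook.

```bash
# rate_exceeds VALUE THRESHOLD
# Succeeds (exit 0) when VALUE is strictly greater than THRESHOLD.
# awk does the comparison because shell arithmetic is integer-only.
# Non-numeric input such as "null" is treated as 0.
rate_exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Usage, with a placeholder threshold of 0.05 errors per second:
#   rate=$(curl -s -G http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=rate(juicebox_requests_total{status="error"}[5m])' \
#     | jq -r '.data.result[0].value[1]')
#   if rate_exceeds "$rate" 0.05; then echo "Error rate elevated: $rate"; fi
```

Using `jq -r` in the usage comment strips the JSON quotes so that `awk` receives a bare number.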
Step 2: Identify Root Cause
Incident triage decision tree:

1. Is the Juicebox status page showing issues?
   - YES → External outage; skip to "External Outage Response"
   - NO → Continue
2. Are we getting authentication errors (401)?
   - YES → Check API key validity; skip to "Auth Issues Response"
   - NO → Continue
3. Are we getting rate limited (429)?
   - YES → Skip to "Rate Limit Response"
   - NO → Continue
4. Are requests timing out?
   - YES → Skip to "Timeout Response"
   - NO → Continue
5. Are we getting unexpected errors?
   - YES → Skip to "Application Error Response"
   - NO → Gather more data
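The decision tree can be expressed as a small function that checks the conditions in the same order. This is a sketch only: `triage_route` and its three arguments are hypothetical, and treating any other HTTP status of 400 or above as an "unexpected error" is an assumption.

```bash
# triage_route STATUS_PAGE_OK HTTP_CODE TIMED_OUT
#   STATUS_PAGE_OK: "yes" if the Juicebox status page shows no issues
#   HTTP_CODE:      last HTTP status code seen from the Juicebox API
#   TIMED_OUT:      "yes" if requests are timing out
# Prints the name of the response procedure to follow.
triage_route() {
  if [ "$1" != "yes" ]; then
    echo "External Outage Response"
  elif [ "$2" = "401" ]; then
    echo "Auth Issues Response"
  elif [ "$2" = "429" ]; then
    echo "Rate Limit Response"
  elif [ "$3" = "yes" ]; then
    echo "Timeout Response"
  elif [ "$2" -ge 400 ] 2>/dev/null; then
    # Assumption: any other 4xx/5xx counts as an unexpected error.
    echo "Application Error Response"
  else
    echo "Gather more data"
  fi
}
```

For example, `triage_route yes 429 no` prints `Rate Limit Response`. The order of the checks matters: a status-page incident takes priority over anything observed locally.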
Response Procedures
External Outage Response
When Juicebox is down:

1. **Confirm Outage**
   - Check https://status.juicebox.ai
   - Verify with a curl test against the API
2. **Enable Fallback Mode**
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=true
   ```
3. **Notify Stakeholders**
   - Post to the #incidents channel
   - Update the status page if the outage is customer-facing
4. **Monitor Recovery**
   - Set up an alert for a Juicebox status change
   - Prepare to disable fallback mode
5. **Post-Incident**
   - Disable fallback mode when Juicebox recovers
   - Document the timeline and impact
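The "Monitor Recovery" step can be approximated with a polling loop. The sketch below is one possible approach, not an established part of this integration: `wait_for_status` and `juicebox_status` are hypothetical helpers, and the expected value `operational` is an assumption about what the status endpoint's `.status` field returns.

```bash
# juicebox_status: prints the .status field from the status page API.
juicebox_status() {
  curl -s https://status.juicebox.ai/api/status | jq -r '.status'
}

# wait_for_status EXPECTED INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until its output equals EXPECTED.
# Returns 0 on a match, 1 if MAX_ATTEMPTS is reached first.
wait_for_status() {
  local expected="$1" interval="$2" max="$3" attempt=1
  shift 3
  while [ "$attempt" -le "$max" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage: poll once a minute for up to 30 minutes, then prompt the operator.
# "operational" is an assumed status value; confirm it against the real API.
#   wait_for_status operational 60 30 juicebox_status \
#     && echo "Juicebox recovered: disable fallback mode"
```

Passing the check as a command keeps the loop independent of the status API, so the same function can poll the local health endpoint as well.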
Auth Issues Response
When authentication fails:

1. **Verify API Key**
   ```bash
   # Print only a prefix so the full key never appears in logs
   echo "Key prefix: ${JUICEBOX_API_KEY:0:10}..."
   # Test auth
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/auth/me
   ```
2. **Check Key Status in Dashboard**
   - Log into https://app.juicebox.ai
   - Verify the key is active and not revoked
3. **Rotate Key if Compromised**
   - Generate a new key in the dashboard
   - Update the secret manager
   - Restart pods
   ```bash
   kubectl rollout restart deployment/juicebox-integration
   ```
4. **Verify Recovery**
   - Check the health endpoint
   - Monitor the error rate
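The key check above can be wrapped so that it reports only the HTTP status code and never echoes the full key. This is a sketch under assumptions: the helper names are hypothetical, and reading `200` as "key accepted" is an assumption about the Juicebox API. Only the 401 case comes from this runbook.

```bash
# mask_key KEY [VISIBLE]
# Prints the first VISIBLE characters of KEY (default 10) followed by "...".
mask_key() {
  printf '%s...\n' "$(printf '%s' "$1" | cut -c "1-${2:-10}")"
}

# auth_status_code: prints only the HTTP status code of the auth check.
# curl prints 000 when no HTTP response was received at all.
auth_status_code() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $JUICEBOX_API_KEY" \
    https://api.juicebox.ai/v1/auth/me
}

# explain_auth_code CODE
# Maps a status code to a next step. The 200 meaning is an assumption.
explain_auth_code() {
  case "$1" in
    200) echo "Key accepted" ;;
    401) echo "Key rejected: check whether it is revoked or mistyped" ;;
    000) echo "No HTTP response: check the network before rotating the key" ;;
    *)   echo "Unexpected status $1: gather more data" ;;
  esac
}

# Usage:
#   echo "Key prefix: $(mask_key "$JUICEBOX_API_KEY")"
#   explain_auth_code "$(auth_status_code)"
```

Separating the `000` case helps avoid rotating a key when the real problem is connectivity.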
Rate Limit Response
When rate limited:

1. **Check Current Usage**
   ```bash
   curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
     https://api.juicebox.ai/v1/usage
   ```
2. **Immediate Mitigation**
   - Enable aggressive caching
   - Reduce the request rate
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_RATE_LIMIT=10
   ```
3. **If Quota Exhausted**
   - Contact Juicebox support for a temporary increase
   - Implement request queuing
4. **Long-term Fix**
   - Review usage patterns
   - Implement better caching
   - Consider a plan upgrade
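One common way to reduce the request rate from scripts is to retry 429 responses with exponential backoff. The sketch below illustrates that technique; it is not something this integration is known to ship. `backoff_delay`, `retry_on_429`, and the `BACKOFF_BASE` variable are hypothetical names, and the 1-second base and 60-second cap are placeholder defaults.

```bash
# backoff_delay ATTEMPT [BASE_SECONDS] [MAX_SECONDS]
# Prints the wait before retry number ATTEMPT (1-based):
# BASE * 2^(ATTEMPT-1), capped at MAX.
backoff_delay() {
  local attempt="$1" base="${2:-1}" max="${3:-60}"
  local delay="$base" i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max" ]; then
    delay="$max"
  fi
  echo "$delay"
}

# retry_on_429 MAX_ATTEMPTS COMMAND [ARGS...]
# COMMAND must print an HTTP status code. Retries while it prints 429,
# sleeping with exponential backoff between attempts. Prints the final
# code and returns 1 if that code is still 429.
retry_on_429() {
  local max="$1" attempt=1 code
  shift
  while :; do
    code="$("$@")"
    if [ "$code" != "429" ] || [ "$attempt" -ge "$max" ]; then
      echo "$code"
      [ "$code" != "429" ]
      return
    fi
    sleep "$(backoff_delay "$attempt" "${BACKOFF_BASE:-1}")"
    attempt=$((attempt + 1))
  done
}
```

With the defaults, the waits grow as 1, 2, 4, 8 seconds and so on up to the cap. Backoff only smooths short bursts; an exhausted quota still needs the steps listed under "If Quota Exhausted".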
Timeout Response
When requests time out:

1. **Check Network**
   ```bash
   # DNS resolution
   nslookup api.juicebox.ai
   # Connectivity
   curl -v --connect-timeout 5 https://api.juicebox.ai/v1/health
   ```
2. **Check Load**
   - Review query complexity
   - Check for unusually large requests
3. **Increase Timeout**
   ```bash
   kubectl set env deployment/juicebox-integration JUICEBOX_TIMEOUT=60000
   ```
4. **Implement Circuit Breaker**
   - Enable the circuit breaker if timeouts repeat
   - Serve cached/fallback data
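To make the circuit-breaker step concrete, here is a deliberately simplified sketch: it opens after a number of consecutive failures and stays open until reset by hand. A production breaker would normally add a cooldown and a half-open probe state, which this example omits. All names (`CB_THRESHOLD`, `guarded_call`, and so on) and the default threshold of 3 are hypothetical.

```bash
# Consecutive failures before the breaker opens (placeholder default).
CB_THRESHOLD="${CB_THRESHOLD:-3}"
CB_FAILURES=0
CB_STATE=closed

# cb_allow: succeeds while the breaker is closed.
cb_allow() { [ "$CB_STATE" = "closed" ]; }

# cb_record_success: resets the failure count and closes the breaker.
cb_record_success() { CB_FAILURES=0; CB_STATE=closed; }

# cb_record_failure: counts a failure and opens the breaker at the threshold.
cb_record_failure() {
  CB_FAILURES=$((CB_FAILURES + 1))
  if [ "$CB_FAILURES" -ge "$CB_THRESHOLD" ]; then
    CB_STATE=open
  fi
}

# guarded_call COMMAND [ARGS...]
# Runs COMMAND only while the breaker is closed.
# Returns 0 on success, 1 on failure, 2 when short-circuited.
guarded_call() {
  if ! cb_allow; then
    return 2
  fi
  if "$@"; then
    cb_record_success
  else
    cb_record_failure
    return 1
  fi
}

# Usage: serve cached data when the call fails or is short-circuited.
#   guarded_call curl -sf --max-time 5 https://api.juicebox.ai/v1/health \
#     || echo "serving cached/fallback data"
```

The state lives in shell variables, so it applies only within a single script run. Call `cb_record_success` to close the breaker again once Juicebox responds normally.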
Incident Communication Template
```markdown
## Incident Report Template

**Incident ID:** INC-YYYY-MM-DD-XXX
**Status:** Investigating | Identified | Monitoring | Resolved
**Severity:** P1 | P2 | P3 | P4
**Start Time:** YYYY-MM-DD HH:MM UTC
**End Time:** (when resolved)

### Summary
[Brief description of the incident]

### Impact
- Users affected: [number/percentage]
- Features affected: [list]
- Duration: [time]

### Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Incident resolved

### Root Cause
[Description of what caused the incident]

### Resolution
[What was done to fix it]

### Action Items
- [ ] Action 1 (Owner, Due Date)
- [ ] Action 2 (Owner, Due Date)
```
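The header fields of the template can be generated rather than typed by hand. The sketch below assumes that the `XXX` suffix of the incident ID is a zero-padded daily sequence number, which the template does not actually specify. `incident_id` and `report_header` are hypothetical helper names.

```bash
# incident_id DATE SEQ
# Formats an ID as INC-YYYY-MM-DD-XXX, zero-padding SEQ to three digits.
# Pass SEQ without leading zeros (7, not 007) so printf reads it as decimal.
incident_id() {
  printf 'INC-%s-%03d\n' "$1" "$2"
}

# report_header ID STATUS SEVERITY START_TIME
# Prints the first four fields of the incident report template.
report_header() {
  printf '**Incident ID:** %s\n' "$1"
  printf '**Status:** %s\n' "$2"
  printf '**Severity:** %s\n' "$3"
  printf '**Start Time:** %s UTC\n' "$4"
}

# Usage:
#   id=$(incident_id "$(date -u +%Y-%m-%d)" 1)
#   report_header "$id" Investigating P2 "$(date -u '+%Y-%m-%d %H:%M')"
```

Using `date -u` keeps the ID and start time in UTC, matching the template's time format.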
On-Call Checklist
```markdown
## On-Call Handoff Checklist

### Before Shift
- [ ] Access to all monitoring dashboards
- [ ] VPN/access to production systems
- [ ] Runbook bookmarked
- [ ] Escalation contacts available

### During Shift
- [ ] Check dashboards every 30 min
- [ ] Respond to alerts within SLA
- [ ] Document all incidents
- [ ] Escalate P1/P2 immediately

### End of Shift
- [ ] Hand off open incidents
- [ ] Update incident log
- [ ] Brief incoming on-call
```
Output
- Diagnostic scripts
- Response procedures
- Communication templates
- On-call checklists
Resources
Next Steps
After the incident is resolved, see juicebox-data-handling for data management.