RCA-CI: Root Cause Analysis from CI Logs
Systematic root cause analysis when you only have access to CI logs and artifacts.
rca:ci vs rca:hypershift
| Aspect | rca:ci | rca:hypershift |
|---|---|---|
| Access | CI logs only | Full cluster access |
| Data | Static artifacts | Live state + logs |
| When | CI failure, no cluster | Have running cluster |
| Escalate | → rca:hypershift if need live debugging | N/A |
When to Use
- •CI pipeline failed
- •No HyperShift cluster available
- •Need to understand failure before deciding on fix
- •Before using
tdd:cito iterate on fixes
Auto-approved:
ghcommands for downloading CI logs and artifacts are auto-approved. Analysis output is saved to/tmp/kagenti/rca/.
RCA Workflow
code
┌─────────────────────────────────────────────────────────────────┐ │ 1. GATHER │ │ ├─ Download CI logs │ │ ├─ Get test output │ │ └─ Check artifacts │ ├─────────────────────────────────────────────────────────────────┤ │ 2. ISOLATE │ │ ├─ Find first failure │ │ ├─ Trace error chain │ │ └─ Identify failing component │ ├─────────────────────────────────────────────────────────────────┤ │ 3. HYPOTHESIZE │ │ ├─ List possible causes │ │ ├─ Rank by likelihood │ │ └─ Identify evidence needed │ ├─────────────────────────────────────────────────────────────────┤ │ 4. VERIFY │ │ ├─ Check logs for evidence │ │ ├─ Confirm or eliminate hypotheses │ │ └─ If inconclusive → escalate to rca:hypershift │ ├─────────────────────────────────────────────────────────────────┤ │ 5. DOCUMENT │ │ ├─ Root cause identified │ │ ├─ Evidence supporting conclusion │ │ └─ Proposed fix │ └─────────────────────────────────────────────────────────────────┘
Phase 1: Gather CI Artifacts
Create working directory for analysis:
bash
mkdir -p /tmp/kagenti/rca
bash
# Get failed run ID gh run list --status failure --limit 5
bash
# Download logs to working directory gh run view <run-id> --log-failed > /tmp/kagenti/rca/failed-log.txt
bash
# View in browser for full context gh run view <run-id> --web
bash
# Download all artifacts to working directory gh run download <run-id> -D /tmp/kagenti/rca/artifacts
Phase 2: Isolate the Failure
Find First Error
bash
# Search for common error patterns in downloaded logs grep -r "error\|Error\|ERROR\|failed\|Failed\|FAILED" ./*.txt | head -20 # Find assertion failures grep -r "AssertionError\|assert\|FAILED" ./*.txt # Find the first occurrence grep -rn "error" ./*.txt | sort -t: -k2 -n | head -10
Error Chain Analysis
Work backwards from the failure:
- •What test failed?
- •What assertion failed?
- •What was the actual vs expected value?
- •What API/operation produced the wrong result?
- •What component is responsible?
Phase 3: Hypothesize Causes
Common CI Failure Categories
| Category | Signs | Check |
|---|---|---|
| Timing | "timeout", flaky results | Race conditions, slow startup |
| Config | "not found", "invalid" | Missing env vars, wrong paths |
| Auth | "401", "403", "unauthorized" | Token issues, client config |
| Network | "connection refused", "timeout" | Service not ready, DNS |
| State | Works locally, fails CI | Order dependency, cleanup |
| Resource | "OOM", "evicted" | Memory/CPU limits |
Hypothesis Template
markdown
## Hypothesis 1: [Brief description] - **Likelihood**: High/Medium/Low - **Evidence needed**: [What to look for] - **Found**: [Yes/No/Partial] - **Conclusion**: [Confirmed/Eliminated/Inconclusive]
Phase 4: Verify with Evidence
Search Patterns
bash
# Auth issues grep -i "oauth\|token\|401\|403\|unauthorized" logs/*.txt # Timing issues grep -i "timeout\|timed out\|deadline\|retry" logs/*.txt # Connection issues grep -i "connection\|refused\|unreachable\|dns" logs/*.txt # Resource issues grep -i "oom\|memory\|evict\|limit" logs/*.txt
Cross-Reference
- •Compare with last successful run
- •Check if same test passed before
- •Look for recent code changes in affected area
Phase 5: Document Findings
markdown
## Root Cause Analysis **Failure**: [Test name / description] **Run ID**: [gh run id] ### Root Cause [Clear statement of what caused the failure] ### Evidence 1. [Log line / artifact showing the issue] 2. [Supporting evidence] ### Contributing Factors - [Any secondary causes] ### Fix [Proposed solution] ### Prevention [How to prevent recurrence]
Escalation to rca:hypershift
Escalate when:
- •Logs are insufficient to determine root cause
- •Need to inspect live state (secrets, configs, pod status)
- •Need to reproduce with debugging enabled
- •Multiple hypotheses remain after log analysis
code
rca:ci inconclusive? → Create cluster → rca:hypershift
Quick Reference
| Task | Command |
|---|---|
| List failed runs | gh run list --status failure |
| View failed logs | gh run view <id> --log-failed |
| Download artifacts | gh run download <id> |
| Open in browser | gh run view <id> --web |
Related Skills
- •
rca:hypershift- RCA with live cluster access - •
tdd:ci- Fix iteration after RCA - •
superpowers:systematic-debugging- General debugging approach