Calibrate
You are the engineering-manager. Analyze confidence gate decision history from shadow mode and produce calibration recommendations for threshold adjustments.
When to Use
- After a sprint closes (called by /sprint-close or manually)
- When enough shadow mode decisions accumulate (minimum 10 per gate recommended)
- Before promoting a gate from shadow to advisory mode
- When gate scores seem consistently too high or too low
Prerequisites
- .cardo/history.jsonl exists with at least 1 decision entry
- lib/decision-log.sh and lib/score-gate.sh are available
- .claude/confidence/gates.yaml has current gate configuration
Step 1: Load Decision History
source lib/decision-log.sh
source lib/score-gate.sh
- Read .cardo/history.jsonl
- If $ARGUMENTS specifies a sprint number, filter to that sprint:
  ENTRIES=$(query_decisions --sprint $SPRINT_NUM)
- If no argument, use all available data
- Count total entries. If < 5 total, warn: "Insufficient data for reliable calibration. Minimum 5 decisions per gate recommended."
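The counting and warning step above could be sketched as follows. This is a minimal illustration, not the project's implementation: `count_decisions` and `warn_if_insufficient` are hypothetical helper names, and the sketch assumes one JSON decision per non-empty line of the history file.

```shell
# Hypothetical helper: count decision entries in a JSONL file.
# Assumes one decision per non-empty line.
count_decisions() {
  history_file="$1"
  if [ ! -f "$history_file" ]; then
    echo 0
    return 0
  fi
  # grep -c exits with status 1 when the count is 0, so mask the status.
  grep -c . "$history_file" || true
}

# Hypothetical helper: print the warning from Step 1 when fewer than
# 5 decisions are available; print nothing otherwise.
warn_if_insufficient() {
  total="$1"
  if [ "$total" -lt 5 ]; then
    echo "Insufficient data for reliable calibration. Minimum 5 decisions per gate recommended."
  fi
}
```

A typical call would be `warn_if_insufficient "$(count_decisions .cardo/history.jsonl)"`.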
Step 2: Per-Gate Statistics
For each of the 4 gates (architecture, security, code_review, acceptance), compute:
| Metric | Calculation |
|---|---|
| Count | Number of decisions for this gate |
| Mean score | Average of all scores |
| Std deviation | Standard deviation of scores |
| Min / Max | Range of scores |
| Current threshold | From gates.yaml |
| Pass rate | % of scores >= threshold (hypothetical auto-approve rate) |
| Escalation rate | % of decisions recorded as escalate |
Use awk for floating-point arithmetic — no external dependencies.
# Example: extract scores for architecture gate
SCORES=$(grep '"gate":"architecture"' .cardo/history.jsonl | grep -oE '"score":[0-9]+' | cut -d: -f2)
MEAN=$(echo "$SCORES" | awk '{sum+=$1; n++} END {if(n>0) printf "%.1f", sum/n; else print "0"}')
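The remaining metrics in the table can be computed in a single awk pass. The sketch below is one possible approach under stated assumptions: `gate_stats` is a hypothetical function name, scores arrive one per line on stdin (as in `$SCORES` above), the threshold is passed as an argument rather than read from gates.yaml, and the standard deviation is the population form (divide by n), which the source does not specify.

```shell
# Hypothetical helper: gate_stats THRESHOLD
# Reads one score per line on stdin and prints count, mean, stddev,
# min, max, and hypothetical pass rate (%) on a single line.
gate_stats() {
  awk -v t="$1" '
    NF == 0 { next }                  # skip blank lines
    {
      x = $1 + 0
      n++; sum += x; sumsq += x * x
      if (n == 1 || x < min) min = x
      if (n == 1 || x > max) max = x
      if (x >= t + 0) pass++          # score meets the threshold
    }
    END {
      if (n == 0) { print "count=0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean   # population variance (assumption)
      if (var < 0) var = 0            # guard against rounding error
      printf "count=%d mean=%.1f stddev=%.1f min=%d max=%d pass_rate=%.1f\n", n, mean, sqrt(var), min, max, 100 * pass / n
    }'
}
```

For example, `echo "$SCORES" | gate_stats 75` would summarize the architecture gate against a threshold of 75.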
Step 3: Outcome Analysis (if available)
Check for outcome-tagged decisions (where "outcome" is not null):
OUTCOMES=$(grep -v '"outcome":null' .cardo/history.jsonl || true)
If outcomes exist, compute:
| Metric | Meaning |
|---|---|
| Correct | Decision matched actual outcome |
| False positive | Auto-approved but should have been escalated |
| False negative | Escalated but would have been fine to auto-approve |
| Accuracy | correct / total outcomes |
If no outcomes exist, note: "No outcome data available. Run outcome backfill after PR merges/reverts to enable accuracy tracking."
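Once the three buckets have been tallied, accuracy is a single division. The sketch below assumes every outcome-tagged decision falls into exactly one of the three buckets, so total outcomes = correct + false positives + false negatives; `outcome_accuracy` is a hypothetical name, and how decisions are classified into buckets depends on the outcome schema, which is not shown here.

```shell
# Hypothetical helper: outcome_accuracy CORRECT FALSE_POS FALSE_NEG
# Prints accuracy as a percentage, or "n/a" when there are no outcomes.
outcome_accuracy() {
  awk -v c="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
    total = c + fpos + fneg           # assumes the buckets are exhaustive
    if (total == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * c / total
  }'
}
```

For instance, 18 correct decisions with 1 false positive and 1 false negative gives 18 / 20 = 90.0%.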
Step 4: Threshold Recommendations
For each gate, recommend a threshold adjustment based on:
- **Score clustering:** if scores cluster well above threshold, consider raising it (tighter quality bar). If scores cluster below, the threshold may be unrealistic.
- **Pass rate targeting:** aim for 70-85% hypothetical pass rate:
  - Pass rate > 90% → threshold may be too low (recommend raising)
  - Pass rate 70-90% → threshold is well-calibrated
  - Pass rate < 70% → threshold may be too high (recommend lowering)
  - Pass rate < 30% → threshold is significantly misaligned
- **Outcome accuracy** (if available):
  - High false-positive rate → raise threshold
  - High false-negative rate → lower threshold
- **Mode promotion readiness:**
  - Shadow → Advisory requires: >= 20 decisions AND pass rate 70-90% AND no false positives
  - Advisory → Autonomous requires: >= 50 decisions AND accuracy > 95% AND pass rate 70-85%
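The two promotion rules above translate directly into boolean checks. The following is a minimal sketch: `ready_for_advisory` and `ready_for_autonomous` are hypothetical names, and pass rate and accuracy are assumed to be passed as percentages (for example `82.5`).

```shell
# Hypothetical helper: ready_for_advisory COUNT PASS_RATE FALSE_POSITIVES
# Exit status 0 when the Shadow -> Advisory criteria all hold.
ready_for_advisory() {
  awk -v n="$1" -v rate="$2" -v fpos="$3" 'BEGIN {
    ok = (n + 0 >= 20 && rate + 0 >= 70 && rate + 0 <= 90 && fpos + 0 == 0)
    exit (ok ? 0 : 1)
  }'
}

# Hypothetical helper: ready_for_autonomous COUNT ACCURACY PASS_RATE
# Exit status 0 when the Advisory -> Autonomous criteria all hold.
ready_for_autonomous() {
  awk -v n="$1" -v acc="$2" -v rate="$3" 'BEGIN {
    ok = (n + 0 >= 50 && acc + 0 > 95 && rate + 0 >= 70 && rate + 0 <= 85)
    exit (ok ? 0 : 1)
  }'
}
```

Because the functions report through their exit status, they compose with ordinary shell conditionals, such as `if ready_for_advisory 25 80 0; then echo "Yes"; else echo "No"; fi`.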
Produce a recommendation for each gate:
- No change: threshold is well-calibrated
- Raise to N: too many would-be auto-approves
- Lower to N: threshold is unrealistically high
- Insufficient data: need more decisions before recommending
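As an illustration, the pass-rate bands from this step can be mapped to one of the four recommendation labels. This sketch covers only the pass-rate signal: it does not pick the new threshold value N, and it ignores score clustering and outcome accuracy, which still need judgment. The name `recommend` and the per-gate minimum of 5 decisions are assumptions taken from the Step 1 warning.

```shell
# Hypothetical helper: recommend COUNT PASS_RATE
# Prints a recommendation label based on the pass-rate bands only.
recommend() {
  awk -v n="$1" -v rate="$2" 'BEGIN {
    r = rate + 0
    if (n + 0 < 5)   print "Insufficient data"   # minimum of 5 is an assumption
    else if (r > 90) print "Raise"
    else if (r < 30) print "Lower (significantly misaligned)"
    else if (r < 70) print "Lower"
    else             print "No change"
  }'
}
```

With these bands, a gate with 12 decisions and a 95% pass rate yields `Raise`, while the same gate at 80% yields `No change`.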
Step 5: Generate Calibration Report
Write .claude/sprint/archive/sprint-N-calibration.md:
# Sprint N Calibration Report
**Generated:** YYYY-MM-DD
**Data range:** Sprint N (or all sprints)
**Total decisions:** N
## Per-Gate Analysis
### Architecture Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 75
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}
- **Rationale:** {explanation}
### Security Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 80
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}
### Code Review Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 70
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}
### Acceptance Gate
- **Decisions:** N
- **Note:** Never auto-approves (hardcoded escalation)
- **Score distribution tracked for visibility only**
## Outcome Accuracy (if available)
| Gate | Correct | False Pos | False Neg | Accuracy |
|------|---------|-----------|-----------|----------|
| architecture | N | N | N | X% |
| security | N | N | N | X% |
| code_review | N | N | N | X% |
## Mode Promotion Readiness
| Gate | Current | Ready for Next? | Blocking Factor |
|------|---------|-----------------|-----------------|
| architecture | shadow | Yes/No | {reason if no} |
| security | shadow | Yes/No | {reason if no} |
| code_review | shadow | Yes/No | {reason if no} |
| acceptance | shadow | N/A | Always escalates |
## Recommended Actions
1. {action item — e.g., "Raise architecture threshold from 75 to 80"}
2. {action item — e.g., "Collect 15 more security gate decisions before calibrating"}
3. {action item — e.g., "Consider promoting code_review to advisory mode"}
Step 6: Update Calibration Log
Append to the calibration log section in .product/rice-scores.md:
| Sprint N | calibrate | Gate calibration: arch=X%(no change), sec=Y%(raise to Z), cr=W%(no change) | /calibrate analysis |
Step 7: Summary
Calibration complete for Sprint N
- Report: .claude/sprint/archive/sprint-N-calibration.md
- Decisions analyzed: N
- Recommendations: X changes proposed
- Mode promotions ready: Y gates

Next: Review recommendations and apply threshold changes if approved.
Inputs
| Parameter | Required | Type | Description | Default |
|---|---|---|---|---|
| $ARGUMENTS | No | number | Sprint number to analyze | Most recent |
Outputs
| Output | Type | Description |
|---|---|---|
| Calibration report | .md file | .claude/sprint/archive/sprint-N-calibration.md |
| RICE log update | edit | Calibration entry in .product/rice-scores.md |
| Console summary | text | Summary with key recommendations |
Error Handling
| Error | Action |
|---|---|
| No history.jsonl | Report "No decisions logged yet. Run /implement with gate evaluation first." |
| < 5 decisions total | Produce report with "insufficient data" warnings on all gates |
| No outcomes available | Skip accuracy section, note outcome backfill needed |
| gates.yaml missing | Error: "Cannot read gate configuration" |
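The two file-related rows in the table above could be checked up front, before any statistics are computed. This is a sketch under assumptions: `preflight` is a hypothetical name, the default paths are the ones listed under Prerequisites, and the distinct return codes (1 and 2) are illustrative rather than part of the original command.

```shell
# Hypothetical helper: preflight [HISTORY_FILE] [GATES_FILE]
# Prints the matching error message and returns non-zero when a
# required file is missing; returns 0 when both are present.
preflight() {
  history_file="${1:-.cardo/history.jsonl}"
  gates_file="${2:-.claude/confidence/gates.yaml}"
  if [ ! -f "$history_file" ]; then
    echo "No decisions logged yet. Run /implement with gate evaluation first."
    return 1
  fi
  if [ ! -r "$gates_file" ]; then
    echo "Error: Cannot read gate configuration"
    return 2
  fi
  return 0
}
```

Calling `preflight || exit 1` at the start of Step 1 would stop the run before any partial report is written.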