Calibrate

You are the engineering-manager. Analyze confidence gate decision history from shadow mode and produce calibration recommendations for threshold adjustments.

When to Use

•After a sprint closes (called by /sprint-close or manually)
•When enough shadow mode decisions accumulate (minimum 10 per gate recommended)
•Before promoting a gate from shadow to advisory mode
•When gate scores seem consistently too high or too low

Prerequisites

•.cardo/history.jsonl exists with at least 1 decision entry
•lib/decision-log.sh and lib/score-gate.sh are available
•.claude/confidence/gates.yaml has current gate configuration

Step 1: Load Decision History

bash

source lib/decision-log.sh
source lib/score-gate.sh

•Read .cardo/history.jsonl
•If $ARGUMENTS specifies a sprint number, filter to that sprint:
bash
```
ENTRIES=$(query_decisions --sprint "$SPRINT_NUM")
```
•If no argument, use all available data
•Count total entries. If < 5 total, warn: "Insufficient data for reliable calibration. Minimum 5 decisions per gate recommended."
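
The counting step might look like the sketch below. It assumes each entry in .cardo/history.jsonl is a compact, single-line JSON object containing a `"gate":"<name>"` field, the same format the grep example in Step 2 relies on. `count_gate` and `count_total` are hypothetical helper names, not functions provided by lib/decision-log.sh.

```bash
# Hypothetical helpers (not part of lib/decision-log.sh).
# Assumes one compact JSON object per line, e.g. {"gate":"security","score":82}

# Print the number of entries logged for one gate.
count_gate() {
  local file=$1 gate=$2
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
  grep -c "\"gate\":\"$gate\"" "$file" || true
}

# Print the total number of non-empty entries; warn on stderr below 5.
count_total() {
  local file=$1 total
  total=$(grep -c . "$file" || true)
  total=${total:-0}
  if [ "$total" -lt 5 ]; then
    echo "Insufficient data for reliable calibration." >&2
  fi
  echo "$total"
}
```

For example, `count_gate .cardo/history.jsonl security` prints the number of security gate entries, and `count_total .cardo/history.jsonl` prints the overall count.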

Step 2: Per-Gate Statistics

For each of the 4 gates (architecture, security, code_review, acceptance), compute:

Metric	Calculation
Count	Number of decisions for this gate
Mean score	Average of all scores
Std deviation	Standard deviation of scores
Min / Max	Range of scores
Current threshold	From `gates.yaml`
Pass rate	% of scores >= threshold (hypothetical auto-approve rate)
Escalation rate	% of decisions that were `escalate`

Use awk for floating-point arithmetic, since Bash arithmetic is integer-only; this avoids external dependencies.

bash

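# Example: extract scores for the architecture gate.
# Assumes compact one-line JSON entries with no spaces around the colons.
SCORES=$(grep '"gate":"architecture"' .cardo/history.jsonl | grep -oE '"score":[0-9]+(\.[0-9]+)?' | cut -d: -f2)
# Mean to one decimal place; blank lines are skipped so an empty result yields 0.0.
MEAN=$(echo "$SCORES" | awk 'NF {sum+=$1; n++} END {if (n>0) printf "%.1f", sum/n; else print "0.0"}')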
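
The remaining metrics in the table can be gathered in a single awk pass. The sketch below is one way to do it, not the project's implementation: `gate_stats` is a hypothetical helper that reads one score per line on stdin and takes the threshold as its only argument. It uses the population standard deviation (dividing by n), which is an assumption, since the table does not say which variant is intended.

```bash
# Hypothetical helper: summarize one gate's scores in a single awk pass.
# stdin: one numeric score per line; $1: the gate threshold.
# Prints: count mean stddev min max pass_rate(%)
# Assumption: population standard deviation (divide by n).
gate_stats() {
  awk -v t="$1" '
    NF {
      x = $1 + 0
      n++; sum += x; sumsq += x * x
      if (n == 1 || x < min) min = x
      if (n == 1 || x > max) max = x
      if (x >= t) pass++
    }
    END {
      if (n == 0) { print "0 0.0 0.0 0 0 0.0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against floating-point rounding
      printf "%d %.1f %.1f %g %g %.1f\n", n, mean, sqrt(var), min, max, 100 * pass / n
    }'
}
```

With the variable from the example above, `echo "$SCORES" | gate_stats 75` prints the six values on one line, which can then be split with `read`.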

Step 3: Outcome Analysis (if available)

Check for outcome-tagged decisions (where "outcome" is not null):

bash

# Keep only entries that carry a non-null "outcome" field.
OUTCOMES=$(grep '"outcome":' .cardo/history.jsonl | grep -v '"outcome":null' || true)

If outcomes exist, compute:

Metric	Meaning
Correct	Decision matched actual outcome
False positive	Auto-approved but should have been escalated
False negative	Escalated but would have been fine to auto-approve
Accuracy	correct / total outcomes

If no outcomes exist, note: "No outcome data available. Run outcome backfill after PR merges/reverts to enable accuracy tracking."
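
Once the three tallies are known, the accuracy figures follow directly. The sketch below assumes the counts have already been produced; it does not parse the "outcome" field, because the values that field can take are not specified here. `outcome_accuracy` is a hypothetical helper name.

```bash
# Hypothetical helper: derive accuracy and error rates from tallies.
# Arguments: correct, false-positive, and false-negative counts.
# Prints: total accuracy(%) false_positive_rate(%) false_negative_rate(%)
outcome_accuracy() {
  awk -v c="$1" -v nfp="$2" -v nfn="$3" 'BEGIN {
    total = c + nfp + nfn
    # Avoid dividing by zero when no outcomes have been tagged yet.
    if (total == 0) { print "0 n/a n/a n/a"; exit }
    printf "%d %.1f %.1f %.1f\n", total, 100 * c / total, 100 * nfp / total, 100 * nfn / total
  }'
}
```

For instance, `outcome_accuracy 18 1 1` reports 20 outcomes with 90.0% accuracy and 5.0% for each error type.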

Step 4: Threshold Recommendations

For each gate, recommend a threshold adjustment based on:

•Score clustering: if scores cluster well above the threshold, consider raising it (a tighter quality bar). If scores cluster below it, the threshold may be unrealistic.
•Pass rate targeting: aim for a 70-90% hypothetical pass rate (70-85% for promotion to autonomous mode).
  •Pass rate > 90% → threshold may be too low (recommend raising)
  •Pass rate 70-90% → threshold is well-calibrated
  •Pass rate < 70% → threshold may be too high (recommend lowering)
  •Pass rate < 30% → threshold is significantly misaligned
•Outcome accuracy (if available):
  •High false-positive rate → raise threshold
  •High false-negative rate → lower threshold
•Mode promotion readiness:
  •Shadow → Advisory requires: >= 20 decisions AND pass rate 70-90% AND no false positives
  •Advisory → Autonomous requires: >= 50 decisions AND accuracy > 95% AND pass rate 70-85%
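
The pass-rate bands and the shadow-to-advisory criteria listed above can be encoded as two small checks. This is a sketch under the thresholds stated in this step; `classify_pass_rate` and `ready_for_advisory` are hypothetical names, and the verdict strings are illustrative only.

```bash
# Hypothetical helpers encoding the pass-rate bands and the
# shadow -> advisory criteria from this step.

# Map a pass rate (percent) to a coarse verdict.
classify_pass_rate() {
  awk -v r="$1" 'BEGIN {
    if      (r > 90)  print "raise"       # threshold may be too low
    else if (r >= 70) print "ok"          # well-calibrated
    else if (r >= 30) print "lower"       # threshold may be too high
    else              print "misaligned"  # significantly misaligned
  }'
}

# Exit 0 when a gate meets the shadow -> advisory criteria:
# >= 20 decisions, pass rate 70-90%, and zero false positives.
ready_for_advisory() {
  local count=$1 pass_rate=$2 false_pos=$3
  awk -v n="$count" -v r="$pass_rate" -v fp="$false_pos" 'BEGIN {
    exit !(n >= 20 && r >= 70 && r <= 90 && fp == 0)
  }'
}
```

A caller might write `if ready_for_advisory 25 80 0; then echo "Yes"; else echo "No"; fi` to fill the readiness column of the report.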

Produce a recommendation for each gate:

•No change — threshold is well-calibrated
•Raise to N — too many would-be auto-approves
•Lower to N — threshold is unrealistically high
•Insufficient data — need more decisions before recommending

Step 5: Generate Calibration Report

Write the report to .claude/sprint/archive/sprint-N-calibration.md, using this template:

markdown

# Sprint N Calibration Report

**Generated:** YYYY-MM-DD
**Data range:** Sprint N (or all sprints)
**Total decisions:** N

## Per-Gate Analysis

### Architecture Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 75
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N | Insufficient data}
- **Rationale:** {explanation}

### Security Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 80
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N | Insufficient data}
- **Rationale:** {explanation}

### Code Review Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 70
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N | Insufficient data}
- **Rationale:** {explanation}

### Acceptance Gate
- **Decisions:** N
- **Note:** Never auto-approves (hardcoded escalation)
- **Score distribution tracked for visibility only**

## Outcome Accuracy (if available)
| Gate | Correct | False Pos | False Neg | Accuracy |
|------|---------|-----------|-----------|----------|
| architecture | N | N | N | X% |
| security | N | N | N | X% |
| code_review | N | N | N | X% |

## Mode Promotion Readiness
| Gate | Current | Ready for Next? | Blocking Factor |
|------|---------|-----------------|-----------------|
| architecture | shadow | Yes/No | {reason if no} |
| security | shadow | Yes/No | {reason if no} |
| code_review | shadow | Yes/No | {reason if no} |
| acceptance | shadow | N/A | Always escalates |

## Recommended Actions
1. {action item — e.g., "Raise architecture threshold from 75 to 80"}
2. {action item — e.g., "Collect 15 more security gate decisions before calibrating"}
3. {action item — e.g., "Consider promoting code_review to advisory mode"}

Step 6: Update Calibration Log

Append to the calibration log section in .product/rice-scores.md:

markdown

| Sprint N | calibrate | Gate calibration: arch=X%(no change), sec=Y%(raise to Z), cr=W%(no change) | /calibrate analysis |
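
One way to produce this row is a small printf wrapper. The sketch below is illustrative: `calibration_log_row` is a hypothetical helper, and its arguments are the sprint number followed by the per-gate summaries for architecture, security, and code review.

```bash
# Hypothetical helper: format one calibration log row.
# Arguments: sprint number, then the arch, sec, and cr summaries.
calibration_log_row() {
  local sprint=$1 arch=$2 sec=$3 cr=$4
  # The summaries are passed as arguments, so any % in them is printed literally.
  printf '| Sprint %s | calibrate | Gate calibration: arch=%s, sec=%s, cr=%s | /calibrate analysis |\n' \
    "$sprint" "$arch" "$sec" "$cr"
}
```

Redirecting the output with `>> .product/rice-scores.md` only lands in the right place if the calibration log section is the last table in that file; otherwise the row has to be inserted into that section by other means.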

Step 7: Summary

code

Calibration complete for Sprint N
  - Report: .claude/sprint/archive/sprint-N-calibration.md
  - Decisions analyzed: N
  - Recommendations: X changes proposed
  - Mode promotions ready: Y gates

Next: Review recommendations and apply threshold changes if approved.

Inputs

Parameter	Required	Type	Description	Default
`$ARGUMENTS`	No	number	Sprint number to analyze	All available data

Outputs

Output	Type	Description
Calibration report	`.md` file	`.claude/sprint/archive/sprint-N-calibration.md`
RICE log update	edit	Calibration entry in `.product/rice-scores.md`
Console summary	text	Summary with key recommendations

Error Handling

Error	Action
No history.jsonl	Report "No decisions logged yet. Run /implement with gate evaluation first."
< 5 decisions total	Produce report with "insufficient data" warnings on all gates
No outcomes available	Skip accuracy section, note outcome backfill needed
gates.yaml missing	Error: "Cannot read gate configuration"
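
The two file-level errors in the table above could be caught before any analysis runs. The following is a minimal sketch, assuming the paths named in Prerequisites as defaults; `preflight` is a hypothetical function, not part of lib/decision-log.sh or lib/score-gate.sh.

```bash
# Hypothetical preflight check for the file-level errors in the table.
# Arguments (optional): history file path, gate configuration path.
preflight() {
  local history=${1:-.cardo/history.jsonl}
  local gates=${2:-.claude/confidence/gates.yaml}
  if [ ! -f "$history" ]; then
    echo "No decisions logged yet. Run /implement with gate evaluation first." >&2
    return 1
  fi
  if [ ! -r "$gates" ]; then
    echo "Error: Cannot read gate configuration" >&2
    return 1
  fi
  return 0
}
```

Calling `preflight || exit 1` at the start of Step 1 would stop the run with the matching message.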