AgentSkillsCN

calibrate

分析影子模式下的决策历史,提出调整门限值的建议

SKILL.md
--- frontmatter
name: calibrate
description: Analyze shadow mode decision history and recommend gate threshold adjustments
argument-hint: "[sprint-number]" (optional — defaults to most recent sprint)

Calibrate

You are the engineering-manager. Analyze confidence gate decision history from shadow mode and produce calibration recommendations for threshold adjustments.

When to Use

  • After a sprint closes (called by /sprint-close or manually)
  • When enough shadow mode decisions accumulate (minimum 10 per gate recommended)
  • Before promoting a gate from shadow to advisory mode
  • When gate scores seem consistently too high or too low

Prerequisites

  • .cardo/history.jsonl exists with at least 1 decision entry
  • lib/decision-log.sh and lib/score-gate.sh are available
  • .claude/confidence/gates.yaml has current gate configuration

Step 1: Load Decision History

bash
source lib/decision-log.sh
source lib/score-gate.sh
  1. Read .cardo/history.jsonl
  2. If $ARGUMENTS specifies a sprint number, filter to that sprint:
    bash
    ENTRIES=$(query_decisions --sprint $SPRINT_NUM)
    
  3. If no argument, use all available data
  4. Count total entries. If < 5 total, warn: "Insufficient data for reliable calibration. Minimum 5 decisions per gate recommended."

Step 2: Per-Gate Statistics

For each of the 4 gates (architecture, security, code_review, acceptance), compute:

MetricCalculation
CountNumber of decisions for this gate
Mean scoreAverage of all scores
Std deviationStandard deviation of scores
Min / MaxRange of scores
Current thresholdFrom gates.yaml
Pass rate% of scores >= threshold (hypothetical auto-approve rate)
Escalation rate% of decisions that were escalate

Use awk for floating-point arithmetic — no external dependencies.

bash
# Example: extract scores for architecture gate
SCORES=$(grep '"gate":"architecture"' .cardo/history.jsonl | grep -oE '"score":[0-9]+' | cut -d: -f2)
MEAN=$(echo "$SCORES" | awk '{sum+=$1; n++} END {if(n>0) printf "%.1f", sum/n; else print "0"}')

Step 3: Outcome Analysis (if available)

Check for outcome-tagged decisions (where "outcome" is not null):

bash
OUTCOMES=$(grep -v '"outcome":null' .cardo/history.jsonl || true)

If outcomes exist, compute:

MetricMeaning
CorrectDecision matched actual outcome
False positiveAuto-approved but should have been escalated
False negativeEscalated but would have been fine to auto-approve
Accuracycorrect / total outcomes

If no outcomes exist, note: "No outcome data available. Run outcome backfill after PR merges/reverts to enable accuracy tracking."

Step 4: Threshold Recommendations

For each gate, recommend a threshold adjustment based on:

  1. Score clustering — if scores cluster well above threshold, consider raising it (tighter quality bar). If scores cluster below, the threshold may be unrealistic.

  2. Pass rate targeting — aim for 70-85% hypothetical pass rate:

    • Pass rate > 90% → threshold may be too low (recommend raising)
    • Pass rate 70-90% → threshold is well-calibrated
    • Pass rate < 70% → threshold may be too high (recommend lowering)
    • Pass rate < 30% → threshold is significantly misaligned
  3. Outcome accuracy (if available):

    • High false-positive rate → raise threshold
    • High false-negative rate → lower threshold
  4. Mode promotion readiness:

    • Shadow → Advisory requires: >= 20 decisions AND pass rate 70-90% AND no false positives
    • Advisory → Autonomous requires: >= 50 decisions AND accuracy > 95% AND pass rate 70-85%

Produce a recommendation for each gate:

  • No change — threshold is well-calibrated
  • Raise to N — too many would-be auto-approves
  • Lower to N — threshold is unrealistically high
  • Insufficient data — need more decisions before recommending

Step 5: Generate Calibration Report

Write .claude/sprint/archive/sprint-N-calibration.md:

markdown
# Sprint N Calibration Report

**Generated:** YYYY-MM-DD
**Data range:** Sprint N (or all sprints)
**Total decisions:** N

## Per-Gate Analysis

### Architecture Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 75
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}
- **Rationale:** {explanation}

### Security Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 80
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}

### Code Review Gate
- **Decisions:** N
- **Score range:** min — max (mean ± stddev)
- **Current threshold:** 70
- **Hypothetical pass rate:** X%
- **Recommendation:** {No change | Raise to N | Lower to N}

### Acceptance Gate
- **Decisions:** N
- **Note:** Never auto-approves (hardcoded escalation)
- **Score distribution tracked for visibility only**

## Outcome Accuracy (if available)
| Gate | Correct | False Pos | False Neg | Accuracy |
|------|---------|-----------|-----------|----------|
| architecture | N | N | N | X% |
| security | N | N | N | X% |
| code_review | N | N | N | X% |

## Mode Promotion Readiness
| Gate | Current | Ready for Next? | Blocking Factor |
|------|---------|-----------------|-----------------|
| architecture | shadow | Yes/No | {reason if no} |
| security | shadow | Yes/No | {reason if no} |
| code_review | shadow | Yes/No | {reason if no} |
| acceptance | shadow | N/A | Always escalates |

## Recommended Actions
1. {action item — e.g., "Raise architecture threshold from 75 to 80"}
2. {action item — e.g., "Collect 15 more security gate decisions before calibrating"}
3. {action item — e.g., "Consider promoting code_review to advisory mode"}

Step 6: Update Calibration Log

Append to the calibration log section in .product/rice-scores.md:

markdown
| Sprint N | calibrate | Gate calibration: arch=X%(no change), sec=Y%(raise to Z), cr=W%(no change) | /calibrate analysis |

Step 7: Summary

code
Calibration complete for Sprint N
  - Report: .claude/sprint/archive/sprint-N-calibration.md
  - Decisions analyzed: N
  - Recommendations: X changes proposed
  - Mode promotions ready: Y gates

Next: Review recommendations and apply threshold changes if approved.

Inputs

ParameterRequiredTypeDescriptionDefault
$ARGUMENTSNonumberSprint number to analyzeMost recent

Outputs

OutputTypeDescription
Calibration report.md file.claude/sprint/archive/sprint-N-calibration.md
RICE log updateeditCalibration entry in .product/rice-scores.md
Console summarytextSummary with key recommendations

Error Handling

ErrorAction
No history.jsonlReport "No decisions logged yet. Run /implement with gate evaluation first."
< 5 decisions totalProduce report with "insufficient data" warnings on all gates
No outcomes availableSkip accuracy section, note outcome backfill needed
gates.yaml missingError: "Cannot read gate configuration"