Calibrate - Post-Launch AI Feature Calibration
Core Philosophy
Calibration happens after launch, not before.
The mistake: Building elaborate systems to perfectly calibrate AI behavior before launch. The reality: You learn what quality means by shipping to users and seeing what they actually need.
The Calibration Loop:
- Deploy at current agency level
- Monitor performance in prod
- Analyze and learn
- Calibrate system
- Test changes
- Consider agency increase
- Repeat
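A minimal sketch of one pass through this loop, assuming a hypothetical `run_cycle` callback that wraps your own deploy/monitor/analyze tooling. The class, function, and field names below are illustrative, not an existing API; the 4-week stability window mirrors the promotion checklist in Flow 3.

```python
from dataclasses import dataclass

@dataclass
class CycleResult:
    """Outcome of one monitoring/analysis pass (fields are illustrative)."""
    new_error_patterns: int  # error patterns first observed this period
    weeks_stable: int        # consecutive weeks without a quality regression

def calibration_cycle(agency_level: int, run_cycle) -> int:
    """One loop iteration: deploy/monitor/analyze, calibrate, test, maybe promote."""
    result: CycleResult = run_cycle(agency_level)  # deploy + monitor + analyze
    if result.new_error_patterns:
        # Calibrate: adjust prompts, context, or guardrails, then re-run evals (test).
        pass
    # Consider an agency increase only after sustained stability (see Flow 3).
    if result.weeks_stable >= 4 and result.new_error_patterns == 0:
        agency_level += 1
    return agency_level
```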
Entry Point
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Calibration happens after launch, not before.
What do you need?
1. Document error patterns
→ Analyze failures, categorize, plan fixes
2. Review eval performance
→ Are evals catching real issues? Missing patterns?
3. Agency promotion decision
→ Is this feature ready for more autonomy?
4. Quick calibration check
→ Is the system behaving as expected?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Parse intent from context:
- If user mentions "errors" or "failures" or "bugs" → Flow 1
- If user mentions "evals" or "tests" or "coverage" → Flow 2
- If user mentions "promote" or "increase" or "V2" → Flow 3
- If user mentions "check" or "status" or "quick" → Flow 4
Command-line shortcuts:
- `/calibrate` → Show entry menu
- `/calibrate --errors` → Flow 1 (error patterns)
- `/calibrate --evals` → Flow 2 (eval review)
- `/calibrate --promote` → Flow 3 (agency promotion)
- `/calibrate --quick` → Flow 4 (quick check)
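A rough sketch of how a skill implementation might route both the free-text intents and the command-line flags above. The function names (`route_from_text`, `route_from_flags`) and the mapping structure are assumptions; the keywords and flags themselves come from the lists above.

```python
import argparse

KEYWORD_FLOWS = {
    1: ("errors", "failures", "bugs"),
    2: ("evals", "tests", "coverage"),
    3: ("promote", "increase", "v2"),
    4: ("check", "status", "quick"),
}

def route_from_text(message: str) -> int | None:
    """Map free-text intent to a flow number; None means show the entry menu."""
    text = message.lower()
    for flow, keywords in KEYWORD_FLOWS.items():
        if any(word in text for word in keywords):
            return flow
    return None

def route_from_flags(argv: list[str]) -> int | None:
    """Map /calibrate flags to a flow number; None means show the entry menu."""
    parser = argparse.ArgumentParser(prog="/calibrate")
    parser.add_argument("--errors", action="store_true")
    parser.add_argument("--evals", action="store_true")
    parser.add_argument("--promote", action="store_true")
    parser.add_argument("--quick", action="store_true")
    args = parser.parse_args(argv)
    for flow, flag in enumerate(("errors", "evals", "promote", "quick"), start=1):
        if getattr(args, flag):
            return flow
    return None
```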
Flow 1: Document Error Patterns
Step 1: Gather Error Data
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN DOCUMENTATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's catalog what's going wrong.

Where are you seeing errors?
• User feedback / complaints
• Support tickets
• Monitoring alerts
• Manual review
• User corrections / overrides
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
- •"What specific failures have you observed?"
- •"How often does this happen? (rare, occasional, frequent)"
- •"What's the impact when it fails?"
- •"Are there patterns in WHEN it fails?"
Step 2: Categorize Errors
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR CATEGORIES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Common error categories for AI features:
• Hallucination - AI makes things up
• Wrong context - AI misreads the situation
• Tone mismatch - Output style is wrong
• Scope creep - AI goes beyond boundaries
• Missing information - AI lacks needed context
• Confidence miscalibration - Too certain or uncertain
• Edge case - Scenario not covered
• Adversarial - Intentional misuse
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 3: Analyze Root Causes
For each error pattern, determine:
- Likely reason: Why is this happening?
- Potential fix: Prompt change? Context? Guardrail? Training data?
- Priority: P1 (critical), P2 (important), P3 (nice to fix)
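One way to capture each finding before filling in the output table below is a small record per pattern. A minimal sketch; the class and field names are illustrative, and the category values mirror Step 2.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorCategory(Enum):
    HALLUCINATION = "Hallucination"
    WRONG_CONTEXT = "Wrong context"
    TONE_MISMATCH = "Tone mismatch"
    SCOPE_CREEP = "Scope creep"
    MISSING_INFORMATION = "Missing information"
    CONFIDENCE_MISCALIBRATION = "Confidence miscalibration"
    EDGE_CASE = "Edge case"
    ADVERSARIAL = "Adversarial"

@dataclass
class ErrorPattern:
    description: str          # what users actually see going wrong
    category: ErrorCategory
    likely_reason: str        # why it is happening
    potential_fix: str        # prompt change, added context, guardrail, data
    priority: str             # "P1", "P2", or "P3"
    frequency: str            # "rare", "occasional", "frequent"
    add_to_evals: bool = True # most patterns should become regression tests
```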
Output: Error Pattern Table
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Analysis Date: [date]
Data Source: [where errors were observed]

| Error Pattern | Category | Likely Reason | Potential Fix | Priority |
|---------------|----------|---------------|---------------|----------|
| [description] | [type]   | [why]         | [how to fix]  | P1       |
| [description] | [type]   | [why]         | [how to fix]  | P2       |
| [description] | [type]   | [why]         | [how to fix]  | P3       |

PATTERN ANALYSIS:
- Most common category: [X]
- Emerging pattern: [Y]
- Regression from last period: [Z]

RECOMMENDED ACTIONS:
1. [P1 action]
2. [P2 action]
3. [P3 action]

ADD TO EVALS:
- [ ] Add test case for [error pattern 1]
- [ ] Add test case for [error pattern 2]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 2: Review Eval Performance
Step 1: Current Eval State
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL PERFORMANCE REVIEW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's see if your evals are working.

Current state:
• How many test cases do you have?
• What's your pass rate?
• When did you last update evals?
• Are you seeing failures in prod that evals missed?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
- •"Are current evals catching real issues?"
- •"What new patterns have emerged since launch?"
- •"Are you passing 100%?" (If yes, evals may be too easy)
- •"How often do you run evals?"
Step 2: Gap Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL GAP ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Check coverage across categories:

□ Happy path - Common successful scenarios
□ Edge cases - Unusual but valid inputs
□ Adversarial - Intentional misuse attempts
□ Boundary - Out of scope handling
□ Regression - Previously fixed issues
□ Production errors - Real failures observed

Missing categories = gaps in coverage
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
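If your eval suite is stored with a category label per case, the gap check can be mechanical. A sketch under that assumption; `EvalCase`, the `category` field, `find_gaps`, and the minimum-per-category threshold are all illustrative.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = (
    "happy_path", "edge_case", "adversarial",
    "boundary", "regression", "production_error",
)

@dataclass
class EvalCase:
    name: str
    category: str  # one of CATEGORIES
    passing: bool

def find_gaps(cases: list[EvalCase], minimum_per_category: int = 3) -> dict[str, int]:
    """Return categories with fewer cases than the (arbitrary) minimum."""
    counts = Counter(case.category for case in cases)
    return {cat: counts.get(cat, 0) for cat in CATEGORIES
            if counts.get(cat, 0) < minimum_per_category}
```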
Output: Eval Assessment
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL ASSESSMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Test Cases: [count]
Pass Rate: [%]
Last Updated: [date]

COVERAGE ASSESSMENT:
| Category    | Coverage  | Status        |
|-------------|-----------|---------------|
| Happy path  | [X] cases | ✅ Good       |
| Edge cases  | [X] cases | ⚠️ Needs work |
| Adversarial | [X] cases | ❌ Missing    |
| Boundary    | [X] cases | ✅ Good       |
| Regression  | [X] cases | ⚠️ Needs work |

EFFECTIVENESS:
- Catching real issues? [Yes/No/Partially]
- False positive rate: [%]
- Prod errors missed: [list]

RECOMMENDATIONS:
1. Add [X] test cases for [gap]
2. Update [Y] tests that are stale
3. Remove [Z] tests that are redundant
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 3: Agency Promotion Decision
Promotion Checklist
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGENCY PROMOTION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Considering: V[current] → V[target]

Let's verify readiness.

QUALITY METRICS
□ Accuracy/quality stable for 4+ weeks?
□ No new error patterns emerging?
□ User corrections decreasing?

SAFETY & TRUST
□ Confident in known failure modes?
□ Override mechanism working well?
□ User feedback positive?

OPERATIONAL READINESS
□ Monitoring in place for new level?
□ Rollback plan ready?
□ Team aligned on promotion?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
For each item, ask:
- •"What's your evidence?"
- •"How long have you observed this?"
- •"What would change your answer?"
Output: Promotion Verdict
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROMOTION VERDICT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Current: V[n] | Target: V[n+1]

VERDICT: [READY ✅ / NOT READY ❌ / NEEDS WORK ⚠️]

✅ PASSING:
- [criteria met with evidence]
- [criteria met with evidence]

❌ BLOCKING:
- [criteria not met + what's needed]
- [criteria not met + what's needed]

⚠️ RISKS IF PROMOTED NOW:
- [risk + mitigation needed]

RECOMMENDATION:
[Clear recommendation with reasoning]

NEXT STEPS:
1. [action if ready]
2. [action if not ready]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Flow 4: Quick Calibration Check
Health Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUICK CALIBRATION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fast health check for [feature name]

• Current quality metric: [X%]
• Trend: [↑ improving / → stable / ↓ degrading]
• Any alerts triggered? [Y/N]
• User feedback signals: [positive/neutral/negative]
• Override rate: [X%] (is this expected?)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Output: Health Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATION STATUS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Agency Level: V[n]
Check Date: [date]

STATUS: [HEALTHY ✅ / ATTENTION ⚠️ / DEGRADED ❌]

METRICS:
| Metric            | Value | Trend | Status |
|-------------------|-------|-------|--------|
| Quality           | [X%]  | [→]   | ✅     |
| Override rate     | [X%]  | [↓]   | ✅     |
| User satisfaction | [X]   | [→]   | ⚠️     |

ALERTS:
- [any triggered alerts]

ACTION NEEDED:
- [none / specific action]

NEXT CHECK: [recommended cadence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
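A small sketch of how the status line could be rolled up from the quick-check metrics. The thresholds below are placeholders you would calibrate per feature, not recommendations; the function name and signature are assumptions.

```python
def health_status(quality: float, quality_trend: float,
                  override_rate: float, alerts: int) -> str:
    """Roll quick-check metrics up into HEALTHY / ATTENTION / DEGRADED.

    quality, override_rate: 0.0-1.0; quality_trend: change since last check;
    alerts: count of triggered monitoring alerts.
    """
    if alerts > 0 or quality_trend < -0.05:
        return "DEGRADED"
    if override_rate > 0.20 or quality < 0.90:
        return "ATTENTION"
    return "HEALTHY"
```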
Calibration Cadence
Weekly: Quick check (Flow 4)
- Is quality stable?
- Any new alerts?
- User feedback signals?
Monthly: Eval review (Flow 2)
- Are evals catching real issues?
- New patterns to add?
- Stale tests to update?
Quarterly: Deep calibration
- Error pattern analysis (Flow 1)
- Agency promotion consideration (Flow 3)
- Strategic calibration review
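If you want this cadence encoded somewhere executable (a runbook script or a scheduled reminder job), a minimal sketch; the mapping keys and flow identifiers are illustrative, assuming nothing about your scheduler.

```python
CALIBRATION_CADENCE = {
    "weekly":    ["flow_4_quick_check"],
    "monthly":   ["flow_2_eval_review"],
    "quarterly": ["flow_1_error_patterns",
                  "flow_3_promotion_decision",
                  "strategic_review"],
}

def due_flows(cadences_due: list[str]) -> list[str]:
    """Flows to run given which cadences have come due this period."""
    return [flow for cadence in cadences_due
            for flow in CALIBRATION_CADENCE.get(cadence, [])]
```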
Challenge Patterns
"Quality looks fine" → "What's your correction rate? What's the trend? Fine compared to what baseline?"
"No complaints" → "Are users actually using it? Are they silently working around it?"
"Ready to promote" → "Show me the data. How long has quality been stable? What failure modes have you validated?"
"Evals are passing" → "100% pass rate? That might mean evals are too easy. When did you last add new test cases?"
Integration with Other Commands
Before /calibrate:
- `/agency-ladder` - Define the ladder first
- `/spec --ai` - Ensure spec includes calibration plan
Related:
- `/ai-health-check` - Pre-launch validation
- `/start-evals` - Set up eval infrastructure
Attribution
Framework: CC/CD (Continuous Calibration/Continuous Development)
Source: Aishwarya Naresh Reganti & Kiriti Badam (Lenny's Newsletter)
Adaptation: Post-launch calibration workflows
Remember
- Calibration is continuous, not one-time
- Error patterns are the richest signal
- User corrections show where the AI fails
- Evals must evolve with the product
- Promotion requires proven reliability
- A 100% pass rate usually means evals are too easy