AgentSkillsCN

evaluator-optimizer

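Iterative improvement using evaluator-optimizer feedback loops. One of Anthropic's six composable patterns for building effective agents.

Use when:
- Output quality is critical and iteration keeps improving the result
- You have clear evaluation criteria
- The user asks to "refine", "improve", or "iterate"
- You need to produce complex outputs such as documentation, designs, or algorithms
- Multi-round optimization fits within acceptable latency (a trade-off between quality and response time)

Trigger phrases: iterate, refine, improve quality, feedback loop, evaluate and optimize, keep improving.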

SKILL.md
--- frontmatter
name: evaluator-optimizer
description: |
  Iterative improvement pattern using evaluator-optimizer feedback loops.
  One of Anthropic's 6 composable patterns for building effective agents.

  Use when:
  - Output quality matters and iteration improves results
  - Clear evaluation criteria exist
  - User wants "refine", "improve", "iterate"
  - Complex outputs like documentation, designs, or algorithms
  - Multi-round optimization is acceptable (latency trade-off)

  Trigger phrases: iterate, refine, improve quality, feedback loop, evaluate and optimize, keep improving
allowed-tools: Read, Write, Edit, Glob, Grep, Task, TodoWrite, AskUserQuestion
model: sonnet
user-invocable: true

Evaluator-Optimizer Pattern

An iterative improvement pattern where one agent generates output while another evaluates and provides feedback, continuing until quality criteria are met.

From Building Effective Agents:

"One LLM generates responses while another evaluates and provides feedback iteratively. Effective for literary translation refinement and multi-round search tasks requiring judgment on whether further investigation is warranted."

Core Concept

code
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌───────────┐    output    ┌───────────┐                      │
│  │           │─────────────▶│           │                      │
│  │ GENERATOR │              │ EVALUATOR │                      │
│  │  (Agent)  │◀─────────────│  (Agent)  │                      │
│  │           │   feedback   │           │                      │
│  └───────────┘              └───────────┘                      │
│       │                           │                            │
│       │ (if approved)             │                            │
│       ▼                           │                            │
│  ┌─────────┐                      │                            │
│  │ FINAL   │◀─────────────────────┘                            │
│  │ OUTPUT  │     (pass/fail + score)                           │
│  └─────────┘                                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to Use

Good Fit

| Scenario | Why Evaluator-Optimizer Works |
|---|---|
| Documentation | Clarity, completeness can be iteratively improved |
| Code refactoring | Quality metrics guide optimization |
| API design | Usability and consistency refinable |
| UI/UX copy | Tone, clarity, engagement tunable |
| Complex algorithms | Performance/correctness verifiable |

Poor Fit

| Scenario | Why Not |
|---|---|
| Simple CRUD | Overhead not justified |
| Time-critical tasks | Iteration adds latency |
| No clear criteria | Can't evaluate effectively |
| Binary correct/incorrect | Single pass sufficient |

Implementation Pattern

1. Generator Agent

Creates initial output based on requirements.

markdown
## Generator Task

Create: [description of output]
Requirements: [specific requirements]
Format: [expected format]

## Output Format

Provide your output with clear sections:
- Main content
- Self-assessment of quality
- Areas of uncertainty

2. Evaluator Agent

Reviews output against criteria and provides actionable feedback.

markdown
## Evaluator Task

Review the following output:
[generator output]

Evaluation Criteria:
1. Correctness: Does it meet the requirements?
2. Completeness: Are all aspects covered?
3. Clarity: Is it understandable?
4. Quality: Does it follow best practices?

## Output Format

Score: [0-100]
Pass: [true/false] (threshold: 80)

Strengths:
- [what works well]

Issues:
- [Issue 1]: [specific problem]
  Fix: [actionable improvement]
- [Issue 2]: [specific problem]
  Fix: [actionable improvement]

Verdict: [PASS / NEEDS_REVISION]

3. Optimization Loop

python
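# Conceptual flow: generator and evaluator stand in for the two subagents
def optimize(generator, evaluator, requirements, criteria,
             threshold=80, max_iterations=3):
    feedback = None  # no feedback exists before the first evaluation
    best_output, best_evaluation = None, None

    for _ in range(max_iterations):
        # Generate (or revise, once feedback is available)
        output = generator.create(requirements, feedback)

        # Evaluate
        evaluation = evaluator.review(output, criteria)

        # Remember the best attempt in case the iteration limit is reached
        if best_evaluation is None or evaluation.score > best_evaluation.score:
            best_output, best_evaluation = output, evaluation

        if evaluation.score >= threshold:
            break

        # Feed the issues back into the next generation round
        feedback = evaluation.issues

    return best_output, best_evaluation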

Subagent Configuration

Generator Subagent

yaml
name: generator
model: sonnet  # or inherit for complex tasks
tools: Read, Write, Edit, Glob, Grep
permissionMode: acceptEdits

Evaluator Subagent

yaml
name: evaluator
model: sonnet  # Use same or stronger model
tools: Read, Glob, Grep  # Read-only evaluation
permissionMode: plan
disallowedTools: Write, Edit, Bash

Practical Examples

Example 1: Documentation Optimization

markdown
## Iteration 1

### Generator Output
/**
 * Creates a user
 * @param data - user data
 */
function createUser(data: UserData): User { ... }

### Evaluator Feedback
Score: 65
Issues:
- Missing: Parameter documentation incomplete
- Missing: Return value description
- Missing: Error conditions
- Missing: Example usage

## Iteration 2

### Generator Output (Improved)
/**
 * Creates a new user account with the provided data.
 *
 * @param data - The user creation data
 * @param data.email - Valid email address (must be unique)
 * @param data.password - Password (min 8 chars, hashed before storage)
 * @param data.name - Display name (optional)
 *
 * @returns The created User object with generated ID
 *
 * @throws {DuplicateEmailError} If email already exists
 * @throws {ValidationError} If data fails validation
 *
 * @example
 * const user = await createUser({
 *   email: 'john@example.com',
 *   password: 'securePass123'
 * });
 */
function createUser(data: UserData): Promise<User> { ... }

### Evaluator Feedback
Score: 92
Verdict: PASS

Example 2: Algorithm Optimization

markdown
## Iteration 1

### Generator Output
function findDuplicates(arr) {
  const duplicates = [];
  for (let i = 0; i < arr.length; i++) {
    for (let j = i + 1; j < arr.length; j++) {
      if (arr[i] === arr[j] && !duplicates.includes(arr[i])) {
        duplicates.push(arr[i]);
      }
    }
  }
  return duplicates;
}

### Evaluator Feedback
Score: 55
Issues:
- Performance: O(n²) time from the nested loops, plus an O(n) `includes` scan on each match
- Correctness: Works but inefficient
Fix: Use Set or Map for O(n) solution

## Iteration 2

### Generator Output (Improved)
function findDuplicates(arr) {
  const seen = new Set();
  const duplicates = new Set();

  for (const item of arr) {
    if (seen.has(item)) {
      duplicates.add(item);
    } else {
      seen.add(item);
    }
  }

  return [...duplicates];
}

### Evaluator Feedback
Score: 95
Verdict: PASS
Strengths:
- O(n) time complexity
- O(n) space complexity (optimal for this problem)
- Clean, readable implementation

Integration with Plan/Review/Implement Workflow

The pattern applies at two points in the workflow: during /spec-plan (Architecture Design) and during /spec-implement (Quality Review).

Architecture Design Optimization (/spec-plan)

code
1. code-architect generates initial design
2. Evaluator checks against:
   - Codebase patterns
   - Scalability requirements
   - Security considerations
3. Iterate until design scores >= 85

Quality Review Optimization (/spec-implement)

code
1. Implementation complete
2. qa-engineer evaluates test coverage
3. If coverage < 80%, iterate:
   - Identify gaps
   - Add tests
   - Re-evaluate

Stopping Conditions

Must Stop When

| Condition | Action |
|---|---|
| Score >= threshold | Accept output |
| Max iterations reached | Return best attempt with warning |
| Evaluator stuck in loop | Break with human review request |
| Fundamental flaw detected | Escalate to user |
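
The four stopping conditions can be sketched as a single loop. This is an illustrative outline under stated assumptions, not a definitive implementation: `generate` and `evaluate` are hypothetical callables standing in for the generator and evaluator subagents, "stuck" is assumed to mean the score failed to improve for `patience` consecutive rounds, and a fundamental flaw is assumed to arrive as a `fatal` flag on the review.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Review:
    score: int
    issues: list[str] = field(default_factory=list)
    fatal: bool = False  # assumption: the evaluator flags fundamental flaws


def optimize(
    generate: Callable[[list[str]], str],
    evaluate: Callable[[str], Review],
    threshold: int = 80,
    max_iterations: int = 3,
    patience: int = 2,
) -> tuple[str, Review, str]:
    """Return (output, its review, stop reason)."""
    best_output, best_review = "", Review(score=-1)
    feedback: list[str] = []
    stalled = 0  # consecutive rounds without a new best score

    for _ in range(max_iterations):
        output = generate(feedback)
        review = evaluate(output)

        # Fundamental flaw detected: escalate instead of iterating.
        if review.fatal:
            return output, review, "escalate_to_user"

        if review.score > best_review.score:
            best_output, best_review, stalled = output, review, 0
        else:
            stalled += 1

        # Score >= threshold: accept the output.
        if review.score >= threshold:
            return output, review, "accepted"
        # No progress for `patience` rounds: ask for human review.
        if stalled >= patience:
            return best_output, best_review, "needs_human_review"

        feedback = review.issues

    # Max iterations reached: return the best attempt seen so far.
    return best_output, best_review, "max_iterations_reached"
```

Returning the stop reason alongside the output lets the caller attach the warning or the review request that the table calls for.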

Recommended Limits

| Context | Max Iterations | Score Threshold |
|---|---|---|
| Documentation | 3 | 80 |
| Code quality | 3 | 85 |
| Algorithm | 4 | 90 |
| Security-critical | 5 | 95 |
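
The recommended limits lend themselves to a small lookup table. The sketch below is one possible encoding. The `Limits` and `LIMITS` names, the context keys, and the fallback to the documentation row are all assumptions made for illustration.

```python
from typing import NamedTuple


class Limits(NamedTuple):
    max_iterations: int
    threshold: int


# Values copied from the Recommended Limits table; the keys are illustrative.
LIMITS = {
    "documentation": Limits(3, 80),
    "code_quality": Limits(3, 85),
    "algorithm": Limits(4, 90),
    "security_critical": Limits(5, 95),
}


def limits_for(context: str) -> Limits:
    """Look up limits, falling back to the documentation row (an assumption)."""
    return LIMITS.get(context, LIMITS["documentation"])
```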

Evaluation Metrics

From Anthropic's "Demystifying evals for AI agents" engineering blog:

Key Metrics for Non-Deterministic Evaluation

| Metric | Formula | Use Case |
|---|---|---|
| pass@k | P(at least 1 success in k trials) | "Can it succeed?" |
| pass^k | P(all k trials succeed) | "Is it consistent?" |

Interpreting Metrics

code
pass@k = 1 - (1 - p)^k
pass^k = p^k
where p = per-trial success rate (trials assumed independent)

Example with p = 0.7:
- pass@1 = 0.70  (70% chance of success on single try)
- pass@3 = 0.97  (97% chance at least one succeeds)
- pass^3 = 0.34  (34% chance all three succeed)

Use pass@k for evaluating capability (can the agent do this task?).
Use pass^k for evaluating reliability (will the agent consistently do this?).
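
Both metrics can be computed directly from a per-trial success rate. The sketch below reproduces the p = 0.7 example, assuming independent trials with the same success rate. The function names `pass_at_k` and `pass_pow_k` are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


for k in (1, 3):
    print(f"k={k}: pass@k={pass_at_k(0.7, k):.2f}, pass^k={pass_pow_k(0.7, k):.2f}")
# k=1: pass@k=0.70, pass^k=0.70
# k=3: pass@k=0.97, pass^k=0.34
```

As k grows the two metrics diverge: pass@k approaches 1 while pass^k approaches 0, so an agent can look capable and unreliable at once.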

Three Types of Graders

| Grader Type | Pros | Cons | Best For |
|---|---|---|---|
| Code-based | Fast, cheap, objective | Brittle to valid variations | Format validation, syntax checks |
| Model-based | Flexible, scalable | Non-deterministic, needs calibration | Nuanced quality assessment |
| Human | Gold-standard quality | Expensive, slow | Final validation, edge cases |

Grader Selection Strategy

code
1. Start with code-based graders for objective criteria
   - JSON schema validation
   - Required field presence
   - Format compliance

2. Add model-based graders for subjective criteria
   - Code quality assessment
   - Documentation clarity
   - Design appropriateness

3. Reserve human graders for:
   - Calibrating model-based graders
   - Edge case evaluation
   - Final sign-off on critical outputs
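
A code-based grader for the first step can be very small. The sketch below checks required field presence and types on a JSON output using only the standard library. The `grade_json` name and the shape of the `required` mapping are assumptions for illustration, and a full JSON Schema validator may be a better fit for larger schemas.

```python
import json


def grade_json(raw: str, required: dict[str, type]) -> tuple[bool, list[str]]:
    """Code-based grader: raw must be a JSON object with typed required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value must be an object"]

    problems = []
    for name, expected in required.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    # The output passes only when no problems were collected.
    return not problems, problems
```

Returning the list of problems alongside the verdict gives the generator actionable feedback without a model call.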

Evaluation Best Practices

| Practice | Description |
|---|---|
| Start early | Begin with 20-50 tasks from real failures, not 100+ perfect tasks |
| Grade outcomes | Evaluate results, not specific solution paths |
| Avoid class imbalance | Balance positive and negative cases |
| Read transcripts | Regularly verify graders measure what matters |
| Monitor saturation | Add harder tasks when current ones are consistently passed |

Anti-Patterns

| Anti-Pattern | Why Bad | Instead |
|---|---|---|
| No stopping condition | Infinite loop risk | Set max iterations |
| Generator evaluates its own output | Bias toward approval | Use separate agent |
| Vague criteria | Can't converge | Define specific rubrics |
| Ignoring feedback | No improvement | Generator must address issues |
| Over-optimizing | Diminishing returns | Accept "good enough" |

Advanced: Multi-Evaluator

For complex outputs, use multiple specialized evaluators:

code
Generator Output
      │
      ├──▶ Correctness Evaluator
      │
      ├──▶ Style Evaluator
      │
      ├──▶ Performance Evaluator
      │
      └──▶ Security Evaluator

Combined Score = weighted average
Feedback = aggregated from all evaluators
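
The combination step can be sketched as follows. The input shape (evaluator name mapped to a score and a list of issues) and the `combine` name are assumptions for illustration. The weights are whatever the caller considers appropriate and are normalized by their sum.

```python
def combine(
    results: dict[str, tuple[int, list[str]]],
    weights: dict[str, float],
) -> tuple[float, list[str]]:
    """Weighted-average score plus feedback tagged with the evaluator name."""
    total_weight = sum(weights[name] for name in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    score = sum(weights[name] * s for name, (s, _) in results.items()) / total_weight
    feedback = [
        f"[{name}] {issue}"
        for name, (_, issues) in results.items()
        for issue in issues
    ]
    return score, feedback
```

A weighted average can hide a low score from a single evaluator, so a per-evaluator minimum (for example on security) may be worth adding on top of the combined threshold.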

Rules (L1 - Hard)

Critical for effective optimization loops.

  • ALWAYS define clear evaluation criteria before starting (otherwise cannot converge)
  • ALWAYS set maximum iteration limits (prevent infinite loops)
  • NEVER let generator evaluate its own output (bias toward approval)
  • NEVER ignore evaluation feedback in subsequent iterations

Defaults (L2 - Soft)

Important for quality results. Override with reasoning when appropriate.

  • Provide actionable feedback (not just "needs improvement")
  • Track iteration count and score progression
  • Return best attempt if max iterations reached
  • Use separate agent instances for generator and evaluator

Guidelines (L3)

Recommendations for better optimization.

  • Consider using multiple specialized evaluators for complex outputs
  • Prefer score thresholds of 80+ for production-quality outputs
  • Consider diminishing returns beyond 3-4 iterations