Agent Evaluation

Systematic evaluation of agent systems for quality assurance and continuous improvement.

When to Use

  • Setting up quality gates for agent deployments
  • Measuring agent performance after changes
  • Building test frameworks for agent behavior
  • Debugging inconsistent agent results
  • Comparing different agent configurations
  • Establishing baselines before optimization

Core Challenge

Agent evaluation differs fundamentally from standard software testing:

  1. Non-determinism - Agents may reach identical goals through different paths
  2. Context dependency - Failures emerge subtly based on conversation history
  3. No single correct answer - Multiple valid responses exist for most tasks
  4. Composite quality - Quality spans several dimensions, so no single number captures it

The 95% Finding

Research on agent evaluation identified three factors that together explain roughly 95% of performance variance:

| Factor | Variance Explained |
| --- | --- |
| Token usage | ~80% |
| Tool call frequency | ~10% |
| Model selection | ~5% |

Key insight: Context budgets matter more than architectural choices early in development. Optimize token usage before changing models or architectures.

Multi-Dimensional Rubric Framework

Don't reduce quality to a single score. Assess multiple dimensions:

Recommended Dimensions

| Dimension | What It Measures |
| --- | --- |
| Factual Accuracy | Claims match ground truth |
| Completeness | Output covers all requested aspects |
| Efficiency | Reasonable tool usage, no unnecessary steps |
| Process Quality | Followed best practices, clean execution |
| Code Quality | If code: passes tests, follows patterns, maintainable |
| Safety | No security issues, respects constraints |

Scoring Levels

For each dimension, define clear levels:

code
Excellent (1.0): Exceeds requirements, no issues
Good (0.8): Meets requirements, minor issues
Acceptable (0.6): Mostly meets requirements, some gaps
Poor (0.4): Significant gaps, partial completion
Failed (0.0): Did not meet requirements

Weight dimensions based on your use case. Safety might be pass/fail while completeness is graded.
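
The weighting described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the weights, the dimension keys, and the `overall_score` helper are assumptions made for the example. Safety is treated as a pass/fail gate and the remaining dimensions are averaged by weight.

```python
from typing import Dict, Optional

# Score levels taken from the rubric above.
LEVELS = {"excellent": 1.0, "good": 0.8, "acceptable": 0.6, "poor": 0.4, "failed": 0.0}


def overall_score(
    levels: Dict[str, str],
    weights: Dict[str, float],
    safety_passed: bool,
) -> Optional[float]:
    """Weighted average of the graded dimensions, or None if the safety gate fails."""
    if not safety_passed:
        return None  # pass/fail gate: a safety failure is never averaged away
    total_weight = sum(weights[d] for d in levels)
    weighted = sum(LEVELS[levels[d]] * weights[d] for d in levels)
    return round(weighted / total_weight, 3)


# Hypothetical weights for illustration; tune them for your own use case.
weights = {"factual_accuracy": 0.4, "completeness": 0.4, "efficiency": 0.2}
levels = {"factual_accuracy": "excellent", "completeness": "good", "efficiency": "acceptable"}

print(overall_score(levels, weights, safety_passed=True))
print(overall_score(levels, weights, safety_passed=False))
```

Returning `None` on a safety failure keeps a gated result from being mistaken for a low but valid score.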

Evaluation Methods

1. LLM-as-Judge

Use another LLM to evaluate agent outputs. Scales to large test sets.

Prompt structure:

markdown
## Task Description

{original_task}

## Ground Truth (if available)

{expected_outcome}

## Agent Output

{agent_response}

## Evaluation Criteria

- Factual accuracy: Are claims correct?
- Completeness: Are all aspects addressed?
- Process quality: Was execution clean?

Rate each dimension 0.0-1.0 with brief justification.

Limitations:

  • Can miss subtle hallucinations
  • May not catch domain-specific errors
  • Supplement with human review for critical systems
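
One way to wire the prompt structure above into code is sketched below. The JSON reply format, the `call_model` callable, and the stub model are assumptions made for this example; substitute whatever client your judge model actually uses. The parser rejects scores outside 0.0-1.0 so that a malformed judge reply fails loudly.

```python
import json
from typing import Callable, Dict

DIMENSIONS = ("factual_accuracy", "completeness", "process_quality")

JUDGE_TEMPLATE = """## Task Description

{original_task}

## Ground Truth (if available)

{expected_outcome}

## Agent Output

{agent_response}

## Evaluation Criteria

- Factual accuracy: Are claims correct?
- Completeness: Are all aspects addressed?
- Process quality: Was execution clean?

Rate each dimension 0.0-1.0 with brief justification.
"""

# Assumption: the judge is asked for JSON so the reply can be parsed reliably.
JSON_INSTRUCTION = (
    "Reply with a JSON object mapping factual_accuracy, completeness and "
    'process_quality to {"score": <number>, "why": "<justification>"}.'
)


def build_prompt(task: str, expected: str, response: str) -> str:
    """Fill the judge template and append the output-format instruction."""
    body = JUDGE_TEMPLATE.format(
        original_task=task, expected_outcome=expected, agent_response=response
    )
    return body + JSON_INSTRUCTION


def parse_scores(raw: str) -> Dict[str, float]:
    """Extract one score per dimension, rejecting out-of-range values."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        score = float(data[dim]["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{dim} score out of range: {score}")
        scores[dim] = score
    return scores


def judge(task: str, expected: str, response: str,
          call_model: Callable[[str], str]) -> Dict[str, float]:
    """Run one judgment; call_model is any function that maps prompt to reply text."""
    return parse_scores(call_model(build_prompt(task, expected, response)))


def stub_model(prompt: str) -> str:
    """Stand-in for a real judge model so the sketch runs offline."""
    return json.dumps({dim: {"score": 0.8, "why": "stub"} for dim in DIMENSIONS})


print(judge("Summarize the report", "Three key findings", "Two findings", stub_model))
```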

2. Human Evaluation

Essential for catching what automation misses:

  • Unusual hallucination patterns
  • System-level failures
  • Subtle biases
  • UX issues in agent responses

When to use:

  • Production sampling (random % of interactions)
  • New capability launches
  • After significant changes
  • Edge cases and failures
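
Production sampling for human review can be sketched as a deterministic hash-based selector, so the same interaction is always either in or out of the review queue. The function name and the ID format are hypothetical; only the standard library is used.

```python
import hashlib


def selected_for_review(interaction_id: str, rate: float) -> bool:
    """Return True for roughly `rate` of all interaction IDs, deterministically."""
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical interaction IDs; a 5% rate should pick roughly 500 of 10,000.
ids = [f"interaction-{i}" for i in range(10_000)]
picked = [i for i in ids if selected_for_review(i, rate=0.05)]
print(len(picked))
```

Hashing the ID instead of calling a random number generator makes the sample reproducible across services and reruns.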

3. End-State Evaluation

For agents that mutate state (file edits, deployments), evaluate the final state rather than the path taken to reach it:

python
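# Brittle: asserts the exact sequence of steps the agent took.
# (agent, expected_steps, and the helper functions below are placeholders.)
assert agent.steps == expected_steps

# Robust: asserts properties of the final state, whatever path produced it.
assert file_exists("output.ts")
assert tests_pass("output.ts")
assert no_lint_errors("output.ts")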
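
A self-contained version of the same idea is sketched below. The `fake_agent` function stands in for a real agent run, and the content check is a deliberately simple placeholder for running tests or a linter; both are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def check_end_state(workdir: Path) -> List[str]:
    """Return a list of end-state failures; an empty list means the check passed."""
    failures = []
    output = workdir / "output.ts"
    if not output.is_file():
        failures.append("output.ts is missing")
    elif "export" not in output.read_text(encoding="utf-8"):
        failures.append("output.ts exports nothing")
    return failures


def fake_agent(workdir: Path) -> None:
    """Stand-in for a real agent run; it only writes the expected file."""
    (workdir / "output.ts").write_text("export const answer = 42;\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    before = check_end_state(workdir)  # nothing written yet
    fake_agent(workdir)
    after = check_end_state(workdir)   # file exists and has an export

print(before, after)
```

Collecting failures in a list, instead of stopping at the first failed assertion, gives one report per run that covers every end-state problem.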

4. Regression Testing

Track metrics over time to catch degradation:

yaml
baseline:
  accuracy: 0.92
  completeness: 0.88
  efficiency: 0.85

current:
  accuracy: 0.89 # -0.03 - investigate
  completeness: 0.90 # +0.02 - good
  efficiency: 0.82 # -0.03 - acceptable variance
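
The comparison above can be automated with a per-metric tolerance, which is one way the same drop can be acceptable for efficiency yet worth investigating for accuracy. The tolerance values and the `find_regressions` helper are assumptions made for this sketch.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: Dict[str, float],
) -> List[str]:
    """Return the metrics whose drop from baseline exceeds their tolerance."""
    flagged = []
    for metric, base in baseline.items():
        drop = base - current[metric]
        # The small epsilon keeps floating-point noise from flagging a borderline drop.
        if drop > tolerance.get(metric, 0.0) + 1e-9:
            flagged.append(metric)
    return flagged


baseline = {"accuracy": 0.92, "completeness": 0.88, "efficiency": 0.85}
current = {"accuracy": 0.89, "completeness": 0.90, "efficiency": 0.82}
# Hypothetical tolerances: accuracy is held to a tighter bound than efficiency.
tolerance = {"accuracy": 0.02, "completeness": 0.02, "efficiency": 0.05}

print(find_regressions(baseline, current, tolerance))
```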

Test Set Design

Stratify by Complexity

| Level | Characteristics | % of Set |
| --- | --- | --- |
| Simple | Single tool, clear goal | 30% |
| Medium | Multiple tools, some ambiguity | 40% |
| Complex | Many tools, requires planning | 20% |
| Edge cases | Known failure modes | 10% |
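
The stratification above can be applied when drawing a test set from a larger pool of cases. The sketch below assumes each case is already labeled with its level; the pool contents and the `build_test_set` helper are hypothetical.

```python
import random
from typing import Dict, List

# Target mix from the table above.
MIX = {"simple": 0.30, "medium": 0.40, "complex": 0.20, "edge": 0.10}


def build_test_set(pool: Dict[str, List[str]], size: int, seed: int = 0) -> List[str]:
    """Draw a stratified sample; rounding may shift the total slightly for some sizes."""
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    selected: List[str] = []
    for level, share in MIX.items():
        count = round(size * share)
        if count > len(pool[level]):
            raise ValueError(f"not enough {level} cases: need {count}")
        selected.extend(rng.sample(pool[level], count))
    return selected


# Hypothetical pool with ten labeled cases per level.
pool = {level: [f"{level}-{i}" for i in range(10)] for level in MIX}
test_set = build_test_set(pool, size=20)
print(len(test_set))
```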

Source Test Cases From

  • Real user interactions (anonymized)
  • Known edge cases and failure modes
  • Adversarial examples
  • Regression cases (previous failures)

Sample Size Guidelines

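  • Development: Small samples (10-20 cases). Early changes tend to have effects large enough to show up in a small set
  • Pre-release: Medium samples (50-100 cases). Enough to separate smaller differences from run-to-run noise
  • Production monitoring: Continuous random sampling (1-5% of interactions)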
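
To see why sample size matters, the sketch below estimates the uncertainty of a measured pass rate using the normal approximation to the binomial. It is a rough guide only, since the approximation is weak at small sample sizes, and the 0.9 pass rate is an illustrative assumption.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (20, 100):
    print(n, round(margin_of_error(0.9, n), 3))
```

Under these assumptions the margin is about 0.13 at 20 cases and about 0.06 at 100 cases, so a small development set can only confirm large changes.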

Context Engineering Validation

Test how agents perform under different context conditions:

yaml
tests:
  - name: "Fresh context"
    context_size: 10%
    expected_accuracy: 0.95

  - name: "Moderate context"
    context_size: 50%
    expected_accuracy: 0.90

  - name: "Near limit"
    context_size: 80%
    expected_accuracy: 0.80 # Identify degradation cliff
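
Once accuracy has been measured at each context level, the degradation cliff can be located programmatically. In this sketch the `find_cliff` helper and the 0.07 drop threshold are assumptions, and the expected accuracies from the configuration above are reused as if they were measurements.

```python
from typing import List, Optional, Tuple


def find_cliff(results: List[Tuple[int, float]], max_drop: float) -> Optional[int]:
    """Return the first context fill (%) where accuracy falls by more than max_drop
    relative to the previous, smaller fill level; None if no such drop exists."""
    ordered = sorted(results)
    for (_, prev_acc), (fill, acc) in zip(ordered, ordered[1:]):
        if prev_acc - acc > max_drop:
            return fill
    return None


# (context fill in percent, accuracy) pairs, illustrative values only.
measured = [(10, 0.95), (50, 0.90), (80, 0.80)]
print(find_cliff(measured, max_drop=0.07))
```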

Harness Integration

Per-Ticket Evaluation

After each ticket completion:

markdown
## Ticket Evaluation: P1-T003

### Dimensions

- Accuracy: 1.0 - All requirements met
- Completeness: 0.8 - Missing edge case handling
- Efficiency: 0.9 - Clean execution, minimal retries
- Test Coverage: 1.0 - All paths tested
- Code Quality: 0.9 - Follows patterns, minor style issue

### Overall: PASS (0.92 weighted average)

Phase Gate Evaluation

Before moving to next phase:

markdown
## Phase 1 Evaluation Summary

Tickets completed: 12/12
Average accuracy: 0.91
Average completeness: 0.88
Test coverage: 94%
Security issues: 0
Blocking issues: 0

### Decision: PROCEED to Phase 2
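
A phase gate like the one above can be expressed as an explicit check. The thresholds (0.85 for scores, 0.90 for coverage), the `PhaseSummary` type, and the HOLD label are assumptions chosen for illustration; the source does not state the actual gate criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhaseSummary:
    tickets_completed: int
    tickets_total: int
    avg_accuracy: float
    avg_completeness: float
    test_coverage: float
    security_issues: int
    blocking_issues: int


def gate_failures(s: PhaseSummary, min_score: float = 0.85,
                  min_coverage: float = 0.90) -> List[str]:
    """List every reason the phase should not proceed; empty means the gate passes."""
    failures = []
    if s.tickets_completed < s.tickets_total:
        failures.append("tickets incomplete")
    if s.avg_accuracy < min_score:
        failures.append("accuracy below threshold")
    if s.avg_completeness < min_score:
        failures.append("completeness below threshold")
    if s.test_coverage < min_coverage:
        failures.append("test coverage below threshold")
    if s.security_issues > 0:
        failures.append("open security issues")
    if s.blocking_issues > 0:
        failures.append("open blocking issues")
    return failures


# Values from the Phase 1 summary above.
phase_1 = PhaseSummary(12, 12, 0.91, 0.88, 0.94, 0, 0)
failures = gate_failures(phase_1)
decision = "PROCEED" if not failures else "HOLD"
print(decision, failures)
```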

Production Monitoring

Continuously evaluate the deployed agent:

yaml
alerts:
  - metric: accuracy
    threshold: 0.85 # alert when accuracy falls below this value
    action: page_on_call

  - metric: error_rate
    threshold: 0.05 # alert when error rate rises above this value
    action: create_incident

dashboards:
  - daily_quality_scores
  - weekly_trend_analysis
  - failure_categorization
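
The alert rules above can be evaluated with a few lines of code. The explicit `direction` field is an assumption added for this sketch, because an accuracy alert fires when the value falls below its threshold while an error-rate alert fires when it rises above.

```python
from typing import Dict, List

# Rules mirror the configuration above, plus a hypothetical "direction" field.
ALERTS = [
    {"metric": "accuracy", "threshold": 0.85, "direction": "below", "action": "page_on_call"},
    {"metric": "error_rate", "threshold": 0.05, "direction": "above", "action": "create_incident"},
]


def triggered_actions(metrics: Dict[str, float]) -> List[str]:
    """Return the actions whose alert rule is breached by the given metrics."""
    actions = []
    for rule in ALERTS:
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # metric not reported in this window
        if rule["direction"] == "below":
            breached = value < rule["threshold"]
        else:
            breached = value > rule["threshold"]
        if breached:
            actions.append(rule["action"])
    return actions


print(triggered_actions({"accuracy": 0.83, "error_rate": 0.02}))
```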

Common Pitfalls

| Pitfall | Solution |
| --- | --- |
| Overfitting to execution paths | Evaluate outcomes, not steps |
| Single-metric obsession | Use multi-dimensional rubrics |
| Ignoring edge cases | Stratify test sets by complexity |
| Skipping human review | Sample production for human eval |
| No baseline | Establish metrics before changes |
| Evaluating only successes | Include failure analysis |

Implementation Guidelines

  1. Define quality dimensions relevant to your use case
  2. Create rubrics with actionable level descriptions
  3. Build test sets from real patterns plus edge cases
  4. Establish baselines before making changes
  5. Automate evaluation pipelines for consistency
  6. Supplement with human review for critical paths
  7. Track metrics longitudinally for trend detection
  8. Set quality gates that block bad deployments
  9. Categorize failures to identify patterns
  10. Iterate on rubrics as you learn more
