Agent Evaluation
Systematic evaluation of agent systems for quality assurance and continuous improvement.
When to Use
- •Setting up quality gates for agent deployments
- •Measuring agent performance after changes
- •Building test frameworks for agent behavior
- •Debugging inconsistent agent results
- •Comparing different agent configurations
- •Establishing baselines before optimization
Core Challenge
Agent evaluation differs fundamentally from standard software testing:
- •Non-determinism - Agents may reach identical goals through different paths
- •Context dependency - Failures emerge subtly based on conversation history
- •No single correct answer - Multiple valid responses exist for most tasks
- •Composite quality - Performance isn't unidimensional
The 95% Finding
Research on agent evaluation identified three factors explaining performance variance:
| Factor | Variance Explained |
|---|---|
| Token usage | ~80% |
| Tool call frequency | ~10% |
| Model selection | ~5% |
Key insight: Context budgets matter more than architectural choices early in development. Optimize token usage before changing models or architectures.
Multi-Dimensional Rubric Framework
Don't reduce quality to a single score. Assess multiple dimensions:
Recommended Dimensions
| Dimension | What It Measures |
|---|---|
| Factual Accuracy | Claims match ground truth |
| Completeness | Output covers all requested aspects |
| Efficiency | Reasonable tool usage, no unnecessary steps |
| Process Quality | Followed best practices, clean execution |
| Code Quality | If code: passes tests, follows patterns, maintainable |
| Safety | No security issues, respects constraints |
Scoring Levels
For each dimension, define clear levels:
Excellent (1.0): Exceeds requirements, no issues Good (0.8): Meets requirements, minor issues Acceptable (0.6): Mostly meets requirements, some gaps Poor (0.4): Significant gaps, partial completion Failed (0.0): Did not meet requirements
Weight dimensions based on your use case. Safety might be pass/fail while completeness is graded.
Evaluation Methods
1. LLM-as-Judge
Use another LLM to evaluate agent outputs. Scales to large test sets.
Prompt structure:
## Task Description
{original_task}
## Ground Truth (if available)
{expected_outcome}
## Agent Output
{agent_response}
## Evaluation Criteria
- Factual accuracy: Are claims correct?
- Completeness: Are all aspects addressed?
- Process quality: Was execution clean?
Rate each dimension 0.0-1.0 with brief justification.
Limitations:
- •Can miss subtle hallucinations
- •May not catch domain-specific errors
- •Supplement with human review for critical systems
2. Human Evaluation
Essential for catching what automation misses:
- •Unusual hallucination patterns
- •System-level failures
- •Subtle biases
- •UX issues in agent responses
When to use:
- •Production sampling (random % of interactions)
- •New capability launches
- •After significant changes
- •Edge cases and failures
3. End-State Evaluation
For state-mutating agents (file edits, deployments), focus on final state:
# Instead of evaluating execution steps:
assert agent.steps == expected_steps # Brittle
# Evaluate the outcome:
assert file_exists("output.ts")
assert tests_pass("output.ts")
assert no_lint_errors("output.ts")
4. Regression Testing
Track metrics over time to catch degradation:
baseline: accuracy: 0.92 completeness: 0.88 efficiency: 0.85 current: accuracy: 0.89 # -3% - investigate completeness: 0.90 # +2% - good efficiency: 0.82 # -3% - acceptable variance
Test Set Design
Stratify by Complexity
| Level | Characteristics | % of Set |
|---|---|---|
| Simple | Single tool, clear goal | 30% |
| Medium | Multiple tools, some ambiguity | 40% |
| Complex | Many tools, requires planning | 20% |
| Edge cases | Known failure modes | 10% |
Source Test Cases From
- •Real user interactions (anonymized)
- •Known edge cases and failure modes
- •Adversarial examples
- •Regression cases (previous failures)
Sample Size Guidelines
- •Development: Small samples (10-20) - changes show large effects
- •Pre-release: Medium samples (50-100) - statistical significance
- •Production monitoring: Continuous random sampling (1-5%)
Context Engineering Validation
Test how agents perform under different context conditions:
tests:
- name: "Fresh context"
context_size: 10%
expected_accuracy: 0.95
- name: "Moderate context"
context_size: 50%
expected_accuracy: 0.90
- name: "Near limit"
context_size: 80%
expected_accuracy: 0.80 # Identify degradation cliff
Harness Integration
Per-Ticket Evaluation
After each ticket completion:
## Ticket Evaluation: P1-T003 ### Dimensions - Accuracy: 1.0 - All requirements met - Completeness: 0.8 - Missing edge case handling - Efficiency: 0.9 - Clean execution, minimal retries - Test Coverage: 1.0 - All paths tested - Code Quality: 0.9 - Follows patterns, minor style issue ### Overall: PASS (0.92 weighted average)
Phase Gate Evaluation
Before moving to next phase:
## Phase 1 Evaluation Summary Tickets completed: 12/12 Average accuracy: 0.91 Average completeness: 0.88 Test coverage: 94% Security issues: 0 Blocking issues: 0 ### Decision: PROCEED to Phase 2
Production Monitoring
Continuous evaluation of deployed agent:
alerts:
- metric: accuracy
threshold: 0.85
action: page_on_call
- metric: error_rate
threshold: 0.05
action: create_incident
dashboards:
- daily_quality_scores
- weekly_trend_analysis
- failure_categorization
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Overfitting to execution paths | Evaluate outcomes, not steps |
| Single-metric obsession | Use multi-dimensional rubrics |
| Ignoring edge cases | Stratify test sets by complexity |
| Skipping human review | Sample production for human eval |
| No baseline | Establish metrics before changes |
| Evaluating only successes | Include failure analysis |
Implementation Guidelines
- •Define quality dimensions relevant to your use case
- •Create rubrics with actionable level descriptions
- •Build test sets from real patterns plus edge cases
- •Establish baselines before making changes
- •Automate evaluation pipelines for consistency
- •Supplement with human review for critical paths
- •Track metrics longitudinally for trend detection
- •Set quality gates that block bad deployments
- •Categorize failures to identify patterns
- •Iterate on rubrics as you learn more
References
- •Agent-Skills-for-Context-Engineering: Evaluation
- •BrowseComp benchmark methodology
- •LLM-as-Judge research papers