Designing Evaluations for Agents
Quick start
Collect or infer:
- •Agent capabilities to evaluate
- •Success criteria for each capability
- •Input scenarios (happy path, edge cases, adversarial)
- •Expected behaviors (not just outputs)
- •Measurable metrics
- •Pass/fail thresholds
Then produce output using TEMPLATES.md. Validate with RUBRIC.md.
Workflow
- •List agent capabilities to evaluate.
- •For each capability, define success criteria.
- •Design input scenarios covering normal, edge, and adversarial cases.
- •Define expected behaviors for each scenario.
- •Choose metrics and set thresholds.
- •Create evaluation harness or test specification.
- •Run the rubric check. Revise until it passes.
Degrees of freedom
- •Low freedom: Expected behaviors must be specific and verifiable.
- •Medium freedom: Scenario design can vary based on agent scope.
- •Allowed variation: Evaluation tooling and format as long as core criteria are testable.
Failure modes to avoid
- •Evaluating outputs only, not behaviors
- •Missing adversarial or edge case scenarios
- •Vague success criteria ("output is good")
- •No baseline or threshold defined
- •Testing happy path only
- •Metrics that don't reflect actual quality
References
- •Templates: TEMPLATES.md
- •Rubric: RUBRIC.md
- •Examples: EXAMPLES.md
- •Evaluation patterns: reference/evaluation-patterns.md
- •Metrics guide: reference/metrics-guide.md