designing-evaluations-for-agents

设计评估框架，全面衡量代理的行为表现与服务质量。适用于为基于大模型的代理创建测试套件、性能基准，或质量评估体系时使用。

SKILL.md

--- frontmatter

name: designing-evaluations-for-agents
description: Design evaluation frameworks that measure agent behavior and quality. Use when creating test suites, benchmarks, or quality assessments for LLM-based agents.

Designing Evaluations for Agents

Quick start

Collect or infer:

•Agent capabilities to evaluate
•Success criteria for each capability
•Input scenarios (happy path, edge cases, adversarial)
•Expected behaviors (not just outputs)
•Measurable metrics
•Pass/fail thresholds

Then produce output using TEMPLATES.md. Validate with RUBRIC.md.

Workflow

•List agent capabilities to evaluate.
•For each capability, define success criteria.
•Design input scenarios covering normal, edge, and adversarial cases.
•Define expected behaviors for each scenario.
•Choose metrics and set thresholds.
•Create evaluation harness or test specification.
•Run the rubric check. Revise until it passes.

Degrees of freedom

•Low freedom: Expected behaviors must be specific and verifiable.
•Medium freedom: Scenario design can vary based on agent scope.
•Allowed variation: Evaluation tooling and format as long as core criteria are testable.

Failure modes to avoid

•Evaluating outputs only, not behaviors
•Missing adversarial or edge case scenarios
•Vague success criteria ("output is good")
•No baseline or threshold defined
•Testing happy path only
•Metrics that don't reflect actual quality

References

•Templates: TEMPLATES.md
•Rubric: RUBRIC.md
•Examples: EXAMPLES.md
•Evaluation patterns: reference/evaluation-patterns.md
•Metrics guide: reference/metrics-guide.md