eval

Run the eval framework to measure the quality of agent output.

SKILL.md

--- frontmatter

name: eval
description: Run the eval framework to measure agent output quality.
argument-hint: "[scenario.yaml] or leave blank for all"
allowed-tools: Read, Glob, Grep, Bash, Task
model: sonnet

/eval — Run Agent Quality Evals

Execute eval scenarios to measure agent output quality mechanically.

What It Does

•Reads scenario definitions from evals/scenarios/
•For each scenario: sets up the workspace, runs the agent, and checks the output
•Scores each scenario on four kinds of checks: expected files exist, content matches, forbidden patterns are absent, and commands pass
•Saves results to evals/results/ as timestamped JSON
•Prints summary scorecard
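
The four check types listed above can be sketched in plain bash. This is a minimal illustration, not the actual contents of scripts/eval.sh, which is not shown here. The `record` helper, the temporary workspace, and the `calc.py` file are all hypothetical stand-ins for what an agent run might produce.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the four check types; NOT the real scripts/eval.sh.
set -u

pass=0
fail=0

# Record one check result and print a scorecard line.
# usage: record <label> <exit-status>
record() {
  if [ "$2" -eq 0 ]; then
    pass=$((pass + 1)); echo "PASS  $1"
  else
    fail=$((fail + 1)); echo "FAIL  $1"
  fi
}

# Demo workspace standing in for the output of an agent run (assumption).
workspace=$(mktemp -d)
printf 'def add(a, b):\n    return a + b\n' > "$workspace/calc.py"

# 1. Expected file exists.
[ -f "$workspace/calc.py" ]; record "file exists: calc.py" $?

# 2. Content matches a required pattern.
grep -q 'def add' "$workspace/calc.py"; record "content matches: def add" $?

# 3. Forbidden pattern is absent (the negation passes when grep finds nothing).
! grep -q 'TODO' "$workspace/calc.py"; record "forbidden absent: TODO" $?

# 4. Command passes (exit status 0) when run inside the workspace.
( cd "$workspace" && test -s calc.py ); record "command passes: test -s calc.py" $?

echo "score: $pass/$((pass + fail))"
rm -rf "$workspace"
```

Each check reduces to an exit status, which is what makes the scoring mechanical: no judgment about the output is needed beyond whether a command succeeded.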

Usage

Run all scenarios:

```bash
./scripts/eval.sh
```

Run a specific scenario:

```bash
./scripts/eval.sh evals/scenarios/$ARGUMENTS
```

After Running

•Check evals/results/latest.json for detailed results
•If any scenario failed, investigate which checks failed
•Use failures to improve CLAUDE.md instructions or pre-commit hooks
•Track trends: are agents getting better over time?
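
One rough way to track trends is to count failed checks per result file. The sketch below assumes a schema in which each failed check appears as the literal text `"passed": false`. That schema, the sample file names, and the sample contents are invented for illustration, so adjust the pattern to whatever evals/results/ actually contains.

```shell
# Assumption: a failed check is recorded as the literal text "passed": false.
# The real result schema may differ; change the grep pattern to match it.
results_dir=$(mktemp -d)   # stand-in for evals/results/

# Two made-up result files so the loop below has something to read.
printf '{"scenario": "a", "checks": [{"name": "file", "passed": true}]}\n' \
  > "$results_dir/2024-01-01T00-00-00.json"
printf '{"scenario": "a", "checks": [{"name": "file", "passed": false}]}\n' \
  > "$results_dir/2024-01-02T00-00-00.json"

# Timestamped names sort lexically, so the glob visits runs oldest first.
for f in "$results_dir"/*.json; do
  failed=$(grep -o '"passed": false' "$f" | wc -l | tr -d ' ')
  echo "$(basename "$f") failed_checks=$failed"
done
```

A falling count across runs suggests that changes to CLAUDE.md or the pre-commit hooks are helping.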

Adding New Scenarios

Create a YAML file in evals/scenarios/ following the format in evals/README.md.

Each scenario needs:

•name: what we're testing
•setup: commands to prepare the workspace
•prompt: what to tell the agent
•checks: mechanical verification of output
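
As an illustration, the snippet below writes a scenario file containing the four required keys. Only the top-level keys (`name`, `setup`, `prompt`, `checks`) come from this page. The file name, the scenario itself, and the key names inside `checks` are hypothetical, so treat evals/README.md as the authority on the real format.

```shell
# Create the scenarios directory if it does not exist yet.
mkdir -p evals/scenarios

# Write a hypothetical scenario; the quoted 'EOF' keeps the shell from
# expanding anything inside the YAML.
cat > evals/scenarios/example-add-function.yaml <<'EOF'
# Only the four top-level keys are taken from the skill description.
name: add-function
setup:
  - git init -q
prompt: "Create calc.py with a function add(a, b) that returns the sum."
checks:
  # The check key names below are assumptions; see evals/README.md.
  - file_exists: calc.py
  - content_matches: "def add"
  - forbidden_pattern: "TODO"
  - command: "python3 -c 'import calc; assert calc.add(1, 2) == 3'"
EOF
```

Once the file matches the documented format, it can be run on its own with `./scripts/eval.sh evals/scenarios/example-add-function.yaml`.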