/eval — Run Agent Quality Evals
Execute eval scenarios to measure agent output quality mechanically.
What It Does
- Reads scenario definitions from evals/scenarios/
- For each scenario: sets up the workspace, runs the agent, and checks the output
- Scores each check: files exist, content matches, forbidden patterns absent, commands pass
- Saves results to evals/results/ as timestamped JSON
- Prints a summary scorecard
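
The four check types listed above can be sketched as small shell helpers. This is only an illustration of what "mechanical" scoring means here; the function names (`check_file_exists`, `check_content_matches`, `check_forbidden_absent`, `check_command_passes`, `run_check`) are hypothetical, and the real scripts/eval.sh may be implemented quite differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not taken from scripts/eval.sh.

# Pass if the file exists.
check_file_exists() {
  [ -f "$1" ]
}

# Pass if the file has at least one line matching the extended regex.
check_content_matches() {
  grep -Eq -- "$2" "$1" 2>/dev/null
}

# Pass if the file exists and has no line matching the forbidden regex.
check_forbidden_absent() {
  [ -f "$1" ] && ! grep -Eq -- "$2" "$1"
}

# Pass if the command exits with status 0; its output is discarded.
check_command_passes() {
  "$@" >/dev/null 2>&1
}

# Run one check, print PASS or FAIL, and keep counts for a scorecard.
passed=0
failed=0
run_check() {
  local label="$1"
  shift
  if "$@"; then
    passed=$((passed + 1))
    echo "PASS  $label"
  else
    failed=$((failed + 1))
    echo "FAIL  $label"
  fi
}
```

A check would then be invoked as, for example, `run_check "README exists" check_file_exists README.md`, with `$passed` and `$failed` available afterwards for the summary.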
Usage
Run all scenarios:
```bash
./scripts/eval.sh
```
Run a specific scenario:
```bash
./scripts/eval.sh evals/scenarios/$ARGUMENTS
```
After Running
- Check evals/results/latest.json for detailed results
- If any scenario failed, investigate which checks failed
- Use failures to improve CLAUDE.md instructions or pre-commit hooks
- Track trends: are agents getting better over time?
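
For trend tracking, one possible starting point is to list the most recent timestamped result files so that consecutive runs can be compared. The sketch below assumes the result filenames sort chronologically (for example, ISO-8601-style timestamps); that naming scheme and the helper name `recent_results` are assumptions, not something this document specifies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the N newest timestamped result files, newest first.
# Assumes filenames sort chronologically; latest.json is skipped.
recent_results() {
  local dir="${1:-evals/results}"
  local count="${2:-5}"
  local f
  for f in "$dir"/*.json; do
    [ -f "$f" ] || continue                       # no matches: glob stays literal
    [ "${f##*/}" = "latest.json" ] && continue    # skip the latest.json file
    printf '%s\n' "$f"
  done | sort -r | head -n "$count"
}
```

Calling `recent_results evals/results 2` would print the two newest files, which can then be compared with `diff`. To read a single result file comfortably, `python3 -m json.tool evals/results/latest.json` pretty-prints it without assuming anything about its fields.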
Adding New Scenarios
Create a YAML file in evals/scenarios/ following the format in evals/README.md.
Each scenario needs:
- name: what we're testing
- setup: commands to prepare the workspace
- prompt: what to tell the agent
- checks: mechanical verification of output
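
As a rough illustration, a scenario file with these four keys could be created as shown below. Only the top-level keys (`name`, `setup`, `prompt`, `checks`) come from this document. The filename, the values, and the nested check keys (`file_exists`, `forbidden_pattern`, `command`) are hypothetical placeholders; evals/README.md remains the authority on the actual format.

```bash
#!/usr/bin/env bash
# Write a hypothetical scenario file. Everything nested under the four
# top-level keys is an illustrative guess, not the documented schema.
mkdir -p evals/scenarios
cat > evals/scenarios/creates-readme.yaml <<'EOF'
name: creates-readme
setup:
  - git init -q
prompt: "Create a README.md that describes this project."
checks:
  - file_exists: README.md
  - forbidden_pattern: "TODO"
  - command: "test -s README.md"
EOF
```

The quoted heredoc delimiter (`'EOF'`) keeps the shell from expanding anything inside the YAML, so the file is written exactly as shown.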