Evals

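Objective eval metrics via code, model, or human graders, with pass@k/pass^k scoring. Use when: eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test, run eval, compare models, compare prompts, create judge, create use case, view results, failure to task, suite manager, transcript capture, trial runner.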

SKILL.md
--- frontmatter
name: Evals
description: Objective eval metrics via code/model/human graders with pass@k/pass^k scoring. USE WHEN eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test, run eval, compare models, compare prompts, create judge, create use case, view results, failure to task, suite manager, transcript capture, trial runner.

Customization

Before executing, check for user customizations at: ~/.claude/PAI/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

  1. Send voice notification:

    bash
    curl -s -X POST http://localhost:8888/notify \
      -H "Content-Type: application/json" \
      -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
      > /dev/null 2>&1 &
    
  2. Output text notification:

    code
    Running the **WorkflowName** workflow in the **Evals** skill to ACTION...
    

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.


When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"
  • "regression test", "capability test"
  • Compare agent behaviors across changes
  • Validate agent workflows before deployment
  • Verify ALGORITHM ISC rows
  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |

Evaluation Types

| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least 1 success in k trials (measures capability)
  • pass^k: Probability all k trials succeed (measures consistency/reliability)
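
To make the two metrics concrete, here is a minimal sketch. It assumes trials are independent and share one per-trial success probability `p`, which the skill does not state; the function names are illustrative and are not the API of `Tools/TrialRunner.ts`.

```typescript
// Sketch only: assumes independent trials that share one success probability p.
// passAtK, passHatK, and successRate are illustrative names.

/** pass@k: probability that at least one of k trials succeeds, 1 - (1 - p)^k. */
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

/** pass^k: probability that all k trials succeed, p^k. */
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

/** Estimate p as the fraction of observed trials that passed. */
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Example: 3 of 4 observed trials passed, so p is estimated as 0.75.
const observedRate = successRate([true, true, false, true]);
const capability = passAtK(observedRate, 3);   // 1 - 0.25^3, about 0.98
const reliability = passHatK(observedRate, 3); // 0.75^3, about 0.42
```

The same agent looks strong on pass@3 and weak on pass^3, which is why capability suites and regression suites tend to read different metrics.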

Workflow Routing

| Request Pattern | Route To |
|---|---|
| Run eval, evaluate suite, run tests, benchmark | Workflows/RunEval.md |
| Compare models, model comparison, A/B test models | Workflows/CompareModels.md |
| Compare prompts, prompt comparison, test prompts | Workflows/ComparePrompts.md |
| Create judge, model grader, evaluation judge | Workflows/CreateJudge.md |
| Create use case, new eval, test case, create suite | Workflows/CreateUseCase.md |
| View results, eval results, scores, pass rate | Workflows/ViewResults.md |
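
The routing table above could be approximated with a naive keyword matcher. The trigger phrases and workflow paths are taken from the table; the matching logic, `ROUTES`, and `routeRequest` are assumptions for illustration, since the skill most likely relies on the agent's own judgment rather than string matching.

```typescript
// Sketch only: a naive keyword router. Trigger phrases and workflow paths come
// from the routing table; the matching strategy is an assumption.
const ROUTES: Array<{ triggers: string[]; workflow: string }> = [
  { triggers: ["compare models", "model comparison", "a/b test models"], workflow: "Workflows/CompareModels.md" },
  { triggers: ["compare prompts", "prompt comparison", "test prompts"], workflow: "Workflows/ComparePrompts.md" },
  { triggers: ["create judge", "model grader", "evaluation judge"], workflow: "Workflows/CreateJudge.md" },
  { triggers: ["create use case", "new eval", "test case", "create suite"], workflow: "Workflows/CreateUseCase.md" },
  { triggers: ["view results", "eval results", "scores", "pass rate"], workflow: "Workflows/ViewResults.md" },
  // Checked last because its triggers ("run tests", "benchmark") are the most generic.
  { triggers: ["run eval", "evaluate suite", "run tests", "benchmark"], workflow: "Workflows/RunEval.md" },
];

/** Return the first workflow whose trigger phrase appears in the request, or null. */
function routeRequest(request: string): string | null {
  const text = request.toLowerCase();
  for (const route of ROUTES) {
    if (route.triggers.some((t) => text.indexOf(t) !== -1)) return route.workflow;
  }
  return null;
}
```

For example, `routeRequest("Compare models on the auth suite")` returns `Workflows/CompareModels.md`, and a request with no known trigger returns `null`.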

CLI Quick Reference

| Trigger | Tool |
|---|---|
| Run suite | Tools/AlgorithmBridge.ts |
| Log failure | Tools/FailureToTask.ts log |
| Convert failures | Tools/FailureToTask.ts convert-all |
| Create suite | Tools/SuiteManager.ts create |
| Check saturation | Tools/SuiteManager.ts check-saturation |

Quick Reference

CLI Commands

bash
# Run an eval suite
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Utilities/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Utilities/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

bash
# Run eval and update ISC row
bun run ~/.claude/skills/Utilities/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

code
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
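
As an illustration of how the `eval:` convention in the Verify column might be consumed, the sketch below extracts suite names from an ISC table. The `eval:` prefix comes from the table above; `IscRow` and `parseIscRows` are hypothetical and do not describe how `Tools/AlgorithmBridge.ts` actually works.

```typescript
// Sketch only: a hypothetical parser for ISC rows. Only the "eval:" prefix is
// taken from the document; everything else is an assumption.
interface IscRow {
  id: number;
  ideal: string;
  suite: string | null; // suite name when Verify is "eval:<suite>", else null
}

function parseIscRows(markdownTable: string): IscRow[] {
  const rows: IscRow[] = [];
  for (const line of markdownTable.split("\n")) {
    // Turn "| 1 | text | eval:x |" into trimmed cells, dropping the outer pipes.
    const cells = line.trim().replace(/^\||\|$/g, "").split("|").map((c) => c.trim());
    if (cells.length < 3) continue;
    if (!/^\d+$/.test(cells[0])) continue; // skips the header and separator rows
    const verify = cells[2];
    const suite = verify.indexOf("eval:") === 0 ? verify.slice("eval:".length) : null;
    rows.push({ id: Number(cells[0]), ideal: cells[1], suite });
  }
  return rows;
}
```

Rows whose Verify cell uses another method come back with `suite: null`, so a caller can run evals only for the rows that request them.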

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
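
To show why code-based graders are fast and deterministic, here is a minimal sketch of three of them. `GradeResult` and the function signatures are hypothetical and are not the types exported by `Types/index.ts`; the ordered-with-gaps rule for tool calls is also an assumption.

```typescript
// Sketch only: GradeResult and these signatures are hypothetical.
interface GradeResult {
  passed: boolean;
  detail: string;
}

/** string_match: pass when the output contains the expected substring exactly. */
function stringMatch(output: string, expected: string): GradeResult {
  const passed = output.indexOf(expected) !== -1;
  return { passed, detail: passed ? `found "${expected}"` : `missing "${expected}"` };
}

/** regex_match: pass when the pattern matches anywhere in the output. */
function regexMatch(output: string, pattern: RegExp): GradeResult {
  // Note: a pattern with the g flag keeps state between calls; avoid it here.
  const passed = pattern.test(output);
  return { passed, detail: `${pattern} ${passed ? "matched" : "did not match"}` };
}

/** tool_calls: pass when the expected tools appear in order (extra calls allowed). */
function toolCallsInOrder(actual: string[], expected: string[]): GradeResult {
  let next = 0;
  for (const call of actual) {
    if (next < expected.length && call === expected[next]) next++;
  }
  const passed = next === expected.length;
  return { passed, detail: `${next}/${expected.length} expected calls seen in order` };
}
```

Each grader is a pure function of the transcript, so repeated runs over the same transcript give the same verdict.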

Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
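
The position swap in `pairwise_comparison` guards against a judge that favors whichever answer it sees first. A minimal sketch follows, assuming a synchronous `Judge` callback in place of a real LLM call; the type names and the rule of reporting a winner only when both orderings agree are assumptions, and the actual grader may resolve disagreements differently.

```typescript
// Sketch only: Judge is a hypothetical stand-in for an LLM call (which would
// normally be async). The tie-breaking rule is an assumption.
type Verdict = "first" | "second" | "tie";
type Judge = (first: string, second: string) => Verdict;
type PairwiseResult = "candidate" | "reference" | "inconclusive";

/**
 * Ask the judge twice with the positions swapped. A winner is reported only
 * when both orderings agree, which filters out verdicts driven by position.
 */
function pairwiseWithSwap(judge: Judge, candidate: string, reference: string): PairwiseResult {
  const forward = judge(candidate, reference);  // candidate shown first
  const backward = judge(reference, candidate); // candidate shown second
  if (forward === "first" && backward === "second") return "candidate";
  if (forward === "second" && backward === "first") return "reference";
  return "inconclusive";
}

// Toy judges for illustration: one prefers the longer answer, the other always
// prefers whatever is shown first (pure position bias).
const longerWins: Judge = (a, b) =>
  a.length === b.length ? "tie" : a.length > b.length ? "first" : "second";
const firstWins: Judge = () => "first";

const fair = pairwiseWithSwap(longerWins, "a detailed answer", "short");  // "candidate"
const biased = pairwiseWithSwap(firstWins, "a detailed answer", "short"); // "inconclusive"
```

The biased judge never produces a winner under this rule, at the cost of doubling the number of judge calls per comparison.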

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.


Task Schema (YAML)

yaml
task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30

    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]

    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md

  trials: 3
  pass_threshold: 0.75
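
The schema does not say how grader weights and `pass_threshold` combine. One plausible reading, sketched below, is that each grader returns a score in [0, 1] and a trial passes when the weighted average reaches the threshold. This aggregation rule, the `GraderScore` type, and the example scores are assumptions; only the weights and the 0.75 threshold come from the schema.

```typescript
// Sketch only: assumes each grader returns a score in [0, 1] and that a trial
// passes when the weighted average reaches pass_threshold. The skill does not
// document its actual aggregation rule.
interface GraderScore {
  type: string;
  weight: number;
  score: number; // 0 = fail, 1 = full marks
}

function weightedScore(scores: GraderScore[]): number {
  const totalWeight = scores.reduce((sum, g) => sum + g.weight, 0);
  if (totalWeight === 0) return 0;
  // Normalize so the weights need not sum to exactly 1.
  return scores.reduce((sum, g) => sum + g.weight * g.score, 0) / totalWeight;
}

function trialPasses(scores: GraderScore[], passThreshold: number): boolean {
  return weightedScore(scores) >= passThreshold;
}

// Weights from the schema; the scores are made-up illustrative values.
const exampleTrial: GraderScore[] = [
  { type: "binary_tests", weight: 0.3, score: 1 },
  { type: "tool_calls", weight: 0.2, score: 1 },
  { type: "llm_rubric", weight: 0.5, score: 0.6 },
];
const exampleScore = weightedScore(exampleTrial);       // about 0.80
const examplePasses = trialPasses(exampleTrial, 0.75);  // true
```

Under this reading, a trial whose tests fail outright can score at most 0.70 and so cannot reach the 0.75 threshold, even with a perfect rubric score.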

Resource Index

| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  1. Start with 20-50 real failures - Don't overthink, capture what actually broke
  2. Unambiguous tasks - Two experts should reach identical verdicts
  3. Balanced problem sets - Test both "should do" AND "should NOT do"
  4. Grade outputs, not paths - Don't penalize valid creative solutions
  5. Calibrate LLM judges - Against human expert judgment
  6. Check transcripts regularly - Verify graders work correctly
  7. Monitor saturation - Graduate to regression when hitting 95%+
  8. Build infrastructure early - Evals shape how quickly you can adopt new models
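
Principle 7 can be sketched as a simple check over recent pass rates. The 95% threshold is from the list above; requiring three consecutive runs and the name `isSaturated` are assumptions, not the behavior of `Tools/SuiteManager.ts check-saturation`.

```typescript
// Sketch only: the 0.95 threshold comes from principle 7; the three-run
// requirement and the function name are assumptions.
function isSaturated(recentPassRates: number[], threshold = 0.95, runs = 3): boolean {
  if (recentPassRates.length < runs) return false; // not enough history yet
  // Saturated only if every one of the most recent runs met the threshold.
  return recentPassRates.slice(-runs).every((rate) => rate >= threshold);
}

const readyToGraduate = isSaturated([0.7, 0.96, 0.97, 1.0]); // true
const stillImproving = isSaturated([0.96, 0.9, 0.97]);       // false
```

Looking at several recent runs instead of one avoids graduating a capability suite to regression on the strength of a single lucky run.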

Related

  • ALGORITHM: Evals is a verification method
  • Science: Evals implements the scientific method
  • Browser: For visual verification graders