AgentSkillsCN

agentv-eval-builder

创建并维护 AgentV YAML 评估文件,以测试 AI 代理的性能。在新建评估文件、添加评估用例或配置评估器时,可使用此技能。

SKILL.md
--- frontmatter
name: agentv-eval-builder
description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring evaluators.

AgentV Eval Builder

Comprehensive docs: https://agentv.dev

Quick Start

yaml
description: Example eval
execution:
  target: default

evalcases:
  - id: greeting
    expected_outcome: Friendly greeting
    input: "Say hello"
    expected_output: "Hello! How can I help you?"
    rubrics:
      - Greeting is friendly and warm
      - Offers to help

Eval File Structure

Required: evalcases (array) Optional: description, execution, dataset

Eval case fields:

FieldRequiredDescription
idyesUnique identifier
expected_outcomeyesWhat the response should accomplish
input / input_messagesyesInput to the agent
expected_output / expected_messagesnoGold-standard reference answer
rubricsnoInline evaluation criteria
executionnoPer-case execution overrides
conversation_idnoThread grouping

Shorthand aliases:

  • input (string) expands to [{role: "user", content: "..."}]
  • expected_output (string/object) expands to [{role: "assistant", content: ...}]
  • Canonical input_messages / expected_messages take precedence when both present

Message format: {role, content} where role is system, user, assistant, or tool Content types: inline text, {type: "file", value: "./path.md"} File paths:

  • Relative (./path.md, ../dir/file.md) - resolved from eval file directory
  • Absolute (/docs/file.md) - resolved from repo root
  • Git URLs - GitHub, GitLab, Bitbucket blob/src URLs (cloned and cached automatically)

JSONL format: One eval case per line as JSON. Optional .yaml sidecar for shared defaults. See examples/features/basic-jsonl/.

Evaluator Types

Configure via execution.evaluators array. Multiple evaluators produce a weighted average score.

code_judge

yaml
- name: format_check
  type: code_judge
  script: uv run validate.py
  cwd: ./scripts          # optional working directory
  target: {}              # optional: enable LLM target proxy (max_calls: 50)

Contract: stdin JSON -> stdout JSON {score, hits, misses, reasoning} See references/custom-evaluators.md for templates.

llm_judge

yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/eval.md     # markdown template or script config
  model: gpt-5-chat            # optional model override
  config:                       # passed to script templates as context.config
    strictness: high

Variables: {{question}}, {{expected_outcome}}, {{candidate_answer}}, {{reference_answer}}, {{input_messages}}, {{expected_messages}}, {{output_messages}}

  • Markdown templates: use {{variable}} syntax
  • TypeScript templates: use definePromptTemplate(fn) from @agentv/eval, receives context object with all variables + config

composite

yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./safety.md
    - name: quality
      type: llm_judge
  aggregator:
    type: weighted_average
    weights: { safety: 0.3, quality: 0.7 }

Aggregator types: weighted_average, all_or_nothing, minimum, maximum, safety_gate

  • safety_gate: fails immediately if the named gate evaluator scores below threshold (default 1.0)

tool_trajectory

yaml
- name: tool_check
  type: tool_trajectory
  mode: any_order            # any_order | in_order | exact
  minimums:                  # for any_order
    knowledgeSearch: 2
  expected:                  # for in_order/exact
    - tool: knowledgeSearch
      args: { query: "search term" }   # partial deep equality match
    - tool: documentRetrieve
      args: any                        # any arguments accepted
      max_duration_ms: 5000            # per-tool latency assertion
    - tool: summarize                  # omit args to skip argument checking

field_accuracy

yaml
- name: fields
  type: field_accuracy
  match_type: exact          # exact | date | numeric_tolerance
  numeric_tolerance: 0.01    # for numeric_tolerance match_type
  aggregation: weighted_average  # weighted_average | all_or_nothing

Compares output_messages fields against expected_messages fields.

latency

yaml
- name: speed
  type: latency
  max_ms: 5000

cost

yaml
- name: budget
  type: cost
  max_usd: 0.10

token_usage

yaml
- name: tokens
  type: token_usage
  max_total_tokens: 4000

rubric (inline)

yaml
rubrics:
  - Simple string criterion
  - id: weighted
    expected_outcome: Detailed criterion
    weight: 2.0
    required: true

See references/rubric-evaluator.md for score-range mode and scoring formula.

CLI Commands

bash
# Run evaluation
bun agentv eval <file.yaml> [--eval-id <id>] [--target <name>] [--dry-run]

# Validate eval file
bun agentv validate <file.yaml>

# Compare results between runs
bun agentv compare <results1.jsonl> <results2.jsonl>

# Generate rubrics from expected_outcome
bun agentv generate rubrics <file.yaml> [--target <name>]

Schemas

  • Eval file: references/eval-schema.json
  • Config: references/config-schema.json