AgentSkillsCN

adk-evaluate

精通ADK智能体的评估、轨迹分析与基于指标的测试。可用于为AI智能体实施单元测试与集成测试,衡量工具精度,以及开展基础性验证。

SKILL.md
--- frontmatter
name: adk-evaluate
description: Expert in ADK agent evaluation, trajectory analysis, and metric-based testing in Python. Use for implementing unit/integration tests for AI agents, measuring tool accuracy, and grounding checks.

ADK Evaluation Specialist (Python Edition)

Philosophy & Architecture

Agent evaluation requires qualitative assessment of both the final output and the trajectory (sequence of steps/tool calls).

Evaluation Methods

  1. Test Files (*.test.json):
    • Single-turn sessions for rapid development.
    • Ideal for unit testing via pytest.
  2. Evalsets (*.evalset.json):
    • Complex, multi-turn sessions for integration testing.
    • Ideal for regression testing in CI/CD pipelines via adk eval.

Core Metrics

  • tool_trajectory_avg_score: Exact match of expected tool calls (0.0 - 1.0).
  • response_match_score: ROUGE-1 similarity.
  • final_response_match_v2: LLM-judged semantic match.
  • hallucinations_v1: Groundedness against context.
  • safety_v1: Harmlessness assessment.
  • Read references/evaluate.md for full metric details and data schemas.

Execution

  • adk web: Interactive UI for recording sessions and visual comparisons.
  • adk eval: CLI command for automated batch testing.
  • pytest: Integration with AgentEvaluator.evaluate(...) for CI/CD.

Success Criteria

  • Valid JSON test files adhering to the ADK Pydantic schema.
  • Comprehensive coverage of critical "happy path" trajectories.
  • Successful integration with automated CI triggers.