
agent-evaluation

Instructs OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and present the results in table format.

SKILL.md
---
name: agent-evaluation
description: Instruct OpenCode to run VibeTeam agent benchmarks using agents/benchmark.py and report results in table format
license: MIT
compatibility: opencode
metadata:
  audience: developers
  workflow: evaluation
---

Agent Evaluation Skill

OpenCode evaluates VibeTeam agents using agents/benchmark.py and reports results in table format.

Workflow

  1. Run each agent (AutoGen, CrewAI, OpenHands) with the given task
  2. Call ComparativeEvaluator.evaluate() from agents/benchmark.py
  3. Present results in the table format below
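
The three steps above can be sketched as a small driver. This is only an illustration: the `Runner` type and the `evaluate` and `render` callables are hypothetical stand-ins, and the real signature of `ComparativeEvaluator.evaluate()` in agents/benchmark.py may differ.

```python
from typing import Callable, Dict

# Hypothetical type: a runner takes the task text and returns the agent's output.
Runner = Callable[[str], str]


def run_frameworks(task: str, runners: Dict[str, Runner]) -> Dict[str, str]:
    """Step 1: run every framework on the same task."""
    outputs: Dict[str, str] = {}
    for name, run in runners.items():
        try:
            outputs[name] = run(task)
        except Exception as exc:
            # Keep the failure as the output so the judge can still score it.
            outputs[name] = f"ERROR: {exc}"
    return outputs


def evaluate_task(
    task: str,
    runners: Dict[str, Runner],
    evaluate: Callable[[str, Dict[str, str]], object],
    render: Callable[[object], str],
) -> str:
    outputs = run_frameworks(task, runners)
    # Step 2: `evaluate` stands in for ComparativeEvaluator.evaluate() (assumed shape).
    results = evaluate(task, outputs)
    # Step 3: `render` turns the results into the required table format.
    return render(results)
```

Passing the evaluator and renderer in as callables keeps the sketch independent of the actual benchmark module, so it can be exercised with fakes before the real agents are wired in.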

Required Output Format

Evaluation Results

| Framework | Input | Output | Score | Feedback | Recommendations |
| --- | --- | --- | --- | --- | --- |
| AutoGen | {task} | {first 100 chars}... | 4/5 | {judge feedback} | {specific improvements} |
| CrewAI | {task} | {first 100 chars}... | 3/5 | {judge feedback} | {specific improvements} |
| OpenHands | {task} | {first 100 chars}... | 5/5 | {judge feedback} | {specific improvements} |
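
A minimal sketch of how the results rows could be rendered as a Markdown table, including the 100-character output preview. The dictionary keys (`framework`, `input`, `output`, `score`, `feedback`, `recommendations`) are hypothetical and not taken from agents/benchmark.py, and appending `...` only when the output was actually cut is an assumption.

```python
from typing import Dict, List

COLUMNS = ["Framework", "Input", "Output", "Score", "Feedback", "Recommendations"]


def truncate(text: str, limit: int = 100) -> str:
    """Collapse whitespace and keep at most `limit` characters."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[:limit] + "..."


def cell(value: object) -> str:
    """Escape pipes so a value cannot break the table row."""
    return str(value).replace("|", "\\|")


def render_results(rows: List[Dict[str, object]]) -> str:
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join(" --- " for _ in COLUMNS) + "|",
    ]
    for row in rows:
        values = [
            row["framework"],
            row["input"],
            truncate(str(row["output"])),
            f"{row['score']}/5",
            row["feedback"],
            row["recommendations"],
        ]
        lines.append("| " + " | ".join(cell(v) for v in values) + " |")
    return "\n".join(lines)
```

Collapsing whitespace before truncating keeps multi-line agent output on a single table row.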

Summary

| Metric | Value |
| --- | --- |
| Winner | {framework name} |
| Reasoning | {why winner was chosen} |
| Judge Model | {model used, e.g. gpt-5-2} |
| Eval Time | {milliseconds} |

Scoring Scale

| Score | Meaning |
| --- | --- |
| 0 | Failed/error/refusal |
| 1 | Mostly wrong |
| 2 | Partial, missing key elements |
| 3 | Acceptable |
| 4 | Good, comprehensive |
| 5 | Excellent |
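
The scale can be kept next to the reporting code as a lookup, so an out-of-range judge score is caught before it reaches the table. This is a sketch; `SCORE_MEANINGS` and `format_score` are illustrative names, not part of agents/benchmark.py.

```python
SCORE_MEANINGS = {
    0: "Failed/error/refusal",
    1: "Mostly wrong",
    2: "Partial, missing key elements",
    3: "Acceptable",
    4: "Good, comprehensive",
    5: "Excellent",
}


def format_score(score: int) -> str:
    """Return the score as 'N/5', rejecting anything outside the 0-5 integer scale."""
    is_plain_int = isinstance(score, int) and not isinstance(score, bool)
    if not is_plain_int or score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return f"{score}/{max(SCORE_MEANINGS)}"
```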

CLI Alternative

```bash
cd ~/workspace/vibebrowser/VibeTeam
set -a && source .env && set +a
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
```

Predefined tasks: sentry-weekly-summary, github-issue-triage, release-notes
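
When the CLI is driven from a script, the same invocation can be assembled in Python. The sketch below uses only the `--tasks` and `--frameworks` flags shown above and passes a single task, since whether `--tasks` accepts several values is not stated here. Rejecting anything outside the predefined task list is an assumption meant to catch typos, and `build_command` and `run_benchmark` are hypothetical helper names. The environment variables are assumed to be loaded in the calling process already.

```python
import subprocess
import sys
from typing import List, Sequence

PREDEFINED_TASKS = ("sentry-weekly-summary", "github-issue-triage", "release-notes")
FRAMEWORKS = ("autogen", "crewai", "openhands")


def build_command(task: str, frameworks: Sequence[str] = FRAMEWORKS) -> List[str]:
    """Build the argv list for `python -m agents.benchmark`."""
    if task not in PREDEFINED_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {PREDEFINED_TASKS}")
    unknown = [name for name in frameworks if name not in FRAMEWORKS]
    if unknown:
        raise ValueError(f"unknown frameworks: {unknown}")
    return [
        sys.executable, "-m", "agents.benchmark",
        "--tasks", task,
        "--frameworks", *frameworks,
    ]


def run_benchmark(task: str, repo_dir: str) -> int:
    """Run the benchmark from the repository root and return its exit code."""
    return subprocess.run(build_command(task), cwd=repo_dir).returncode
```

Building the command as a list avoids shell quoting issues, and `sys.executable` reuses the interpreter that is running the script.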


Key Files

| File | Purpose |
| --- | --- |
| agents/benchmark.py | ComparativeEvaluator, QualityEvaluator |
| agents/benchmark.py:286 | QualityEvaluator class |
| agents/benchmark.py:459 | ComparativeEvaluator class |

Environment

Required before running:

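```bash
# set -a exports every variable defined while sourcing, so the Python process inherits them
set -a && source .env && set +a
```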

Variables: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT
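
A quick pre-flight check can confirm these variables are visible to Python before a benchmark run starts. This is a sketch; `missing_env_vars` is an illustrative helper and not part of agents/benchmark.py.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_env_vars(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```

If the check reports variables as missing right after `source .env`, the likely cause is that they were set in the shell but not exported to child processes.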