Agent Evaluation Skill
OpenCode evaluates VibeTeam agents using agents/benchmark.py and reports results in table format.
Workflow
- Run each agent (AutoGen, CrewAI, OpenHands) with the given task
- Call ComparativeEvaluator.evaluate() from agents/benchmark.py
- Present results in the table format below
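The three workflow steps can be sketched roughly as follows. This is an illustration only: the real signature of ComparativeEvaluator.evaluate() is not shown in this document, so the sketch accepts any callable named `evaluate` that takes the task and a mapping of framework name to output. `run_workflow`, `runners`, and the `ERROR:` prefix are hypothetical names chosen for the example.

```python
from typing import Callable, Dict

# Framework names as they appear in the results table.
FRAMEWORKS = ("AutoGen", "CrewAI", "OpenHands")


def run_workflow(
    task: str,
    runners: Dict[str, Callable[[str], str]],
    evaluate: Callable[[str, Dict[str, str]], dict],
) -> dict:
    """Run every framework on the same task, then pass all outputs to the evaluator.

    `runners` maps a framework name to a function that executes the agent
    (hypothetical; wire in the real agent entry points here).
    `evaluate` stands in for ComparativeEvaluator.evaluate(); its actual
    signature is an assumption of this sketch.
    """
    outputs: Dict[str, str] = {}
    for name in FRAMEWORKS:
        try:
            outputs[name] = runners[name](task)
        except Exception as exc:
            # Keep the failure so the framework still gets a row in the table.
            outputs[name] = f"ERROR: {exc}"
    return evaluate(task, outputs)
```

Running all agents before evaluating means the judge sees every output for the same task in one call, which is what a comparative evaluation needs.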
Required Output Format
Evaluation Results
| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {first 100 chars}... | 4/5 | {judge feedback} | {specific improvements} |
| CrewAI | {task} | {first 100 chars}... | 3/5 | {judge feedback} | {specific improvements} |
| OpenHands | {task} | {first 100 chars}... | 5/5 | {judge feedback} | {specific improvements} |
Summary
| Metric | Value |
|---|---|
| Winner | {framework name} |
| Reasoning | {why winner was chosen} |
| Judge Model | {model used, e.g. gpt-5-2} |
| Eval Time | {milliseconds} |
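A minimal sketch of how the two tables above could be rendered from evaluation data. The `Row` dataclass and the function names are hypothetical and are not part of agents/benchmark.py. Appending the ellipsis only when the output is actually truncated, and escaping pipe characters inside cells, are choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    """One line of the Evaluation Results table (hypothetical container)."""
    framework: str
    task: str
    output: str
    score: int
    feedback: str
    recommendations: str


def cell(text: str) -> str:
    """Collapse whitespace and escape pipes so the text fits in one table cell."""
    return " ".join(text.split()).replace("|", "\\|")


def preview(output: str, limit: int = 100) -> str:
    """Return the first `limit` characters, with '...' if anything was cut."""
    flat = " ".join(output.split())
    return flat[:limit] + "..." if len(flat) > limit else flat


def render_results(rows: List[Row]) -> str:
    lines = [
        "| Framework | Input | Output | Score | Feedback | Recommendations |",
        "|---|---|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {cell(r.framework)} | {cell(r.task)} | {cell(preview(r.output))} "
            f"| {r.score}/5 | {cell(r.feedback)} | {cell(r.recommendations)} |"
        )
    return "\n".join(lines)


def render_summary(winner: str, reasoning: str, judge_model: str, eval_ms: int) -> str:
    return "\n".join([
        "| Metric | Value |",
        "|---|---|",
        f"| Winner | {cell(winner)} |",
        f"| Reasoning | {cell(reasoning)} |",
        f"| Judge Model | {cell(judge_model)} |",
        f"| Eval Time | {eval_ms} |",
    ])
```

Collapsing newlines before truncating keeps multi-line agent output from breaking the table row.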
Scoring Scale
| Score | Meaning |
|---|---|
| 0 | Failed/error/refusal |
| 1 | Mostly wrong |
| 2 | Partial, missing key elements |
| 3 | Acceptable |
| 4 | Good, comprehensive |
| 5 | Excellent |
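The scale above can be mirrored as a small lookup when post-processing judge scores. This is a sketch; `SCORE_MEANINGS` and `describe_score` are hypothetical names, and rejecting anything outside the integers 0 to 5 is an assumption about how strict the check should be.

```python
# Scoring scale copied from the table above.
SCORE_MEANINGS = {
    0: "Failed/error/refusal",
    1: "Mostly wrong",
    2: "Partial, missing key elements",
    3: "Acceptable",
    4: "Good, comprehensive",
    5: "Excellent",
}


def describe_score(score: int) -> str:
    """Format a score as 'N/5 (meaning)', rejecting values outside the scale."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return f"{score}/5 ({SCORE_MEANINGS[score]})"
```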
CLI Alternative
```bash
cd ~/workspace/vibebrowser/VibeTeam
set -a && source .env && set +a
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
```
Predefined tasks: sentry-weekly-summary, github-issue-triage, release-notes
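If the CLI is driven from a script, the command can be assembled and checked before it runs. The sketch below only builds the argument list shown above; `build_benchmark_command` is a hypothetical helper, and treating autogen, crewai, and openhands as the only valid framework names is an assumption based on the names this document lists.

```python
import shlex
from typing import List, Sequence

# Task names listed under "Predefined tasks".
PREDEFINED_TASKS = ("sentry-weekly-summary", "github-issue-triage", "release-notes")
# Framework names used in the CLI example (assumed to be the full set).
KNOWN_FRAMEWORKS = ("autogen", "crewai", "openhands")


def build_benchmark_command(
    task: str, frameworks: Sequence[str] = KNOWN_FRAMEWORKS
) -> List[str]:
    """Return the argv list for `python -m agents.benchmark` after validating inputs."""
    if task not in PREDEFINED_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {PREDEFINED_TASKS}")
    unknown = [f for f in frameworks if f not in KNOWN_FRAMEWORKS]
    if unknown:
        raise ValueError(f"unknown frameworks: {unknown}")
    return ["python", "-m", "agents.benchmark",
            "--tasks", task,
            "--frameworks", *frameworks]


if __name__ == "__main__":
    # Print the shell form of the command for inspection.
    print(shlex.join(build_benchmark_command("github-issue-triage")))
```

The returned list can be passed to `subprocess.run(..., cwd=...)` with the repository directory as the working directory, which avoids shell quoting issues.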
Key Files
| File | Purpose |
|---|---|
| agents/benchmark.py | ComparativeEvaluator, QualityEvaluator |
| agents/benchmark.py:286 | QualityEvaluator class |
| agents/benchmark.py:459 | ComparativeEvaluator class |
Environment
Required before running:
```bash
source .env
```
Variables: AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT
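A quick pre-flight check can catch a missing variable before any agent runs. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and treating an empty value as missing is an assumption.

```python
import os
from typing import List, Mapping, Optional

# Variable names listed above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```

Note that a plain `source .env` makes the variables visible to a Python child process only if the file exports them; the `set -a && source .env && set +a` form in the CLI Alternative section exports them automatically.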