
researcher-evaluation

A practical playbook for the technical evaluation of GenAI agents, using agents/benchmark.py with the G-Eval methodology.

SKILL.md
---
name: researcher-evaluation
description: Technical playbook for GenAI agent evaluation using agents/benchmark.py with G-Eval methodology
license: MIT
compatibility: opencode
metadata:
  audience: developers
  workflow: evaluation
---

Researcher Evaluation Skill

Technical playbook for evaluating GenAI agents. OpenCode uses agents/benchmark.py directly.


Required Output Format

Evaluation Results

| Framework | Input | Output | Score | Feedback | Recommendations |
|---|---|---|---|---|---|
| AutoGen | {task} | {truncated}... | 4/5 | {feedback} | {improvements} |
| CrewAI | {task} | {truncated}... | 3/5 | {feedback} | {improvements} |
| OpenHands | {task} | {truncated}... | 5/5 | {feedback} | {improvements} |

Summary

| Metric | Value |
|---|---|
| Winner | {framework} |
| Reasoning | {why} |
| Judge Model | {model} |
| Eval Time | {ms} |
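
The output format above can be produced mechanically from per-framework results. The following is a minimal sketch, not the code in agents/benchmark.py: the `EvalRow` container, the `truncate` helper, the 40-character limit, and the tie-breaking rule (first row wins) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    """One row of the Evaluation Results table (hypothetical container)."""
    framework: str
    task: str
    output: str
    score: int  # 0-5, per the scoring scale
    feedback: str
    recommendations: str


def truncate(text: str, limit: int = 40) -> str:
    """Shorten long outputs and mark the cut with '...' (limit is an assumption)."""
    return text if len(text) <= limit else text[:limit] + "..."


def render_report(rows: list[EvalRow], judge_model: str, eval_ms: int, reasoning: str) -> str:
    """Render the results rows and the summary as pipe-separated text."""
    # max() returns the first maximal element, so the earliest row wins ties.
    winner = max(rows, key=lambda r: r.score)
    lines = ["Framework | Input | Output | Score | Feedback | Recommendations"]
    for r in rows:
        lines.append(
            f"{r.framework} | {r.task} | {truncate(r.output)} | "
            f"{r.score}/5 | {r.feedback} | {r.recommendations}"
        )
    lines += [
        "",
        "Metric | Value",
        f"Winner | {winner.framework}",
        f"Reasoning | {reasoning}",
        f"Judge Model | {judge_model}",
        f"Eval Time | {eval_ms} ms",
    ]
    return "\n".join(lines)
```

The reasoning string is passed in rather than generated, since explaining why a framework won is the judge model's job, not the formatter's.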

Scoring Scale

| Score | Meaning |
|---|---|
| 0 | Failed/error |
| 1 | Mostly wrong |
| 2 | Partial |
| 3 | Acceptable |
| 4 | Good |
| 5 | Excellent |
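
When a judge model returns its score as text, it helps to validate the value against this scale before recording it. A small sketch, assuming the judge replies with something like `4` or `4/5`; the `Score` enum and `parse_score` helper are hypothetical names, not part of agents/benchmark.py.

```python
from enum import IntEnum


class Score(IntEnum):
    """The 0-5 scoring scale, with names taken from the table above."""
    FAILED = 0
    MOSTLY_WRONG = 1
    PARTIAL = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5


def parse_score(raw: str) -> Score:
    """Parse '4', ' 4 ', or '4/5' into a Score.

    Raises ValueError for non-numeric input or values outside 0-5.
    """
    value = int(raw.strip().split("/")[0])
    return Score(value)
```

Rejecting out-of-range values early keeps a malformed judge reply from silently skewing a comparison.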

Evaluation Dimensions

| Dimension | Description |
|---|---|
| Accuracy | Facts correct, no hallucinations |
| Completeness | All sub-tasks addressed |
| Actionability | Concrete next steps |
| Clarity | Well-structured |
| Relevance | On topic |
| Efficiency | Concise |
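
This playbook does not say how the six dimensions combine into one score. The sketch below assumes the simplest option, an unweighted mean of per-dimension scores on the 0-5 scale; both the aggregation rule and the `overall_score` name are illustrative assumptions.

```python
DIMENSIONS = (
    "accuracy",
    "completeness",
    "actionability",
    "clarity",
    "relevance",
    "efficiency",
)


def overall_score(scores: dict[str, int]) -> float:
    """Average the six dimension scores (unweighted mean is an assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 5:
            raise ValueError(f"{d} score {scores[d]} is outside 0-5")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

If some dimensions matter more for a given task (accuracy for triage, say), a weighted mean would be the natural extension.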

Methodology (G-Eval)

Chain-of-Thought evaluation steps:

  1. Check factual accuracy
  2. Verify task completion
  3. Assess actionability
  4. Check for hallucinations
  5. Evaluate conciseness

The steps yield a single score from 0 to 5.
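
The steps above can be turned into a judge prompt. In the published G-Eval method, the final score is the probability-weighted sum of the candidate scores, taken over the judge's output-token probabilities. The sketch below shows both pieces under stated assumptions: the prompt wording and the function names are hypothetical, the normalization by total probability mass is an addition for robustness, and whether agents/benchmark.py applies probability weighting at all is not confirmed by this playbook.

```python
EVAL_STEPS = [
    "Check factual accuracy",
    "Verify task completion",
    "Assess actionability",
    "Check for hallucinations",
    "Evaluate conciseness",
]


def build_judge_prompt(task: str, output: str) -> str:
    """Assemble a chain-of-thought judge prompt (wording is illustrative)."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(EVAL_STEPS, start=1))
    return (
        "You are evaluating an agent's response.\n\n"
        f"Task:\n{task}\n\n"
        f"Response:\n{output}\n\n"
        f"Evaluation steps:\n{steps}\n\n"
        "Work through the steps in order, then answer with a single "
        "integer score from 0 to 5."
    )


def weighted_score(score_probs: dict[int, float]) -> float:
    """Probability-weighted score: sum(p(s) * s) / sum(p(s)).

    score_probs maps each candidate score (0-5) to the probability the
    judge assigned to that score token. Dividing by the total mass is an
    assumption that handles probabilities which do not sum to 1.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    return sum(s * p for s, p in score_probs.items()) / total
```

Weighting by probability gives fractional scores, which separates frameworks that would otherwise tie on an integer scale.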

CLI

```bash
python -m agents.benchmark --tasks github-issue-triage --frameworks autogen crewai openhands
```
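
To launch the same command from a script, the argument list can be built programmatically. This sketch uses only the `--tasks` and `--frameworks` flags shown above; the `benchmark_command` helper is hypothetical, and passing more than one task name is an assumption, since the example shows a single task.

```python
import sys


def benchmark_command(tasks: list[str], frameworks: list[str]) -> list[str]:
    """Build the argv for `python -m agents.benchmark` (hypothetical helper)."""
    if not tasks or not frameworks:
        raise ValueError("at least one task and one framework are required")
    return [
        sys.executable, "-m", "agents.benchmark",
        "--tasks", *tasks,
        "--frameworks", *frameworks,
    ]


# Example invocation, to be run from the repository root:
#   import subprocess
#   cmd = benchmark_command(["github-issue-triage"], ["autogen", "crewai", "openhands"])
#   subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the benchmark in the same interpreter and virtual environment as the calling script.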

Key Files

| File | Line | Class/Function |
|---|---|---|
| agents/benchmark.py | 286 | QualityEvaluator |
| agents/benchmark.py | 459 | ComparativeEvaluator |
| agents/benchmark.py | 694 | Benchmark |