Agent Evaluation Skill
Evaluate agent performance using a structured scoring rubric.
Trigger Conditions
- •Agent configuration change
- •Evaluation cadence (monthly)
- •User invokes with "evaluate agents" or "agent scorecard"
Input Contract
- •Required: Agent(s) to evaluate
- •Required: Evaluation criteria or rubric
- •Optional: Baseline scores from prior evaluation
Output Contract
- •Evaluation scorecard (500-point rubric)
- •Per-dimension scores and findings
- •Improvement recommendations
- •Comparison against baseline
Tool Permissions
- •Read: Agent configs, agent output logs, telemetry
- •Write: Evaluation reports
- •Search: Agent invocation history
Execution Steps
- •Load evaluation rubric (architecture 100, security 100, ops 100, testing 100, docs 100)
- •For each agent, review recent outputs and effectiveness
- •Score each dimension with evidence
- •Compare against baseline scores
- •Identify improvement opportunities
- •Generate scorecard and recommendations
Success Criteria
- •All dimensions scored with evidence
- •Comparison against prior evaluation
- •Top 3 improvement recommendations per agent
- •Overall portfolio health assessment
Escalation Rules
- •Escalate if any agent scores below 50% on any dimension
- •Escalate if agent evaluation reveals conflicting outputs between agents
Example Invocations
Input: "Evaluate the security-specialist agent effectiveness"
Output: Scorecard: Security (85/100), Architecture (70/100), Ops (75/100), Testing (60/100), Docs (80/100). Total: 370/500. Findings: strong CVE detection but weak test coverage recommendations, documentation quality high but missing escalation follow-through. Top improvement: integrate with testing-specialist for security test gap analysis.