AgentSkillsCN

skill-judge

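Evaluates the quality of agent skills against the official specification. Use it when you need to review a SKILL.md file, audit a skill package, improve a skill's design, or check whether a skill follows best practices. It scores across 8 dimensions (120 points total) and offers actionable improvement suggestions. Triggers: review skill, evaluate skill, audit skill, improve skill, skill quality, SKILL.md review.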

SKILL.md
--- frontmatter
name: skill-judge
model: standard
description: Evaluate Agent Skill quality against official specifications. Use when reviewing SKILL.md files, auditing skill packages, improving skill design, or checking if a skill follows best practices. Provides 8-dimension scoring (120 points) with actionable improvements. Triggers on review skill, evaluate skill, audit skill, improve skill, skill quality, SKILL.md review.

Skill Judge

Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.

WHAT This Skill Does

Scores skills across 8 dimensions (120 points total) and provides specific, actionable improvement suggestions.

WHEN To Use

  • Reviewing/auditing a SKILL.md file
  • Improving an existing skill's design
  • Checking if a skill follows best practices
  • Before publishing a skill to the ecosystem

KEYWORDS: review skill, evaluate skill, audit skill, skill quality, SKILL.md

Installation

OpenClaw / Moltbot / Clawbot

bash
npx clawhub@latest install skill-judge

Core Philosophy

The Core Formula

Good Skill = Expert-only Knowledge − What Claude Already Knows

A Skill's value = its knowledge delta — the gap between what it provides and what the model already knows.

| Type | Definition | Treatment |
|------|------------|-----------|
| Expert | Claude genuinely doesn't know this | Must keep: this is the Skill's value |
| Activation | Claude knows but may not think of | Keep if brief: serves as reminder |
| Redundant | Claude definitely knows this | Delete: wastes tokens |

Good Skill ratio: >70% Expert, <20% Activation, <10% Redundant


Evaluation Dimensions (120 points)

D1: Knowledge Delta (20 pts) — THE CORE DIMENSION

Does the Skill add genuine expert knowledge?

| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (tutorials, standard library usage) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta: every paragraph earns its tokens |

Red flags (instant ≤5): "What is [basic concept]", step-by-step tutorials, generic best practices

Green flags (high delta): Decision trees, non-obvious trade-offs, edge cases from experience, "NEVER do X because [non-obvious reason]"


D2: Mindset + Procedures (15 pts)

Does the Skill transfer expert thinking patterns AND domain-specific procedures?

| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |

Valuable thinking patterns: "Before [action], ask yourself: Purpose? Constraints? Differentiation?"

Valuable procedures: Domain-specific sequences, non-obvious ordering, critical steps easy to miss

Redundant procedures: Generic file operations, standard programming patterns


D3: Anti-Pattern Quality (15 pts)

Does the Skill have effective NEVER lists?

| Score | Criteria |
|-------|----------|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Expert-grade anti-patterns with WHY: things only experience teaches |

Test: Would an expert read the anti-pattern list and say "yes, I learned this the hard way"?


D4: Specification Compliance — Especially Description (15 pts)

The description is THE MOST IMPORTANT field. It's the only thing the agent sees before deciding to load the skill.

| Score | Criteria |
|-------|----------|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but description is vague or incomplete |
| 11-13 | Valid frontmatter, description has WHAT but weak on WHEN |
| 14-15 | Perfect: comprehensive description with WHAT, WHEN, and trigger keywords |

Description must answer:

  1. WHAT: What does this Skill do?
  2. WHEN: In what situations should it be used?
  3. KEYWORDS: What terms should trigger this Skill?

Poor: "Helps with document tasks"
Good: "Create, edit, and analyze .docx files. Use when working with Word documents, tracked changes, or professional document formatting."
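The WHAT / WHEN / KEYWORDS test can be approximated mechanically before a human reads the description. The sketch below is a rough heuristic, not part of any official tooling: `check_description`, the 8-word vagueness threshold, and the cue phrases are all assumptions made for illustration.

```python
import re

def check_description(description: str, expected_keywords: list[str]) -> dict:
    """Heuristic pre-check of a skill description (hypothetical helper)."""
    text = description.lower()
    return {
        # Assumed threshold: very short descriptions rarely state WHAT and WHEN.
        "too_vague": len(text.split()) < 8,
        # Assumed cue phrases signalling a WHEN clause.
        "has_when": bool(re.search(r"\bwhen\b|\btriggers on\b", text)),
        # Trigger terms the reviewer expects but the description never mentions.
        "missing_keywords": [k for k in expected_keywords if k.lower() not in text],
    }

poor = "Helps with document tasks"
good = ("Create, edit, and analyze .docx files. Use when working with Word "
        "documents, tracked changes, or professional document formatting.")

print(check_description(poor, [".docx", "tracked changes"]))
print(check_description(good, [".docx", "tracked changes"]))
```

A passing result only means the description is worth scoring; the 0-15 score itself still needs judgment.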


D5: Progressive Disclosure (15 pts)

Does the Skill implement proper content layering?

| Layer | Content | Size |
|-------|---------|------|
| 1: Metadata | name + description | ~100 tokens |
| 2: SKILL.md | Guidelines, decision trees | < 500 lines ideal |
| 3: Resources | scripts/, references/, assets/ | No limit |

| Score | Criteria |
|-------|----------|
| 0-5 | Everything dumped in SKILL.md (>500 lines, no structure) |
| 6-10 | Has references but unclear when to load them |
| 11-13 | Good layering with MANDATORY triggers present |
| 14-15 | Perfect: decision trees + explicit triggers + "Do NOT Load" guidance |

Good trigger: "MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read docx-js.md"

Bad trigger: Just listing references at the end without loading guidance
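One way to spot the bad-trigger case is to look for reference files that are mentioned but never appear on a line with a loading cue. This is a minimal sketch under stated assumptions: reference files are taken to end in `.md`, the cue list is invented for illustration, and `orphan_references` is a hypothetical helper name.

```python
import re

# Assumed loading cues; a real skill may phrase its triggers differently.
LOAD_CUES = ("mandatory", "must read", "load ")

def orphan_references(skill_md: str) -> list[str]:
    """Return .md files that are mentioned but never paired with a loading cue."""
    mentioned, triggered = set(), set()
    for line in skill_md.splitlines():
        files = re.findall(r"[\w./-]+\.md\b", line)
        mentioned.update(files)
        if any(cue in line.lower() for cue in LOAD_CUES):
            triggered.update(files)
    return sorted(mentioned - triggered)

sample = (
    "MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read docx-js.md\n"
    "\n"
    "References:\n"
    "- ooxml.md\n"
)
print(orphan_references(sample))
```

Files returned by the check are candidates for the 6-10 band ("has references but unclear when to load them"); the reviewer still decides whether the surrounding text gives enough guidance.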


D6: Freedom Calibration (15 pts)

Is specificity appropriate for the task's fragility?

| Task Type | Should Have | Why |
|-----------|-------------|-----|
| Creative/Design | High freedom | Multiple valid approaches |
| Code review | Medium freedom | Principles exist but judgment required |
| File format operations | Low freedom | One wrong byte corrupts file |

| Score | Criteria |
|-------|----------|
| 0-5 | Severely mismatched (rigid scripts for creative, vague for fragile) |
| 6-10 | Partially appropriate |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |

Test: "If the Agent makes a mistake, what's the consequence?" High consequence → low freedom


D7: Pattern Recognition (10 pts)

Does the Skill follow an established pattern?

| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct scenarios (routes to sub-files) |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on specific formats |

| Score | Criteria |
|-------|----------|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern with significant deviations |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of appropriate pattern |

D8: Practical Usability (15 pts)

Can an Agent actually use this Skill effectively?

| Score | Criteria |
|-------|----------|
| 0-5 | Confusing, incomplete, or untested guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive: edge cases, error handling, decision trees |

Check for: Decision trees for multi-path scenarios, working code examples, error handling/fallbacks, edge cases covered


NEVER Do When Evaluating

  • Give high scores just because it "looks professional"
  • Ignore token waste — every redundant paragraph = deduction
  • Let length impress you: a 43-line Skill can outperform a 500-line Skill
  • Skip mentally testing the decision trees
  • Excuse explanations of basics as "helpful context"
  • Overlook missing anti-patterns
  • Undervalue the description field — poor description = skill never gets used
  • Put "when to use" info only in the body (agent only sees description before loading)

Evaluation Protocol

Step 1: Knowledge Delta Scan

Read SKILL.md and mark each section:

  • [E] Expert: Claude doesn't know this — value-add
  • [A] Activation: Claude knows but reminder useful — acceptable
  • [R] Redundant: Claude knows this — should delete

Calculate ratio: E:A:R as percentages of the content (target: E >70%, A <20%, R <10%)
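Once each section carries an [E], [A], or [R] tag, the ratio is simple arithmetic. The sketch below counts sections, which is an assumption: the text does not say whether the ratio is measured in sections, lines, or tokens. The function names are hypothetical.

```python
from collections import Counter

def knowledge_ratio(tags: list[str]) -> dict[str, float]:
    """Return the share of E, A, and R tags as percentages."""
    if not tags:
        raise ValueError("no sections tagged")
    counts = Counter(tags)
    return {tag: 100 * counts[tag] / len(tags) for tag in ("E", "A", "R")}

def meets_target(ratio: dict[str, float]) -> bool:
    """Targets from the text: >70% Expert, <20% Activation, <10% Redundant."""
    return ratio["E"] > 70 and ratio["A"] < 20 and ratio["R"] < 10

# Hypothetical tagging of a 10-section skill.
tags = ["E"] * 8 + ["A", "R"]
ratio = knowledge_ratio(tags)
print(ratio, meets_target(ratio))
```

Weighting each section by its line or token count would be a natural refinement, since one long redundant section costs more than one short one.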

Step 2: Structure Analysis

code
[ ] Valid frontmatter (name ≤64 chars, comprehensive description)
[ ] Total lines in SKILL.md
[ ] Reference files and sizes
[ ] Pattern identification (Mindset/Navigation/Philosophy/Process/Tool)
[ ] Loading triggers present (if references exist)
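Most of this checklist can be gathered mechanically from the raw file. The following is a minimal sketch, assuming the frontmatter is a block of simple `key: value` lines between two `---` lines and that the word MANDATORY marks a loading trigger; `analyze_structure` is a hypothetical helper, and pattern identification is left to the reviewer.

```python
import re

def analyze_structure(skill_md: str) -> dict:
    """Collect the Step 2 facts from the raw text of a SKILL.md file."""
    lines = skill_md.splitlines()
    frontmatter = {}
    # Assumption: frontmatter is plain `key: value` lines, not full YAML.
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, sep, value = line.partition(":")
            if sep:
                frontmatter[key.strip()] = value.strip()
    name = frontmatter.get("name", "")
    # Paths under the three resource directories named in the layering table.
    found = re.findall(r"\b(?:references|scripts|assets)/[\w./-]+", skill_md)
    return {
        "valid_name": 0 < len(name) <= 64,
        "has_description": bool(frontmatter.get("description")),
        "total_lines": len(lines),
        "references": sorted({path.rstrip(".") for path in found}),
        # Assumption: the MANDATORY keyword is how triggers are written.
        "has_loading_trigger": "MANDATORY" in skill_md,
    }

sample = (
    "---\n"
    "name: demo\n"
    "description: Does X. Use when Y.\n"
    "---\n"
    "MANDATORY - READ ENTIRE FILE: references/api.md\n"
)
print(analyze_structure(sample))
```

The sketch reports which reference paths are mentioned but not their sizes; reading sizes would require access to the skill's directory on disk.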

Step 3: Score Each Dimension

For each dimension: find evidence, assign score, note improvements if < max

Step 4: Calculate Total & Grade

| Grade | Percentage | Meaning |
|-------|------------|---------|
| A | 90%+ (108+) | Excellent: production-ready |
| B | 80-89% (96-107) | Good: minor improvements needed |
| C | 70-79% (84-95) | Adequate: clear improvement path |
| D | 60-69% (72-83) | Below Average: significant issues |
| F | <60% (<72) | Poor: needs fundamental redesign |
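The grade boundaries can be applied as a small lookup. This sketch uses integer comparisons so that boundary totals such as 108 or 96 are not affected by floating-point rounding; the function name `grade` is an illustrative choice.

```python
def grade(total: int, max_total: int = 120) -> tuple[float, str]:
    """Map a total score to (percentage, letter grade) using the table's cutoffs."""
    if not 0 <= total <= max_total:
        raise ValueError(f"total must be between 0 and {max_total}")
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        # Equivalent to percentage >= cutoff, without float comparison.
        if total * 100 >= cutoff * max_total:
            return 100 * total / max_total, letter
    return 100 * total / max_total, "F"

for total in (108, 107, 96, 84, 72, 71):
    print(total, grade(total))
```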

Step 5: Generate Report

markdown
# Skill Evaluation Report: [Skill Name]

## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence]

## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset + Procedures | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |

## Critical Issues
[Must-fix problems]

## Top 3 Improvements
1. [Highest impact with specific guidance]
2. [Second priority]
3. [Third priority]
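The score line and dimension table of this report can be filled in from a dictionary of per-dimension scores. This is a sketch of one possible helper, not a prescribed tool: `render_scores` is a hypothetical name, and the maxima are copied from the dimension headings.

```python
# Dimension names and maxima as listed in the evaluation dimensions.
DIMENSIONS = [
    ("D1: Knowledge Delta", 20),
    ("D2: Mindset + Procedures", 15),
    ("D3: Anti-Pattern Quality", 15),
    ("D4: Specification Compliance", 15),
    ("D5: Progressive Disclosure", 15),
    ("D6: Freedom Calibration", 15),
    ("D7: Pattern Recognition", 10),
    ("D8: Practical Usability", 15),
]

def render_scores(scores: dict[str, int]) -> str:
    """Render the total-score line and the dimension table as markdown."""
    rows = ["| Dimension | Score | Max | Notes |",
            "|-----------|-------|-----|-------|"]
    total = 0
    for name, maximum in DIMENSIONS:
        score = scores[name]
        if not 0 <= score <= maximum:
            raise ValueError(f"{name}: score {score} outside 0-{maximum}")
        total += score
        rows.append(f"| {name} | {score} | {maximum} | |")
    max_total = sum(maximum for _, maximum in DIMENSIONS)
    summary = f"- **Total Score**: {total}/{max_total} ({100 * total / max_total:.0f}%)"
    return "\n".join([summary, "", *rows])

# Hypothetical scores: every dimension at its maximum.
print(render_scores({name: maximum for name, maximum in DIMENSIONS}))
```

Range-checking each score against its own maximum catches a common slip, such as entering 15 for D7, which is capped at 10.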

Common Failure Patterns

| Pattern | Symptom | Fix |
|---------|---------|-----|
| Tutorial | Explains what X is, basic library usage | Delete basics. Focus on expert decisions. |
| Dump | 800+ lines, everything included | Core in SKILL.md (<300), details in references/ |
| Orphan References | References exist but never loaded | Add "MANDATORY - READ" at decision points |
| Checkbox Procedure | Step 1, Step 2... mechanical | Transform to "Before doing X, ask yourself..." |
| Vague Warning | "Be careful", "avoid errors" | Specific NEVER list with concrete examples |
| Invisible Skill | Great content, rarely activated | Fix description: WHAT + WHEN + KEYWORDS |
| Wrong Location | "When to use" in body, not description | Move triggers to description field |
| Over-Engineered | README, CHANGELOG, CONTRIBUTING | Delete. Only what Agent needs for the task. |

The Meta-Question

"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"

If yes → genuine value. If no → compressing what Claude already knows.

The best Skills are compressed expert brains — 10 years of experience in 50 lines.