
eval-harness

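A structured evaluation framework that measures code quality across multiple dimensions, providing repeatable, evidence-based scores and ratings.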

SKILL.md
---
name: eval-harness
description: |
  Structured evaluation framework for measuring code quality across
  multiple dimensions. Provides repeatable, scored assessments with
  evidence-based ratings.
license: MIT
compatibility: Claude Code 2.1+
metadata:
  author: peopleforrester
  version: "1.0.0"
  tags: [quality, evaluation, metrics, verification]
---

# Evaluation Harness

Structured framework for evaluating code quality with repeatable, evidence-based scoring.

## Evaluation Dimensions

### Default Rubric

| Dimension | Weight | Description |
|-----------|--------|-------------|
| Correctness | 30% | Tests pass, logic correct, edge cases handled |
| Security | 20% | OWASP compliance, input validation, secrets |
| Performance | 15% | Efficient queries, algorithms, caching |
| Maintainability | 15% | Readability, modularity, naming |
| Testing | 10% | Coverage, test quality, isolation |
| Documentation | 10% | Accuracy, completeness, examples |
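
The six weights are meant to total 100%. A minimal sketch of how a tool might hold the default rubric and check that property is shown here; the names `DEFAULT_RUBRIC` and `validate_rubric` are illustrative and not part of the skill itself.

```python
# Default rubric weights in whole percent, copied from the rubric table.
DEFAULT_RUBRIC = {
    "Correctness": 30,
    "Security": 20,
    "Performance": 15,
    "Maintainability": 15,
    "Testing": 10,
    "Documentation": 10,
}


def validate_rubric(rubric: dict[str, int]) -> None:
    """Raise ValueError unless every weight is positive and they total 100."""
    for name, weight in rubric.items():
        if weight <= 0:
            raise ValueError(f"{name}: weight must be positive, got {weight}")
    total = sum(rubric.values())
    if total != 100:
        raise ValueError(f"weights total {total}%, expected 100%")


validate_rubric(DEFAULT_RUBRIC)  # passes silently for the default rubric
```

Keeping the weights as whole percentages avoids floating-point drift when they are summed.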

### Scoring Scale

| Score | Label | Criteria |
|-------|-------|----------|
| 5 | Excellent | Exceeds standards, exemplary |
| 4 | Good | Meets all standards |
| 3 | Acceptable | Meets minimum, room for improvement |
| 2 | Below Standard | Missing key requirements |
| 1 | Poor | Significant issues, needs rework |
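
For tooling that prints reports, the scale can be kept as a lookup table. This is a small sketch under the assumption that scores are whole numbers from 1 to 5; `SCORE_LABELS` and `label_for` are hypothetical names.

```python
# Score-to-label mapping taken from the scoring scale.
SCORE_LABELS = {
    5: "Excellent",
    4: "Good",
    3: "Acceptable",
    2: "Below Standard",
    1: "Poor",
}


def label_for(score: int) -> str:
    """Return the label for a 1-5 score, rejecting anything off the scale."""
    if score not in SCORE_LABELS:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return SCORE_LABELS[score]
```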

## Evaluation Process

### 1. Evidence Collection

For each dimension, gather concrete evidence:

```markdown
#### Correctness Evidence
- Tests: 142 passing, 0 failing
- Edge cases: null handling verified in auth module
- Regression: no known regressions
- Score: 4/5 (missing boundary value tests for pagination)
```
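
If evidence is gathered by a script rather than by hand, a record type can render the same layout. The following is a sketch only; the `DimensionEvidence` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionEvidence:
    """Evidence gathered for one rubric dimension (hypothetical structure)."""

    dimension: str
    score: int  # 1-5, per the scoring scale
    findings: list[str] = field(default_factory=list)
    note: str = ""  # optional justification appended to the score line

    def to_markdown(self) -> str:
        """Render the evidence in the bullet layout used in the example."""
        lines = [f"#### {self.dimension} Evidence"]
        lines += [f"- {item}" for item in self.findings]
        suffix = f" ({self.note})" if self.note else ""
        lines.append(f"- Score: {self.score}/5{suffix}")
        return "\n".join(lines)


correctness = DimensionEvidence(
    dimension="Correctness",
    score=4,
    findings=[
        "Tests: 142 passing, 0 failing",
        "Edge cases: null handling verified in auth module",
        "Regression: no known regressions",
    ],
    note="missing boundary value tests for pagination",
)
print(correctness.to_markdown())
```

Run as written, this prints the Correctness evidence block from the example.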

### 2. Scoring

Apply scores with justification:

```markdown
| Dimension | Score | Evidence |
|-----------|-------|----------|
| Correctness | 4/5 | Tests pass, missing pagination edge cases |
| Security | 3/5 | Input validation present, missing rate limiting |
| Performance | 4/5 | Queries optimized, indexes present |
| Maintainability | 5/5 | Clean architecture, clear naming |
| Testing | 3/5 | 72% coverage, below 80% target |
| Documentation | 4/5 | API docs current, missing setup guide update |
```

### 3. Weighted Score Calculation

```
Total = (4×0.30) + (3×0.20) + (4×0.15) + (5×0.15) + (3×0.10) + (4×0.10)
      = 1.20 + 0.60 + 0.60 + 0.75 + 0.30 + 0.40
      = 3.85 / 5.0
```
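
The same arithmetic can be sketched in Python. The function name `weighted_total` is an assumption for illustration; the weights and scores are the ones from the default rubric and the scoring example.

```python
# Weights in whole percent (default rubric) and the example scores.
WEIGHTS = {
    "Correctness": 30,
    "Security": 20,
    "Performance": 15,
    "Maintainability": 15,
    "Testing": 10,
    "Documentation": 10,
}
SCORES = {
    "Correctness": 4,
    "Security": 3,
    "Performance": 4,
    "Maintainability": 5,
    "Testing": 3,
    "Documentation": 4,
}


def weighted_total(scores: dict[str, int], weights: dict[str, int]) -> float:
    """Return the weighted score on the 1-5 scale.

    The sum is computed in integer percent points and divided once at the
    end, so no rounding error accumulates across terms.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    return sum(scores[name] * weights[name] for name in weights) / 100


total = weighted_total(SCORES, WEIGHTS)
print(f"{total:.2f} / 5.0")  # prints "3.85 / 5.0"
```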

### 4. Verdict

| Range | Verdict | Action |
|-------|---------|--------|
| 4.0-5.0 | SHIP IT | Ready for production |
| 3.0 to < 4.0 | IMPROVE | Address findings before shipping |
| < 3.0 | REWORK | Significant issues need resolution |
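
A verdict lookup might look like the sketch below. It treats 4.0 and 3.0 as inclusive lower bounds, so a total such as 3.95 maps to IMPROVE; that boundary handling, and the function name `verdict`, are assumptions rather than something the skill specifies.

```python
def verdict(total: float) -> tuple[str, str]:
    """Map a weighted total (1.0-5.0) to a (verdict, action) pair."""
    if not 1.0 <= total <= 5.0:
        raise ValueError(f"total must be between 1.0 and 5.0, got {total}")
    if total >= 4.0:
        return ("SHIP IT", "Ready for production")
    if total >= 3.0:
        return ("IMPROVE", "Address findings before shipping")
    return ("REWORK", "Significant issues need resolution")


print(verdict(3.85))  # prints ('IMPROVE', 'Address findings before shipping')
```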

## Custom Rubrics

Define project-specific rubrics:

```yaml
rubric:
  - name: API Design
    weight: 25%
    checks:
      - RESTful conventions followed
      - Consistent error response format
      - Pagination on list endpoints
  - name: Database
    weight: 25%
    checks:
      - Migrations are reversible
      - Indexes cover common queries
      - No N+1 query patterns
```
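
The two dimensions in this fragment account for 50% of the weight; the source does not say how the remainder is assigned. As a hedged sketch, a loader could report how much weight a custom rubric leaves unallocated. The list below mirrors the YAML by hand rather than parsing it, and `parse_weight` and `unallocated_weight` are hypothetical helpers.

```python
# Hand-written mirror of the YAML fragment (no YAML parser is used here).
CUSTOM_RUBRIC = [
    {
        "name": "API Design",
        "weight": "25%",
        "checks": [
            "RESTful conventions followed",
            "Consistent error response format",
            "Pagination on list endpoints",
        ],
    },
    {
        "name": "Database",
        "weight": "25%",
        "checks": [
            "Migrations are reversible",
            "Indexes cover common queries",
            "No N+1 query patterns",
        ],
    },
]


def parse_weight(text: str) -> int:
    """Convert a weight string such as '25%' to the integer 25."""
    if not text.endswith("%"):
        raise ValueError(f"weight must end with '%', got {text!r}")
    return int(text[:-1])


def unallocated_weight(rubric: list[dict]) -> int:
    """Return the percentage points not yet assigned to any dimension."""
    total = sum(parse_weight(entry["weight"]) for entry in rubric)
    if total > 100:
        raise ValueError(f"weights total {total}%, which exceeds 100%")
    return 100 - total


print(unallocated_weight(CUSTOM_RUBRIC))  # prints 50
```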

## Integration

Use with:

- `/eval` command for on-demand evaluation
- `/verify` command as part of pre-PR checks
- CI pipeline as automated quality gate
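
How the quality gate is wired into CI is not described here. One possible shape, offered purely as an assumption, is a small script that takes the weighted total as an argument and returns a non-zero exit code when it falls below the SHIP IT threshold. The `main` function and the 4.0 default are illustrative choices.

```python
def main(argv: list[str], threshold: float = 4.0) -> int:
    """Return 0 if the weighted total meets the threshold, otherwise 1.

    argv is expected to hold one item: the weighted total as text.
    """
    if len(argv) != 1:
        print("usage: quality_gate <weighted-total>")
        return 2
    total = float(argv[0])
    if total >= threshold:
        print(f"PASS: {total:.2f} >= {threshold:.2f}")
        return 0
    print(f"FAIL: {total:.2f} < {threshold:.2f}")
    return 1
```

A CI step would call this with `sys.exit(main(sys.argv[1:]))` so that a failing score stops the pipeline.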