Eval Harness Skill

Overview

Systematic evaluation framework for verification loops with:

•Multiple grader types
•Pass@k metrics
•Quality gates
•Regression detection

Grader Types

1. Exact Match Grader

yaml

grader: exact_match
config:
  case_sensitive: false
  trim_whitespace: true
  
example:
  expected: "Hello World"
  actual: "hello world"
  result: PASS (case insensitive)

2. Contains Grader

yaml

grader: contains
config:
  required: ["function", "return", "export"]
  forbidden: ["console.log", "debugger", "any"]
  
example:
  code: "export function add(a, b) { return a + b; }"
  result: PASS (has required, no forbidden)

3. AST Grader

yaml

grader: ast
config:
  language: typescript
  rules:
    - no_any_types
    - has_return_type
    - max_complexity: 10
    
example:
  code: "function add(a: number, b: number): number { return a + b; }"
  result: PASS

4. Test Runner Grader

yaml

grader: test_runner
config:
  command: "npm test"
  coverage_threshold: 80
  timeout: 60s
  
example:
  tests_passed: 45/45
  coverage: 85%
  result: PASS

5. LLM Grader

yaml

grader: llm
config:
  model: claude-sonnet
  criteria:
    - code_quality
    - follows_requirements
    - no_security_issues
  threshold: 0.8
  
example:
  scores: [0.9, 0.85, 0.95]
  average: 0.9
  result: PASS

Pass@k Evaluation

Generate k samples and check if any pass:

yaml

pass_at_k:
  k: 3
  strategy: best_of_k
  
  samples:
    - attempt_1: FAIL (test error)
    - attempt_2: PASS
    - attempt_3: PASS
  
  result: PASS (2/3 passed)
  pass@1: 0.67
  pass@3: 1.0

Quality Gates

Gate Configuration

yaml

quality_gates:
  pre_commit:
    graders: [contains, test_runner]
    threshold: all_pass
    blocking: true
    
  pre_review:
    graders: [ast, llm]
    threshold: 0.8
    blocking: false
    
  pre_release:
    graders: [test_runner, security_scan]
    threshold: all_pass
    blocking: true

Gate Execution

bash

/verify --gate=pre_commit

Results:
  ✅ contains: PASS (no forbidden patterns)
  ✅ test_runner: PASS (100% tests, 85% coverage)
  
Gate Status: PASS

Checkpoint Evaluation

Compare against saved checkpoints:

yaml

checkpoint_eval:
  baseline: checkpoint-001
  current: HEAD
  
  metrics:
    tests_delta: +5 (45 → 50)
    coverage_delta: +3% (82% → 85%)
    complexity_delta: -2 (avg 8 → 6)
    
  regression_check:
    new_failures: 0
    removed_tests: 0
    
  result: IMPROVEMENT

Metrics Dashboard

code

📊 Evaluation Summary

Session: 2026-01-28
Evaluations: 12
Pass Rate: 83%

By Grader:
  exact_match:  10/10 (100%)
  contains:     11/12 (92%)
  test_runner:  9/12 (75%)
  ast:          10/12 (83%)

Trends:
  ↑ Coverage: 78% → 85%
  ↓ Complexity: 12 → 8
  → Test count: 45 stable

Integration

With verification-loop

yaml

verification_loop:
  on_execute:
    - run: eval-harness
    - graders: [test_runner, contains]
    
  on_checkpoint:
    - run: eval-harness
    - graders: [ast, llm]
    - save_metrics: true
    
  on_final:
    - run: eval-harness
    - graders: all
    - gate: pre_release

With continuous-learning-v2

yaml

learning_from_evals:
  on_repeated_failure:
    - capture_instinct: true
    - pattern: "failure reason"
    - solution: "how it was fixed"
    
  on_consistent_pass:
    - boost_confidence: true
    - related_instincts: [tag match]

Custom Graders

Create project-specific graders:

javascript

// .ai_state/graders/api-response.js
module.exports = {
  name: 'api_response',
  evaluate: async (code) => {
    const hasErrorHandling = /catch|try/.test(code);
    const hasTypedResponse = /Response</.test(code);
    
    return {
      pass: hasErrorHandling && hasTypedResponse,
      score: (hasErrorHandling ? 0.5 : 0) + (hasTypedResponse ? 0.5 : 0),
      details: { hasErrorHandling, hasTypedResponse }
    };
  }
};

Anti-Patterns

❌ Single grader for all checks ✅ Appropriate grader per concern

❌ Binary pass/fail only ✅ Scored results with thresholds

❌ Skip evals on "simple" changes ✅ Consistent evaluation always

❌ Ignore eval trends ✅ Track and act on regressions