AgentSkillsCN

llm-evaluation-designer

在为大语言模型应用设计评估套件时使用。建议在部署前后使用。该技能可生成评估框架、基准选择、测试用例设计,以及质量阈值设定。

SKILL.md
--- frontmatter
name: llm-evaluation-designer
description: Use when designing evaluation suites for LLM applications. Use before or after deployment. Produces evaluation frameworks, benchmark selection, test case design, and quality thresholds.

LLM Evaluation Designer

Overview

Design comprehensive evaluation frameworks for LLM-powered applications. Select appropriate benchmarks, create test cases, and define quality thresholds that matter for your use case.

Core principle: Evaluation should measure what matters for your specific use case, not just generic capabilities.

When to Use

  • Building new LLM application
  • Comparing model options
  • Establishing quality baselines
  • Continuous monitoring setup
  • Regression testing after changes

Output Format

yaml
llm_evaluation:
  application: "[Application name]"
  evaluation_date: "[YYYY-MM-DD]"
  
  use_case_definition:
    primary_task: "[What the LLM does]"
    input_type: "[What goes in]"
    output_type: "[What comes out]"
    success_criteria: "[What good looks like]"
    failure_modes: "[What bad looks like]"
  
  evaluation_dimensions:
    - dimension: "[Quality aspect]"
      weight: "[Importance 1-10]"
      metrics:
        - name: "[Metric name]"
          definition: "[How calculated]"
          threshold: "[Acceptable value]"
          measurement: "[Auto | Human | Hybrid]"
  
  test_suite:
    golden_set:
      size: "[N examples]"
      source: "[Where from]"
      coverage:
        - category: "[Input type/scenario]"
          count: "[N]"
          rationale: "[Why included]"
    
    test_cases:
      - id: "[TC-001]"
        category: "[Category]"
        input: "[Test input]"
        expected_output: "[Expected or acceptable outputs]"
        evaluation_criteria: "[How to score]"
        edge_case: [true | false]
    
    adversarial_tests:
      - category: "[Attack type]"
        examples: ["[Example inputs]"]
        expected_behavior: "[How model should respond]"
  
  benchmarks:
    standard:
      - name: "[Benchmark name]"
        relevance: "[Why applicable]"
        baseline_score: "[Industry baseline]"
        our_target: "[Our target]"
    
    custom:
      - name: "[Custom benchmark]"
        description: "[What it measures]"
        construction: "[How built]"
  
  evaluation_process:
    automated:
      - metric: "[Metric name]"
        tool: "[Evaluation tool]"
        frequency: "[When run]"
    
    human:
      - metric: "[Metric name]"
        evaluator_criteria: "[Who and how]"
        sample_size: "[N per period]"
        inter_rater_reliability: "[How to ensure consistency]"
  
  thresholds:
    launch_criteria:
      - metric: "[Metric]"
        minimum: "[Value]"
        blocking: [true | false]
    
    monitoring_alerts:
      - metric: "[Metric]"
        warning: "[Value]"
        critical: "[Value]"
  
  comparison:
    baseline_model: "[Current/baseline model]"
    comparison_methodology: "[How to compare fairly]"
    statistical_significance: "[Required confidence level]"

Evaluation Dimensions

Task-Specific Quality

DimensionMeasuresExample Metrics
AccuracyCorrectness of outputsExact match, F1, BLEU
RelevanceOutputs address the taskHuman rating, retrieval metrics
CompletenessAll required elements presentChecklist coverage
Format complianceMatches expected structureJSON validity, schema match

General Quality

DimensionMeasuresExample Metrics
CoherenceLogical flow and clarityHuman rating
FluencyNatural language qualityPerplexity, human rating
ConsistencySame input → similar outputOutput variance
GroundednessClaims supported by evidenceAttribution accuracy

Safety & Alignment

DimensionMeasuresExample Metrics
HarmlessnessNo harmful outputsSafety classifier, human audit
HonestyNo hallucinationsFactuality checks, source verification
HelpfulnessActually useful to userTask completion rate
BiasFairness across groupsDisparity metrics

Test Suite Design

Golden Set Construction

yaml
golden_set_strategy:
  coverage_requirements:
    - "Representative of production distribution"
    - "Include all major input categories"
    - "Include known edge cases"
    - "Include adversarial examples"
  
  size_guidance:
    minimum: "100 examples for statistical significance"
    recommended: "500+ for comprehensive coverage"
    per_category: "At least 20 examples per category"
  
  maintenance:
    review_frequency: "Quarterly"
    update_triggers:
      - "New failure modes discovered"
      - "Input distribution shifts"
      - "New features added"

Test Case Categories

yaml
test_categories:
  core_functionality:
    - "Typical inputs"
    - "Common variations"
    - "Expected happy path"
  
  edge_cases:
    - "Empty/minimal input"
    - "Maximum length input"
    - "Unusual formatting"
    - "Ambiguous requests"
  
  adversarial:
    - "Prompt injection attempts"
    - "Jailbreak attempts"
    - "Off-topic requests"
    - "Harmful content requests"
  
  robustness:
    - "Typos and misspellings"
    - "Multiple languages"
    - "Informal language"
    - "Technical jargon"

Automated Evaluation

Metrics by Output Type

Output TypeMetricsTools
ClassificationAccuracy, F1, confusion matrixsklearn
ExtractionPrecision, recall, exact matchCustom
GenerationBLEU, ROUGE, BERTScoreHuggingFace evaluate
SummarizationROUGE, factual consistencysummeval
QAExact match, F1, semantic similarityCustom

LLM-as-Judge

yaml
llm_judge:
  use_cases:
    - "Open-ended generation quality"
    - "Subjective assessments"
    - "Complex criteria"
  
  setup:
    judge_model: "[Typically GPT-4 or similar]"
    prompt_template: |
      Evaluate the following response on a scale of 1-5 for {{criterion}}.
      
      User query: {{query}}
      Response: {{response}}
      
      Score (1-5):
      Reasoning:
    
    calibration:
      - "Use reference examples at each score level"
      - "Regular inter-rater checks against humans"

Human Evaluation

When Required

yaml
human_eval_required:
  - "Subjective quality (helpfulness, tone)"
  - "Safety and harm assessment"
  - "Complex task correctness"
  - "Ground truth creation"

Evaluation Protocol

yaml
human_eval_protocol:
  evaluators:
    who: "[Domain experts | Trained raters | End users]"
    training: "[Calibration process]"
    count: "[At least 2 per example]"
  
  rating_scales:
    binary: "Accept/Reject for clear criteria"
    likert: "1-5 for subjective quality"
    ranking: "A vs B for comparison"
  
  quality_control:
    - "Include known examples to check rater quality"
    - "Measure inter-rater agreement (Kappa > 0.6)"
    - "Flag and review disagreements"

Threshold Setting

Launch Criteria

yaml
launch_criteria:
  must_pass:
    - metric: "Accuracy"
      threshold: ">95%"
      rationale: "Core functionality"
    
    - metric: "Safety classifier pass rate"
      threshold: ">99.9%"
      rationale: "Non-negotiable safety"
  
  should_meet:
    - metric: "Human quality rating"
      threshold: ">4.0/5.0"
      rationale: "User satisfaction"

Continuous Monitoring

yaml
monitoring_thresholds:
  accuracy:
    normal: ">95%"
    warning: "93-95%"
    critical: "<93%"
    window: "24 hours"
  
  response_time:
    normal: "<2s p95"
    warning: "2-5s p95"
    critical: ">5s p95"
    window: "1 hour"

A/B Testing

yaml
ab_test_design:
  hypothesis: "[What you expect to be better]"
  
  metrics:
    primary: "[Main metric to improve]"
    guardrails: ["[Metrics that shouldn't degrade]"]
  
  sample_size:
    calculation: "[Based on expected effect size]"
    minimum: "[Statistical requirement]"
  
  duration:
    minimum: "[For statistical significance]"
    maximum: "[Before business decision needed]"
  
  decision_criteria:
    ship_if: "[What constitutes success]"
    dont_ship_if: "[What stops the launch]"

Checklist

  • Use case and success criteria defined
  • Evaluation dimensions weighted by importance
  • Golden test set constructed with coverage
  • Automated metrics implemented
  • Human evaluation protocol defined
  • Thresholds set for launch and monitoring
  • Adversarial tests included
  • Comparison methodology documented