AI Evaluation
Purpose
Design an evaluation framework for AI/LLM features, including golden dataset creation, automated scoring rubrics, hallucination detection, and regression testing infrastructure.
Inputs
- AI feature being evaluated (what it does, expected behavior)
- Input data examples and edge cases
- Quality requirements (accuracy thresholds, hallucination tolerance)
- Existing evaluation infrastructure (if any)
- Production monitoring requirements
Process
Step 1: Define Evaluation Dimensions
Identify what "good" means for this feature:
- Correctness: Does the output match the expected answer?
- Faithfulness: Does the output only use information from the provided context?
- Relevance: Does the output answer the actual question asked?
- Completeness: Does the output cover all aspects of the question?
- Format compliance: Does the output match the expected structure?
- Safety: Does the output avoid harmful, biased, or inappropriate content?
Step 2: Build Golden Dataset
Create a high-quality evaluation dataset:
- Size: At least 50 examples, ideally 200+ so that score differences between runs are less likely to be noise
- Distribution: Cover common cases (60%), edge cases (25%), adversarial cases (15%)
- Labeling: Each example has an input, an expected output, and scoring criteria
- Source: Real user queries (anonymized) plus synthetically generated edge cases
- Versioning: Dataset is version-controlled alongside the code
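As a minimal sketch of the labeling and distribution guidance above, each example could be stored as one record, and the category mix checked before a dataset version is committed. The `GoldenExample` schema, its field names, and the 5-point tolerance are assumptions made for illustration, not part of any specific tool.

```python
from collections import Counter
from dataclasses import dataclass

# Target mix from Step 2: common 60%, edge 25%, adversarial 15%.
TARGET_MIX = {"common": 0.60, "edge": 0.25, "adversarial": 0.15}


@dataclass(frozen=True)
class GoldenExample:
    """One labeled example (hypothetical schema)."""
    example_id: str
    category: str          # "common", "edge", or "adversarial"
    input_text: str
    expected_output: str
    scoring_criteria: str


def check_dataset(examples, min_size=50, tolerance=0.05):
    """Return a list of problems; an empty list means the dataset passes."""
    problems = []
    if len(examples) < min_size:
        problems.append(f"only {len(examples)} examples, need at least {min_size}")
    counts = Counter(ex.category for ex in examples)
    unknown = set(counts) - set(TARGET_MIX)
    if unknown:
        problems.append(f"unknown categories: {sorted(unknown)}")
    total = len(examples) or 1  # avoid division by zero on an empty dataset
    for category, target in TARGET_MIX.items():
        share = counts.get(category, 0) / total
        if abs(share - target) > tolerance:
            problems.append(f"{category}: {share:.0%} of dataset, target {target:.0%}")
    return problems
```

A check like this could run as a pre-commit or CI step so that a dataset revision that drifts away from the intended mix is caught before it is used for scoring.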
Step 3: Design Automated Scoring
Create scoring rubrics that can run without human review:
- Exact match: For classification, extraction, or structured output (score: 0 or 1)
- Semantic similarity: Embedding-based comparison of generated vs. expected output (score: 0-1)
- LLM-as-judge: Use a stronger model to evaluate the output against a rubric (score: 1-5)
- Rule-based checks: Required fields present, format valid, no PII leaked
- Composite score: Weighted combination of the individual dimension scores
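The scoring methods above produce values on different scales (0 or 1, 0-1, 1-5), so they need to be normalized before they are combined. The following is a sketch under stated assumptions: the function names are hypothetical, the embeddings are taken to be plain lists of floats produced elsewhere, and the 1-5 judge score is mapped linearly onto 0-1. Cosine similarity can be negative for some embedding models, so a real pipeline may need to clamp it to the 0-1 range.

```python
import math


def exact_match(generated: str, expected: str) -> float:
    """1.0 if the strings match after trimming whitespace and case, else 0.0."""
    return 1.0 if generated.strip().lower() == expected.strip().lower() else 0.0


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize_judge_score(score: float) -> float:
    """Map a 1-5 rubric score onto 0-1 so it can be combined with other scores."""
    return (score - 1.0) / 4.0


def composite_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, each already on a 0-1 scale."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Example weights; the split is an assumption and should be tuned per feature.
weights = {"correctness": 0.4, "faithfulness": 0.3, "relevance": 0.2, "format": 0.1}
scores = {
    "correctness": normalize_judge_score(4),  # judge gave 4 out of 5
    "faithfulness": 1.0,
    "relevance": 0.9,
    "format": 1.0,
}
overall = composite_score(scores, weights)
```

Dividing by the total weight keeps the composite on a 0-1 scale even if the weights do not sum exactly to 1.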
Step 4: Design Hallucination Detection
Build specific checks for fabricated content:
- Reference validation: Every cited fact must trace back to a source document
- Entity verification: Named entities (people, dates, numbers) must appear in the provided context
- Confidence calibration: Does the model express uncertainty when, and only when, its answer is likely to be wrong?
- Contradiction detection: Does the output contradict the provided context?
- Fabrication patterns: Flag common hallucination patterns (fake URLs, invented citations)
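One simple way to approximate entity verification and fabricated-URL checks is a lexical comparison: any number or URL in the output that never appears in the context is flagged for review. The sketch below covers only numbers and URLs; checking people or organization names would need a named-entity recognizer, which is out of scope here. The function name and the regular expressions are illustrative assumptions.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


def _urls(text: str) -> set:
    """Extract URLs, dropping trailing sentence punctuation."""
    return {url.rstrip(".,;:!?") for url in URL_RE.findall(text)}


def unsupported_items(output: str, context: str) -> dict:
    """Return numbers and URLs that appear in the output but not in the context."""
    context_numbers = set(NUMBER_RE.findall(context))
    # Strip URLs first so digits inside a URL are not also reported as numbers.
    output_without_urls = URL_RE.sub(" ", output)
    output_numbers = set(NUMBER_RE.findall(output_without_urls))
    return {
        "numbers": sorted(output_numbers - context_numbers),
        "urls": sorted(_urls(output) - _urls(context)),
    }
```

A check like this is deliberately strict: a correctly derived number (for example, a sum of two figures in the context) would also be flagged, so flagged items are better treated as candidates for review or for an LLM-as-judge pass than as confirmed hallucinations.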
Step 5: Design Regression Testing
Build a CI/CD-compatible evaluation pipeline:
- Trigger: Run on prompt changes, model upgrades, or code changes affecting AI features
- Threshold enforcement: Fail the build if the eval score drops below the threshold
- Comparison reporting: Show the score delta vs. the previous version and highlight regressions
- Fast vs. full: Quick smoke test (20 examples) on every commit, full eval (200+ examples) for releases
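Threshold enforcement and comparison reporting can be sketched as one function that compares the current run with a stored baseline. This is an illustration under assumptions: scores are taken to be dictionaries on a 0-1 scale with a `composite` key, and `compare_runs` and the `max_drop` tolerance are hypothetical names, not part of any CI product.

```python
def compare_runs(current: dict, baseline: dict, threshold: float, max_drop: float = 0.02):
    """Return (passed, report_lines) for a current eval run versus a baseline.

    Fails if the composite score is below `threshold` or if any dimension
    dropped by more than `max_drop` compared with the baseline.
    """
    passed = current["composite"] >= threshold
    lines = []
    for name in sorted(current):
        # Dimensions missing from the baseline are treated as unchanged.
        delta = current[name] - baseline.get(name, current[name])
        regressed = delta < -max_drop
        if regressed:
            passed = False
        flag = "  <-- regression" if regressed else ""
        lines.append(f"{name:<14} {current[name]:.3f} ({delta:+.3f}){flag}")
    return passed, lines
```

In a CI job, the calling script would typically print the report lines and exit with a non-zero status when `passed` is false, which is what causes the build to fail.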
Step 6: Design Production Monitoring
Plan ongoing quality monitoring:
- Sampling: Score X% of production requests with the automated scoring from Step 3
- Feedback loop: Collect user thumbs-up/down ratings and explicit corrections
- Drift detection: Track shifts in the score distribution over time (model degradation, data drift)
- Alerting: Alert when the score drops below the threshold, the hallucination rate spikes, or latency increases
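Sampling and drift detection can be sketched with the standard library alone. The example below assumes each request has a string ID and that scores are on a 0-1 scale; hashing the ID makes the sampling decision repeatable, and the drift check is a deliberately simple one-sided z-test on the mean of a recent window, so it only catches downward shifts in the average, not changes in the shape of the distribution. The function names and the z-threshold of 3 are assumptions.

```python
import hashlib
from statistics import mean, stdev


def should_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically select about `sample_rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def drift_detected(baseline_scores, recent_scores, z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean is far below the baseline mean.

    Needs at least two baseline scores so a standard deviation can be computed.
    """
    base_mean = mean(baseline_scores)
    base_sd = stdev(baseline_scores)
    if base_sd == 0:
        return mean(recent_scores) < base_mean
    standard_error = base_sd / len(recent_scores) ** 0.5
    z = (mean(recent_scores) - base_mean) / standard_error
    return z < -z_threshold
```

A production system would likely want a distribution-level test and separate alerts for hallucination rate and latency; this sketch only shows where such checks would plug in.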
Output Format
````markdown
# AI Evaluation Framework

## Evaluation Dimensions

| Dimension | Weight | Scoring Method | Threshold |
|-----------|--------|----------------|-----------|
| Correctness | 40% | LLM-as-judge (1-5) | ≥ 4.0 |
| Faithfulness | 30% | Reference validation | ≥ 95% |
| Relevance | 20% | Semantic similarity | ≥ 0.85 |
| Format compliance | 10% | Rule-based | 100% |

## Golden Dataset

**Size:** [N examples]

**Distribution:**

| Category | Count | Description |
|----------|-------|-------------|
| Common cases | N | [Description] |
| Edge cases | N | [Description] |
| Adversarial | N | [Description] |

**Storage:** [Location in repo]

**Versioning:** [Approach]

## Scoring Rubric

### Correctness (LLM-as-Judge)

| Score | Criteria |
|-------|----------|
| 5 | Perfect: matches expected output in all aspects |
| 4 | Good: minor differences that don't affect usefulness |
| 3 | Acceptable: correct core answer with some issues |
| 2 | Poor: partially correct but missing key information |
| 1 | Wrong: incorrect or misleading answer |

## Hallucination Detection

| Check | Method | Severity |
|-------|--------|----------|
| Reference validation | [Approach] | Critical |
| Entity verification | [Approach] | High |
| Contradiction detection | [Approach] | High |

## Regression Testing Pipeline

```
[Code change] → [Smoke test (20 examples)] → [Pass?] → [Merge]
[Release] → [Full eval (200+ examples)] → [Pass threshold?] → [Deploy]
```

**Threshold:** Composite score ≥ [X] to pass

**Reporting:** [Where results are published]

## Production Monitoring

| Metric | Sample Rate | Alert Threshold |
|--------|-------------|-----------------|
| Composite score | 5% of requests | < [X] |
| Hallucination rate | 5% of requests | > [X%] |
| User satisfaction | All feedback | < [X] thumbs-up rate |
````
Quality Checks
- Golden dataset has at least 50 examples covering common and edge cases
- Scoring rubric has clear criteria for each score level (not subjective)
- Hallucination detection checks references against source documents
- Regression testing is automated and blocks deploys on score drops
- Production monitoring includes both automated scoring and user feedback
- Eval dataset is version-controlled alongside the code