AI Evaluation (Evals)
Build systematic evaluation frameworks for AI/LLM products to measure quality, catch regressions, and improve model performance.
When to Use
- •Building product with LLM/AI components
- •Need to measure AI output quality systematically
- •Comparing models or prompts (A/B testing)
- •Detecting regressions before deployment
- •Benchmarking against competitors
- •Improving AI accuracy over time
- •Explaining AI decisions to stakeholders
Core Concept
AI Evaluation (Evals) ≠ Traditional Testing
Traditional software: Deterministic (same input → same output) AI/LLM systems: Probabilistic (same input → variable outputs)
Why Evals Are Hard:
- •Outputs are subjective (is this "good" writing?)
- •No single right answer (multiple valid responses)
- •Edge cases are infinite (can't test everything)
- •Models change behavior with updates
Solution: Build eval suites that:
- •Define quality metrics (what is "good"?)
- •Create representative test cases
- •Measure systematically (automated + human)
- •Track over time (catch regressions)
Workflow
Step 1: Define What You're Evaluating
markdown
## AI Component Taxonomy **CLASSIFICATION TASKS:** - Sentiment analysis (positive/negative/neutral) - Content moderation (safe/unsafe) - Intent detection (user wants X) - Entity recognition (extract names, dates) Eval approach: Accuracy, precision, recall, F1 score --- **GENERATION TASKS:** - Text generation (summaries, responses, creative writing) - Code generation (functions, scripts) - Recommendations (suggest items, next actions) - Translations (language → language) Eval approach: Quality scores, human preference, task success rate --- **RETRIEVAL TASKS:** - Search (find relevant documents) - Recommendation (rank items by relevance) - Question answering (retrieve + synthesize answer) Eval approach: Relevance, ranking quality (NDCG, MRR) --- **REASONING TASKS:** - Multi-step problem solving - Complex decision making - Causal inference Eval approach: Correctness, reasoning quality, step-by-step validation
Step 2: Build Your Eval Dataset
Golden Dataset: Curated examples with known correct outputs.
markdown
## Dataset Creation Framework **SIZE REQUIREMENTS:** - Minimum: 50-100 examples (manual review feasible) - Good: 500-1000 examples (covers edge cases) - Production: 5,000-10,000+ examples (statistical significance) **COMPOSITION:** 1. **Happy Path (40%)** - Typical, well-formed inputs 2. **Edge Cases (30%)** - Unusual but valid inputs 3. **Adversarial (20%)** - Deliberately tricky inputs 4. **Failure Cases (10%)** - Invalid inputs (test error handling) **EXAMPLE (Book Recommendation AI):** Happy Path: - "Recommend books like Harry Potter for my 10-year-old" - "My kid loved Percy Jackson, what's next?" Edge Cases: - "Books for advanced reader (7yo but reads at 5th grade level)" - "Fantasy but NO violence or romance" Adversarial: - "Best books" (too vague) - "Books about [topic that doesn't exist for kids]" Failure Cases: - Gibberish input - Adult content request **SOURCES FOR TEST CASES:** 1. **User Logs** - Real queries (anonymized) 2. **Team Brainstorm** - Manual generation 3. **Synthetic** - GPT-4 to generate test cases 4. **Competitor Comparison** - Test against their outputs 5. **Bug Reports** - Historical failures
Step 3: Define Evaluation Metrics
Quantitative Metrics:
markdown
## Metric Types ### ACCURACY METRICS (Classification) - **Accuracy:** (Correct predictions) / (Total) - **Precision:** (True Positives) / (True Positives + False Positives) - **Recall:** (True Positives) / (True Positives + False Negatives) - **F1 Score:** Harmonic mean of precision and recall Use when: Clear right/wrong answer (classification, extraction) --- ### QUALITY METRICS (Generation) - **Coherence:** Does output make sense? (1-5 scale) - **Relevance:** Does output answer the question? (1-5 scale) - **Helpfulness:** Is output useful to user? (1-5 scale) - **Safety:** Is output safe/appropriate? (pass/fail) - **Hallucination Rate:** % of outputs with false information Use when: Subjective quality assessment needed --- ### TASK SUCCESS METRICS - **Completion Rate:** % of tasks successfully completed - **User Satisfaction:** Thumbs up/down, NPS - **Time to Success:** How long to achieve goal - **Retry Rate:** % of users who re-prompt after first response Use when: Evaluating end-to-end task performance --- ### RANKING METRICS (Retrieval/Recommendation) - **MRR (Mean Reciprocal Rank):** Average of 1/rank of first relevant result - **NDCG (Normalized Discounted Cumulative Gain):** Quality of ranking - **Precision@K:** % of top K results that are relevant Use when: Evaluating search or recommendation quality
Qualitative Metrics:
markdown
## Human Evaluation **PAIRWISE COMPARISON:** Show human raters two outputs (A vs B), ask "Which is better?" - Advantage: Easier than absolute rating - Disadvantage: Slower, requires more comparisons **LIKERT SCALE RATING:** Rate outputs 1-5 on dimensions (coherence, helpfulness, safety) - Advantage: Fast, can aggregate scores - Disadvantage: Subjective, rater disagreement **TASK COMPLETION:** Can human complete task using AI output? - Advantage: Measures real utility - Disadvantage: Slow, expensive **RED TEAM REVIEW:** Experts try to find failures (adversarial testing) - Advantage: Finds edge cases - Disadvantage: Not systematic
Step 4: Automated Evaluation Strategies
Use LLM as Judge:
markdown
## LLM-as-Evaluator Pattern
**CONCEPT:** Use GPT-4 (or strong model) to evaluate outputs from your AI.
**PROMPT TEMPLATE:**
"You are an expert evaluator. Rate the following AI response on:
1. Relevance (1-5)
2. Accuracy (1-5)
3. Helpfulness (1-5)
User query: {query}
AI response: {response}
Ground truth (if available): {truth}
Provide ratings and brief explanation."
**ADVANTAGES:**
✅ Scalable (can eval thousands of examples)
✅ Consistent (same rubric every time)
✅ Fast (seconds per eval)
✅ Cheap (pennies per eval)
**DISADVANTAGES:**
❌ Not 100% reliable (LLM judge can be wrong)
❌ Requires validation (compare to human ratings)
❌ Can miss nuanced failures
**VALIDATION:**
- Run LLM judge on 100-200 examples
- Have humans also rate same examples
- Calculate inter-rater agreement (Cohen's kappa)
- If agreement >70%, LLM judge is trustworthy
Step 5: Build Eval Pipeline
Continuous Evaluation System:
markdown
## Eval Pipeline Architecture **COMPONENTS:** 1. **Test Suite Storage** - JSON/CSV of test cases - Version controlled (git) - Tagged by category (happy path, edge case, etc.) 2. **Runner Script** - Iterate through test cases - Call AI system with each input - Collect outputs - Log latency, cost, errors 3. **Scorer** - Compare output to expected (if available) - Run automated metrics (accuracy, ROUGE, BLEU, etc.) - Call LLM judge for quality rating - Aggregate scores 4. **Regression Detection** - Compare current run to baseline - Flag significant drops (e.g., accuracy down >5%) - Alert team if regression detected 5. **Reporting Dashboard** - Visualize metrics over time - Drill down into failures - Compare models/prompts side-by-side **FREQUENCY:** - Pre-deploy: Every code/prompt change - Nightly: Full suite run on production - Weekly: Human review of sample outputs
Step 6: Common Eval Patterns
markdown
## Evaluation Strategies by Use Case ### RECOMMENDATION SYSTEMS **Test:** Does user engage with recommendation? Metrics: - Click-through rate (CTR) - Conversion rate (purchase, complete) - Time spent with recommended item - Diversity (not all same type) Golden Dataset: - Historical user behavior (X user liked Y, did they like Z?) - Synthetic: "If user likes [A, B, C], recommend [D]?" --- ### CONTENT MODERATION **Test:** Does it correctly flag unsafe content? Metrics: - Precision (flagged = actually unsafe) - Recall (actually unsafe = flagged) - False positive rate (safe content flagged) Golden Dataset: - Curated examples of safe/unsafe content - Edge cases (satire, context-dependent) --- ### SUMMARIZATION **Test:** Does summary capture key points? Metrics: - ROUGE score (overlap with reference summary) - Factual consistency (no hallucinations) - Compression ratio (length of summary / original) Golden Dataset: - Documents with human-written summaries - Check: All key facts present, no false info --- ### CODE GENERATION **Test:** Does generated code work? Metrics: - Syntax correctness (parses without errors) - Functional correctness (passes unit tests) - Code quality (readable, efficient) Golden Dataset: - Programming problems with test cases - Example: "Write function that reverses string" + 10 test cases --- ### CONVERSATIONAL AI **Test:** Does it handle multi-turn conversation well? Metrics: - Coherence across turns - Context retention (remembers earlier messages) - Task completion rate (user achieves goal) - Safety (doesn't generate harmful content) Golden Dataset: - Scripted conversations with expected paths - User logs (real conversations, anonymized)
Step 7: A/B Testing for AI
Compare models, prompts, or configurations:
markdown
## AI A/B Testing Framework **SETUP:** 1. Define variants (Model A vs. Model B, or Prompt v1 vs. v2) 2. Random assignment (50/50 split) 3. Define success metric (accuracy, user satisfaction, task completion) 4. Minimum sample size (depends on expected effect size) **METRICS TO TRACK:** - Primary: Quality (accuracy, preference, satisfaction) - Secondary: Latency, cost, error rate - Guardrails: Safety violations, user complaints **STATISTICAL SIGNIFICANCE:** - Run until p < 0.05 (95% confidence) - Typically need 1,000-10,000 samples depending on effect size - Use tools: Optimizely, LaunchDarkly, or custom **COMMON TESTS:** - Model comparison: GPT-4 vs. Claude vs. Gemini - Prompt engineering: Version A vs. B - Temperature: 0.7 vs. 0.9 (creativity vs. consistency) - Context window: Include X vs. Y tokens of context **EXAMPLE:** Variant A: GPT-4 with prompt v1 Variant B: GPT-4 with prompt v2 Metric: User thumbs up rate - A: 70% thumbs up (n=500) - B: 75% thumbs up (n=500) - Result: B wins, p=0.03 (significant) → Ship prompt v2
Common Eval Mistakes
markdown
## Anti-Patterns ❌ **No Golden Dataset** Testing AI without reference examples → Fix: Curate 100+ examples with expected outputs ❌ **Testing Only Happy Path** Ignoring edge cases and adversarial inputs → Fix: 30% of dataset should be edge cases ❌ **Manual Eval Only** Reviewing outputs one-by-one (doesn't scale) → Fix: Automate with LLM judge + spot-check humans ❌ **No Regression Detection** Shipping changes without comparing to baseline → Fix: Track metrics over time, alert on drops ❌ **Vanity Metrics** Measuring things that don't correlate with user value → Fix: Eval what matters (task success, user satisfaction) ❌ **Overfitting to Eval Set** Optimizing prompts specifically for test cases → Fix: Hold out test set, regularly refresh with new examples
Eval Tooling
Open Source:
- •LangSmith (LangChain) - Eval framework for LLM apps
- •Prompt flow (Microsoft) - End-to-end eval pipeline
- •Weights & Biases - Experiment tracking
- •Ragas - RAG evaluation framework
Commercial:
- •Anthropic Console - Claude model evals
- •OpenAI Evals - GPT model testing
- •HumanSignal - Human annotation platform
Related Skills
- •
/building-with-llms- Best practices for AI product development - •
/ai-product-strategy- Strategic AI product decisions - •
/testing-strategies- Traditional software testing - •
/performance-optimization- Optimize AI latency/cost
Last Updated: 2026-01-22