AgentSkillsCN

Regression Benchmarks

Build a regression benchmark system for AI systems, covering test suite creation, CI/CD integration, and continuous benchmarking strategies

SKILL.md
---
name: Regression Benchmarks
description: Comprehensive guide to regression benchmarks for AI systems including test suite creation, CI/CD integration, and continuous benchmarking strategies
---

Regression Benchmarks

What are Regression Benchmarks?

Definition: A test suite that detects performance degradation. It runs on every change, like unit tests, to catch regressions before deployment.

Example

code
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected! ❌ Block deployment

Like Unit Tests for AI

code
Unit tests: Verify code correctness
Regression benchmarks: Verify AI performance

Both run automatically on every change

Why Regression Benchmarks Matter

New Features Can Break Existing Functionality

code
Add new feature: Multi-language support
Side effect: English accuracy drops from 90% to 85%
→ Regression benchmark catches this

Model Updates Can Reduce Quality

code
Update model: GPT-3.5 → GPT-4
Unexpected: Latency increases 5x
→ Regression benchmark catches this

Prompt Changes Can Have Unintended Effects

code
Change prompt: Add "Be concise"
Side effect: Completeness drops (answers too short)
→ Regression benchmark catches this

Catch Issues Early (Before Production)

code
Without benchmarks:
Deploy → Users complain → Rollback → Fix → Redeploy
Timeline: Days

With benchmarks:
Change → Benchmark fails → Fix → Benchmark passes → Deploy
Timeline: Hours

Components of Regression Suite

Test Cases (Inputs + Expected Outputs)

json
{
  "test_id": "test_001",
  "input": "What is the capital of France?",
  "expected_output": "Paris",
  "category": "geography",
  "priority": "high"
}

Evaluation Metrics (How to Score)

code
- Exact match accuracy
- F1 score
- Semantic similarity
- LLM-as-judge score
- Latency

Acceptance Thresholds (What's Acceptable)

code
Absolute thresholds:
- Accuracy > 90%
- Latency P95 < 500ms
- Safety violations = 0

Relative thresholds:
- Accuracy: No worse than -2% vs baseline
- Latency: No worse than +10% vs baseline

Automated Execution (CI/CD)

yaml
# Run on every PR
on: [pull_request]

jobs:
  regression-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Run regression benchmarks
        run: python run_benchmarks.py
      
      - name: Check thresholds
        run: |
          # [ -lt ] only compares integers, so use awk for the float check
          if awk -v acc="$ACCURACY" 'BEGIN { exit !(acc < 0.90) }'; then
            echo "Accuracy regression detected"
            exit 1
          fi

Creating Regression Test Cases

Representative Examples (Common Queries)

code
Include:
- Most frequent queries (80%)
- Cover all major categories
- Balanced difficulty

Example:

json
[
  {"input": "What is 2+2?", "expected": "4", "category": "math", "frequency": "high"},
  {"input": "Capital of France?", "expected": "Paris", "category": "geography", "frequency": "high"},
  {"input": "Who wrote Hamlet?", "expected": "Shakespeare", "category": "literature", "frequency": "medium"}
]

Edge Cases (Where Model Struggles)

code
Include:
- Ambiguous questions
- Multi-hop reasoning
- Rare topics
- Adversarial examples

Example:

json
[
  {"input": "What is the best programming language?", "expected": "Subjective, no single answer", "category": "edge_case"},
  {"input": "Capital of Atlantis?", "expected": "Atlantis is fictional", "category": "edge_case"}
]

Historical Failures (Bugs That Happened Before)

code
When a bug occurs:
1. Fix the bug
2. Add a test case to prevent regression
3. The test case stays in the suite permanently

Example:

json
{
  "test_id": "bug_fix_123",
  "input": "What is the population of Paris?",
  "expected": "Should include number, not just 'Paris has a population'",
  "bug_report": "Issue #123: Incomplete answers",
  "added_date": "2024-01-15"
}

Diverse Coverage (Different Categories)

code
Categories:
- Math (10%)
- Geography (10%)
- Science (10%)
- History (10%)
- General knowledge (60%)
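
A minimal sketch of checking this distribution automatically, assuming test cases are stored as a JSON list with a `category` field as in the examples above (the file name `test_cases.json` is illustrative):

python
from collections import Counter
import json

def report_category_coverage(test_cases):
    """Print the share of tests per category so coverage gaps are easy to spot."""
    counts = Counter(t["category"] for t in test_cases)
    total = len(test_cases)
    for category, count in counts.most_common():
        print(f"{category}: {count} tests ({count / total:.0%})")
    return counts

# Usage
with open("test_cases.json") as f:
    report_category_coverage(json.load(f))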

Test Case Structure

Input (Question, Prompt, Image)

json
{
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  }
}

Expected Behavior (Correct Answer or Quality Criteria)

json
{
  "expected": {
    "type": "exact_match",
    "value": "Paris"
  }
}

Or quality criteria:

json
{
  "expected": {
    "type": "quality_criteria",
    "faithfulness": ">= 0.9",
    "relevance": ">= 0.9",
    "completeness": ">= 0.8"
  }
}

Context (If Needed for Evaluation)

json
{
  "context": "Paris is the capital and largest city of France."
}

Metadata (Category, Difficulty, Priority)

json
{
  "metadata": {
    "category": "geography",
    "difficulty": "easy",
    "priority": "high",
    "added_date": "2024-01-15",
    "last_updated": "2024-01-15"
  }
}

Complete Example:

json
{
  "test_id": "test_001",
  "input": {
    "type": "question",
    "text": "What is the capital of France?"
  },
  "expected": {
    "type": "exact_match",
    "value": "Paris",
    "acceptable_variants": ["paris", "PARIS"]
  },
  "context": "Paris is the capital and largest city of France.",
  "metadata": {
    "category": "geography",
    "difficulty": "easy",
    "priority": "high"
  }
}
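
A small loader sketch that validates this structure before the suite runs, assuming test cases live in a single JSON array; the field names follow the example above and the helper name is illustrative:

python
import json

REQUIRED_FIELDS = {"test_id", "input", "expected", "metadata"}

def load_test_cases(path):
    """Load test cases and fail fast if any entry is missing required fields."""
    with open(path) as f:
        test_cases = json.load(f)
    for case in test_cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{case.get('test_id', '<unknown>')}: missing fields {missing}")
    return test_cases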

Evaluation Strategies

Exact Match (For Deterministic Outputs)

python
def exact_match(predicted, expected):
    return predicted.strip().lower() == expected.strip().lower()

# Test
assert exact_match("Paris", "paris")
assert not exact_match("Lyon", "Paris")

Semantic Similarity (For Generated Text)

python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(predicted, expected, threshold=0.8):
    emb1 = model.encode(predicted)
    emb2 = model.encode(expected)
    similarity = cosine_similarity([emb1], [emb2])[0][0]
    return similarity >= threshold

# Test
assert semantic_similarity("Paris is the capital", "The capital is Paris")

LLM-as-Judge (For Quality)

python
def llm_judge(question, answer, criteria):
    # `llm` is a placeholder for whatever LLM client the project uses
    prompt = f"""
    Question: {question}
    Answer: {answer}

    Evaluate on criteria: {criteria}
    Score (1-5):
    """

    score = llm.generate(prompt)
    return int(str(score).strip()) >= 4  # Pass if score >= 4

# Test
assert llm_judge("Capital of France?", "Paris", "correctness")

Human Spot Checks (Sample Validation)

python
import random

# Sample 10% (at least one test) for human review
sample_size = max(1, len(test_cases) // 10)
sample = random.sample(test_cases, sample_size)

for test in sample:
    human_score = get_human_evaluation(test)  # placeholder: manual rating workflow
    automated_score = run_test(test)          # placeholder: automated evaluator

    # Validate the automated evaluation: flag large disagreements
    if abs(human_score - automated_score) > 1:
        flag_for_review(test)
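
These strategies can be selected per test case via the `expected.type` field from the test case structure above. A minimal dispatcher sketch, reusing the `exact_match` and `llm_judge` helpers defined in this section (a `semantic_similarity` branch plugs in the same way):

python
def evaluate_test_case(test_case, predicted):
    """Route a test case to the evaluation strategy named by expected.type."""
    expected = test_case["expected"]
    if expected["type"] == "exact_match":
        variants = [expected["value"], *expected.get("acceptable_variants", [])]
        return any(exact_match(predicted, v) for v in variants)
    if expected["type"] == "quality_criteria":
        criteria = ", ".join(k for k in expected if k != "type")
        return llm_judge(test_case["input"]["text"], predicted, criteria)
    raise ValueError(f"Unknown expected type: {expected['type']}")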

Acceptance Criteria

Absolute Threshold

python
# Must meet minimum standards
thresholds = {
    "accuracy": 0.90,  # >= 90%
    "latency_p95": 500,  # <= 500ms
    "safety_violations": 0  # = 0
}

def check_absolute_thresholds(results):
    if results["accuracy"] < thresholds["accuracy"]:
        return False, f"Accuracy {results['accuracy']:.2%} < {thresholds['accuracy']:.2%}"
    
    if results["latency_p95"] > thresholds["latency_p95"]:
        return False, f"Latency {results['latency_p95']}ms > {thresholds['latency_p95']}ms"
    
    if results["safety_violations"] > thresholds["safety_violations"]:
        return False, f"Safety violations {results['safety_violations']} > 0"
    
    return True, "All thresholds met"

Relative Threshold (vs Baseline)

python
# No worse than baseline
relative_thresholds = {
    "accuracy": -0.02,  # No worse than -2%
    "latency": 0.10  # No worse than +10%
}

def check_relative_thresholds(current, baseline):
    accuracy_change = current["accuracy"] - baseline["accuracy"]
    if accuracy_change < relative_thresholds["accuracy"]:
        return False, f"Accuracy dropped {accuracy_change:.2%}"
    
    latency_change = (current["latency"] - baseline["latency"]) / baseline["latency"]
    if latency_change > relative_thresholds["latency"]:
        return False, f"Latency increased {latency_change:.2%}"
    
    return True, "No regression vs baseline"

Per-Category Thresholds (Important Categories Stricter)

python
category_thresholds = {
    "safety": {"accuracy": 0.99},  # Stricter for safety
    "math": {"accuracy": 0.95},
    "general": {"accuracy": 0.90}
}

def check_category_thresholds(results_by_category):
    for category, results in results_by_category.items():
        threshold = category_thresholds.get(category, {"accuracy": 0.90})
        
        if results["accuracy"] < threshold["accuracy"]:
            return False, f"{category} accuracy {results['accuracy']:.2%} < {threshold['accuracy']:.2%}"
    
    return True, "All categories meet thresholds"

Regression Test Types

Accuracy Tests (Correct Answers)

python
def test_accuracy():
    results = run_model_on_test_set(test_cases)
    accuracy = sum(r["correct"] for r in results) / len(results)
    assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} < 90%"

Quality Tests (Output Quality)

python
def test_quality():
    results = evaluate_quality(test_cases)
    avg_quality = sum(r["quality_score"] for r in results) / len(results)
    assert avg_quality >= 4.0, f"Quality {avg_quality:.1f} < 4.0"

Latency Tests (Response Time)

python
import numpy as np

def test_latency():
    latencies = measure_latency(test_cases)
    p95_latency = np.percentile(latencies, 95)
    assert p95_latency <= 500, f"P95 latency {p95_latency}ms > 500ms"

Safety Tests (No Harmful Outputs)

python
def test_safety():
    results = run_safety_checks(test_cases)
    violations = sum(r["is_unsafe"] for r in results)
    assert violations == 0, f"Safety violations: {violations}"

Consistency Tests (Same Input → Same Output)

python
import random

def test_consistency():
    # Run the same input 10 times for a sample of test cases
    for test_case in random.sample(test_cases, 10):
        outputs = [model.predict(test_case["input"]) for _ in range(10)]

        # Check all outputs are (nearly) identical
        unique_outputs = set(outputs)
        assert len(unique_outputs) <= 2, f"Inconsistent outputs: {unique_outputs}"

Running Regression Suite

Trigger

code
- Every code change (PR)
- Every prompt change
- Every model update
- Nightly (full suite)

Execution

yaml
# GitHub Actions
name: Regression Tests

on:
  pull_request:
  schedule:
    - cron: '0 0 * * *'  # Nightly

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Run regression benchmarks
        run: python run_benchmarks.py
      
      - name: Check results
        run: python check_thresholds.py

Duration

code
Fast enough for CI: Minutes, not hours

Strategies:
- Parallel execution
- Sample subset for PR (100 tests)
- Full suite nightly (1000 tests)

Reporting

code
Clear pass/fail + details

Example output:
✓ Accuracy: 92% (>= 90%)
✓ Latency P95: 450ms (<= 500ms)
✗ Safety: 2 violations (> 0)

Failed tests:
- test_042: Unsafe output detected
- test_137: Unsafe output detected

Overall: FAILED
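
A sketch of producing this kind of report, assuming a `results` dict and the `thresholds` dict from the Acceptance Criteria section (names and format strings are illustrative):

python
def print_report(results, thresholds):
    """Print per-metric pass/fail lines in the format shown above."""
    checks = [
        ("Accuracy", results["accuracy"], thresholds["accuracy"], ">=", "{:.0%}"),
        ("Latency P95", results["latency_p95"], thresholds["latency_p95"], "<=", "{:.0f}ms"),
        ("Safety violations", results["safety_violations"], thresholds["safety_violations"], "<=", "{}"),
    ]
    overall = True
    for name, value, limit, op, fmt in checks:
        passed = value >= limit if op == ">=" else value <= limit
        overall = overall and passed
        mark = "✓" if passed else "✗"
        print(f"{mark} {name}: {fmt.format(value)} ({op} {fmt.format(limit)})")
    print(f"\nOverall: {'PASSED' if overall else 'FAILED'}")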

Handling Failures

Investigate Root Cause

code
Failure detected → Investigate:
1. What changed? (code, prompt, model)
2. Which tests failed?
3. Why did they fail?

Intended Change (Update Benchmark)

code
Example:
- Changed behavior intentionally (new feature)
- Old test no longer valid
- Update test to reflect new expected behavior

Unintended Regression (Fix the Issue)

code
Example:
- Bug introduced
- Performance degraded
- Fix the code/prompt/model
- Re-run benchmarks

Flaky Test (Improve Test Stability)

code
Example:
- Test passes sometimes, fails sometimes
- Non-deterministic model output
- Fix: Use temperature=0, or semantic similarity instead of exact match
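
A sketch of the stabilized version, assuming the model client accepts a `temperature` parameter and reusing `semantic_similarity` from the Evaluation Strategies section:

python
def run_stable_test(test_case, model):
    # Deterministic decoding removes sampling noise between runs
    output = model.predict(test_case["input"], temperature=0)
    # Semantic comparison tolerates harmless wording differences
    return semantic_similarity(output, test_case["expected"], threshold=0.8)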

Golden Set

Curated Subset of Most Important Tests

code
Golden set: 100 most critical tests
Full suite: 1000 tests

Golden set:
- High-priority tests
- Core functionality
- Safety-critical
- Historical failures

High-Quality Ground Truth

code
Golden set has:
- Expert-verified answers
- Multiple annotators
- Regular review

Regularly Updated

code
Monthly: Review golden set
- Add new important tests
- Remove outdated tests
- Update expected outputs

Always Passing (Except When Intentional)

code
Golden set should always pass
If golden set fails → Critical issue
Block deployment until fixed
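
A sketch of enforcing this, assuming golden tests are flagged with a `golden` field and that `run_test()` returns a dict with a `passed` key (as in the reporting example later in this document):

python
def run_golden_set(test_cases, model):
    """Run only golden tests and require a 100% pass rate before deployment."""
    golden = [t for t in test_cases if t.get("golden", False)]
    failures = [t["test_id"] for t in golden if not run_test(t, model)["passed"]]
    if failures:
        raise SystemExit(f"Golden set FAILED ({len(failures)}/{len(golden)}): {failures}")
    print(f"Golden set passed: {len(golden)}/{len(golden)} tests")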

Versioning Benchmarks

Benchmark Version Tied to Model Version

code
Model v1.0 → Benchmark v1.0
Model v2.0 → Benchmark v2.0

Each model version has corresponding benchmark

Update Benchmarks When Requirements Change

code
New feature: Multi-language support
→ Add multi-language tests to benchmark
→ Increment benchmark version

Maintain History (Track Improvements Over Time)

code
Benchmark results over time:
v1.0: 85% accuracy
v1.1: 87% accuracy
v1.2: 90% accuracy
v2.0: 92% accuracy

Track progress!
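
A minimal sketch of keeping that history, assuming results are appended to a JSON file (the file layout and field names are illustrative):

python
import json
from datetime import date

def record_benchmark_run(path, benchmark_version, model_version, results):
    """Append this run's results to a JSON history file for trend tracking."""
    try:
        with open(path) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []
    history.append({
        "date": date.today().isoformat(),
        "benchmark_version": benchmark_version,
        "model_version": model_version,
        "accuracy": results["accuracy"],
    })
    with open(path, "w") as f:
        json.dump(history, f, indent=2)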

Benchmark Coverage

Functional Coverage (All Features Tested)

code
Features:
- Q&A: 40% of tests
- Summarization: 20%
- Translation: 20%
- Code generation: 20%

All features covered

Edge Case Coverage (Error Conditions, Boundaries)

code
Edge cases:
- Empty input
- Very long input
- Ambiguous questions
- No answer available
- Multiple valid answers

Performance Coverage (Speed, Cost)

code
Performance tests:
- Latency (P50, P95, P99)
- Throughput (requests/sec)
- Cost ($ per 1000 requests)

Continuous Benchmarking

Run on Every PR (Fast Subset)

yaml
on: [pull_request]

jobs:
  quick-benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Run quick benchmark (100 tests)
        run: python run_benchmarks.py --quick

Run Nightly (Full Suite)

yaml
on:
  schedule:
    - cron: '0 0 * * *'  # Every night at midnight

jobs:
  full-benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Run full benchmark (1000 tests)
        run: python run_benchmarks.py --full

Run on Model Updates (Comprehensive)

yaml
on:
  workflow_dispatch:
    inputs:
      model_version:
        description: 'Model version to test'
        required: true

jobs:
  model-benchmark:
    runs-on: ubuntu-latest
    steps:
      - name: Run comprehensive benchmark
        run: python run_benchmarks.py --model ${{ github.event.inputs.model_version }}

Benchmark Metrics to Track

Pass Rate (% of Tests Passing)

python
pass_rate = passed_tests / total_tests
print(f"Pass rate: {pass_rate:.1%}")

# Track over time
# v1.0: 85%
# v1.1: 90%
# v1.2: 95%

Performance vs Baseline (% Change)

python
baseline_accuracy = 0.90
current_accuracy = 0.92
improvement = (current_accuracy - baseline_accuracy) / baseline_accuracy

print(f"Improvement: {improvement:.1%}")  # +2.2%

Latency (P50, P95, P99)

python
import numpy as np

latencies = [100, 150, 200, 250, 300, ...]

p50 = np.percentile(latencies, 50)
p95 = np.percentile(latencies, 95)
p99 = np.percentile(latencies, 99)

print(f"P50: {p50}ms, P95: {p95}ms, P99: {p99}ms")

Cost Per Test (LLM API Calls)

python
total_cost = sum(test["cost"] for test in results)
cost_per_test = total_cost / len(results)

print(f"Cost per test: ${cost_per_test:.4f}")

Optimization

Cache Model Responses (If Deterministic)

python
import hashlib
import json

cache = {}

def cached_predict(input_text, model):
    # Create cache key
    key = hashlib.md5(f"{input_text}{model.version}".encode()).hexdigest()
    
    # Check cache
    if key in cache:
        return cache[key]
    
    # Predict
    output = model.predict(input_text)
    
    # Cache
    cache[key] = output
    
    return output

Parallelize Test Execution

python
from concurrent.futures import ThreadPoolExecutor

def run_benchmarks_parallel(test_cases, model):
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(lambda t: run_test(t, model), test_cases))
    
    return results

Sample Large Test Suites (For Fast Feedback)

python
import random

# is_pr would come from the CI environment (e.g. set on pull_request events)
# For PR: Sample 100 tests (fast)
if is_pr:
    test_cases = random.sample(all_test_cases, 100)

# For nightly: Run all tests (comprehensive)
else:
    test_cases = all_test_cases

Integration with CI/CD

GitHub Actions Workflow

yaml
name: Regression Benchmarks

on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Run benchmarks
        id: benchmark
        run: |
          python run_benchmarks.py > results.txt
          cat results.txt
      
      - name: Check thresholds
        run: |
          ACCURACY=$(grep "Accuracy:" results.txt | awk '{print $2}' | tr -d '%')
          # Use awk so the comparison also works for non-integer percentages
          if awk -v acc="$ACCURACY" 'BEGIN { exit !(acc < 90) }'; then
            echo "::error::Accuracy $ACCURACY% < 90%"
            exit 1
          fi
      
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: benchmark-results
          path: results.txt

Required Check (Block Merge If Failing)

yaml
# In GitHub repo settings:
# Branch protection rules → Require status checks
# Select: "benchmark" check

# PR cannot merge if benchmark fails

Clear Error Messages (Which Tests Failed)

python
import sys

def report_failures(results):
    failed = [r for r in results if not r["passed"]]

    if failed:
        print(f"\n❌ {len(failed)} tests failed:\n")
        for test in failed:
            print(f"  - {test['id']}: {test['error']}")

        print(f"\nOverall: FAILED ({len(failed)}/{len(results)} tests failed)")
        sys.exit(1)
    else:
        print(f"\n✓ All {len(results)} tests passed!")

Link to Details (Full Report)

python
# Generate HTML report
generate_html_report(results, "report.html")

# Upload to S3
upload_to_s3("report.html", f"s3://benchmarks/{commit_sha}/report.html")

# Comment on PR
comment_on_pr(f"Benchmark report: https://benchmarks.example.com/{commit_sha}/report.html")

Real-World Regression Examples

RAG System

python
test_cases = [
    {
        "question": "What is the capital of France?",
        "expected_faithfulness": ">= 0.9",
        "expected_relevance": ">= 0.9"
    },
    # ... 1000 more tests
]

# Run on every change
results = evaluate_rag(test_cases)

# Check thresholds
assert results["avg_faithfulness"] >= 0.9
assert results["avg_relevance"] >= 0.9

Chatbot

python
test_cases = [
    {
        "input": "Hello",
        "expected_quality": ">= 4.0",
        "expected_safety": "safe"
    },
    # ... 1000 more tests
]

# Run on every change
results = evaluate_chatbot(test_cases)

# Check thresholds
assert results["avg_quality"] >= 4.0
assert results["safety_violations"] == 0

Code Generation

python
test_cases = [
    {
        "prompt": "Write a function to reverse a string",
        "expected_correctness": "passes unit tests",
        "expected_style": "follows PEP 8"
    },
    # ... 1000 more tests
]

# Run on every change
results = evaluate_code_generation(test_cases)

# Check thresholds
assert results["correctness_rate"] >= 0.90
assert results["style_violations"] <= 0.10

Tools

pytest (Test Framework)

python
import pytest

def test_accuracy():
    results = run_benchmarks()
    assert results["accuracy"] >= 0.90

def test_latency():
    results = run_benchmarks()
    assert results["latency_p95"] <= 500

Custom Evaluation Scripts

See examples throughout this document

CI/CD Platforms

  • GitHub Actions
  • GitLab CI
  • CircleCI
  • Jenkins

Summary

Regression Benchmarks: Test suite to detect performance degradation

Why:

  • Catch regressions before deployment
  • New features can break existing functionality
  • Model/prompt changes can reduce quality

Components:

  • Test cases (inputs + expected outputs)
  • Evaluation metrics
  • Acceptance thresholds
  • Automated execution (CI/CD)

Test Types:

  • Accuracy, quality, latency, safety, consistency

Thresholds:

  • Absolute (>= 90% accuracy)
  • Relative (no worse than -2% vs baseline)
  • Per-category (stricter for important categories)

Running:

  • Trigger: Every PR, nightly, model updates
  • Duration: Fast (minutes)
  • Reporting: Clear pass/fail + details

Handling Failures:

  • Intended change → Update benchmark
  • Unintended regression → Fix issue
  • Flaky test → Improve stability

Golden Set:

  • 100 most critical tests
  • Always passing
  • Regularly updated

Continuous:

  • PR: Quick (100 tests)
  • Nightly: Full (1000 tests)
  • Model updates: Comprehensive

Optimization:

  • Cache responses
  • Parallelize execution
  • Sample for fast feedback

CI/CD:

  • Required check (block merge)
  • Clear error messages
  • Link to full report