ai-system-evaluation

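End-to-end AI system evaluation: model selection, benchmarking, cost/latency analysis, and build-vs-buy decisions. Use this skill when selecting models, designing evaluation pipelines, or making architecture decisions.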

SKILL.md
---
name: ai-system-evaluation
description: End-to-end AI system evaluation - model selection, benchmarks, cost/latency analysis, build vs buy decisions. Use when selecting models, designing eval pipelines, or making architecture decisions.
---

AI System Evaluation

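This skill covers evaluating AI systems end to end: choosing evaluation criteria, selecting a model, deciding whether to build or buy, and interpreting public benchmarks.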

Evaluation Criteria

1. Domain-Specific Capability

| Domain | Benchmarks |
|---|---|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |
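
Code benchmarks such as HumanEval are usually scored with pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed. The function name `pass_at_k` and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 samples per problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Average this value over all problems in the benchmark to get the reported score.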

2. Generation Quality

| Criterion | Measurement |
|---|---|
| Factual Consistency | NLI, SAFE, SelfCheckGPT |
| Coherence | AI judge rubric |
| Relevance | Semantic similarity |
| Fluency | Perplexity |
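
As a small illustration of the fluency row, perplexity is the exponential of the average negative log-probability per token. The sketch below assumes you already have per-token natural-log probabilities from whichever model scores the text. The function name `perplexity` and the sample values are illustrative, not part of any library.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean(log p(token))); lower means the text is less surprising."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# Hypothetical output where every token had probability 0.25:
# the result is approximately 4.0, as uncertain as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 4))
```

Perplexity depends on the scoring model and its tokenizer, so it is only comparable across outputs scored by the same model.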

3. Cost & Latency

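```python
from dataclasses import dataclass


@dataclass
class PerformanceMetrics:
    ttft: float        # Time to first token (seconds)
    tpot: float        # Time per output token (seconds)
    throughput: float  # Tokens per second

    def cost(self, input_tokens: int, output_tokens: int,
             prices: dict[str, float]) -> float:
        # prices maps "input" and "output" to a per-token price
        return input_tokens * prices["input"] + output_tokens * prices["output"]
```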
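
The latency and price figures combine into per-request estimates. The sketch below makes two assumptions: full-response latency is roughly TTFT plus TPOT for each remaining token, and prices are quoted per million tokens. The helper names and all numbers are made up for illustration and are not real provider prices.

```python
def end_to_end_latency(ttft: float, tpot: float, output_tokens: int) -> float:
    """Approximate seconds to receive a full response.

    The first token arrives after `ttft`; each remaining token adds `tpot`.
    """
    return ttft + tpot * max(output_tokens - 1, 0)


def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> float:
    """Cost in USD when prices are quoted per one million tokens."""
    total = (input_tokens * usd_per_million_input
             + output_tokens * usd_per_million_output)
    return total / 1_000_000


# Illustrative numbers only.
latency = end_to_end_latency(ttft=0.5, tpot=0.02, output_tokens=101)
cost = request_cost(2_000, 500, usd_per_million_input=1.0,
                    usd_per_million_output=4.0)
print(f"latency ~ {latency:.2f}s, cost ~ ${cost:.4f}")
```

Output tokens are often priced higher than input tokens, so long generations tend to dominate both cost and latency.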

Model Selection Workflow

```
1. Define Requirements
   ├── Task type
   ├── Quality threshold
   ├── Latency requirements (<2s TTFT)
   ├── Cost budget
   └── Deployment constraints

2. Filter Options
   ├── API vs Self-hosted
   ├── Open source vs Proprietary
   └── Size constraints

3. Benchmark on Your Data
   ├── Create eval dataset (100+ examples)
   ├── Run experiments
   └── Analyze results

4. Make Decision
   └── Balance quality, cost, latency
```
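
Steps 3 and 4 of the workflow can be sketched as a small selection loop. This is one possible decision rule, not the only one: it assumes exact-match scoring, treats the quality, latency, and cost requirements as hard filters, and then picks the cheapest surviving model. `Candidate`, `accuracy`, and `select_model` are hypothetical names, and the toy "models" are plain functions standing in for real clients.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    name: str
    generate: Callable[[str], str]  # hypothetical: prompt -> model output
    cost_per_request: float         # USD, measured or estimated
    ttft_seconds: float             # measured time to first token


def accuracy(candidate: Candidate, dataset: list[tuple[str, str]]) -> float:
    """Fraction of examples whose output exactly matches the reference."""
    hits = sum(candidate.generate(prompt).strip() == expected
               for prompt, expected in dataset)
    return hits / len(dataset)


def select_model(candidates: list[Candidate], dataset: list[tuple[str, str]],
                 min_quality: float, max_ttft: float,
                 max_cost: float) -> Optional[Candidate]:
    """Keep candidates that meet every requirement, then pick the cheapest."""
    viable = [
        c for c in candidates
        if c.ttft_seconds <= max_ttft
        and c.cost_per_request <= max_cost
        and accuracy(c, dataset) >= min_quality
    ]
    return min(viable, key=lambda c: c.cost_per_request, default=None)


# Toy stand-ins; a real eval set should have 100+ examples.
answers = {"2+2": "4", "3+3": "6"}
dataset = list(answers.items())
large = Candidate("large", lambda p: answers[p], cost_per_request=0.02, ttft_seconds=1.2)
small = Candidate("small", lambda p: "4", cost_per_request=0.002, ttft_seconds=0.4)

chosen = select_model([large, small], dataset,
                      min_quality=0.9, max_ttft=2.0, max_cost=0.05)
print(chosen.name if chosen else "no model meets the requirements")
```

If no candidate passes, the function returns `None`, which is a signal to relax a requirement or widen the candidate pool.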

Build vs Buy

| Factor | API | Self-Host |
|---|---|---|
| Data Privacy | Less control | Full control |
| Performance | Best models | Slightly behind |
| Cost at Scale | Expensive | Amortized |
| Customization | Limited | Full control |
| Maintenance | Zero | Significant |
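
The "Cost at Scale" row can be made concrete with a break-even estimate. The sketch below assumes a simple cost model: the API charges purely per token, while self-hosting has a fixed monthly cost (hardware, operations) plus an optional marginal cost per token. The function name and all dollar figures are hypothetical.

```python
def break_even_tokens(self_host_fixed_usd: float,
                      api_usd_per_million: float,
                      self_host_usd_per_million: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    saving_per_million = api_usd_per_million - self_host_usd_per_million
    if saving_per_million <= 0:
        # Self-hosting costs at least as much per token, so it never catches up.
        return float("inf")
    return self_host_fixed_usd / saving_per_million * 1_000_000


# Hypothetical: $6,000/month fixed, API at $5 per million tokens,
# self-hosted marginal cost of $1 per million tokens.
volume = break_even_tokens(6_000, api_usd_per_million=5.0,
                           self_host_usd_per_million=1.0)
print(f"break-even at about {volume:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper. Above it, the fixed cost is amortized and self-hosting wins on cost, before accounting for maintenance effort.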

Public Benchmarks

| Benchmark | Focus |
|---|---|
| MMLU | Knowledge (57 subjects) |
| HumanEval | Code generation |
| GSM-8K | Math reasoning |
| TruthfulQA | Factuality |
| MT-Bench | Multi-turn chat |

Caution: Public benchmark scores can be gamed, and data contamination (benchmark items leaking into training data) is common. Always evaluate on your own data.
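
One rough heuristic for spotting contamination is to measure how many of an eval example's word n-grams also appear verbatim in text the model may have seen. The sketch below is a minimal version of that idea, assuming whitespace tokenization and that you have some reference corpus to compare against. `ngrams` and `overlap_ratio` are hypothetical helpers, and a high ratio is a warning sign rather than proof.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example: str, corpus_text: str, n: int = 8) -> float:
    """Share of the example's n-grams that also occur in the corpus text."""
    example_grams = ngrams(example, n)
    if not example_grams:
        return 0.0  # example is shorter than n words
    shared = example_grams & ngrams(corpus_text, n)
    return len(shared) / len(example_grams)


# Toy strings with a short n so the overlap is visible.
ratio = overlap_ratio("the quick brown fox jumps",
                      "we saw the quick brown fox yesterday", n=3)
print(f"{ratio:.2f} of the example's 3-grams appear in the corpus")
```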

Best Practices

  1. Test on domain-specific data
  2. Measure both quality and cost
  3. Consider latency requirements
  4. Plan for fallback models
  5. Re-evaluate periodically
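
Planning for fallback models (item 4) can be as simple as trying an ordered list of models until one succeeds. The sketch below assumes each model is a callable that takes a prompt and either returns text or raises. `generate_with_fallback` and the two toy models are hypothetical stand-ins for real clients.

```python
from typing import Callable, Sequence


def generate_with_fallback(prompt: str,
                           models: Sequence[Callable[[str], str]]) -> str:
    """Try each model in order and return the first successful response."""
    errors: list[Exception] = []
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # in real code, catch the client's specific errors
            errors.append(exc)
    raise RuntimeError(f"all {len(models)} models failed: {errors}")


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary model timed out")


def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


print(generate_with_fallback("hello", [flaky_primary, backup]))
```

Fallback models should go through the same evaluation as the primary, since their quality is what users see during an outage.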