mlflow-evaluation

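MLflow 3 GenAI evaluation for agent development. Use it when: (1) writing mlflow.genai.evaluate() code; (2) creating @scorer functions; (3) building evaluation datasets from traces; (4) using built-in scorers (such as Guidelines, Correctness, Safety, and RetrievalGroundedness); (5) analyzing traces for latency, errors, or architecture issues; (6) optimizing agent context, prompts, and token usage; (7) debugging evaluation failures. Covers the full evaluation workflow: trace analysis → dataset building → scorer creation → evaluation execution.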

SKILL.md
---
name: mlflow-evaluation
description: "MLflow 3 GenAI evaluation for agent development. Use when (1) writing mlflow.genai.evaluate() code, (2) creating @scorer functions, (3) building evaluation datasets from traces, (4) using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), (5) analyzing traces for latency/errors/architecture, (6) optimizing agent context/prompts/token usage, (7) debugging evaluation failures. Covers the full eval workflow: trace analysis -> dataset building -> scorer creation -> evaluation execution."
---

MLflow 3 GenAI Evaluation

Before Writing Any Code

  1. Read GOTCHAS.md - 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md - Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
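
Steps 3 to 5 can be sketched in plain Python as follows. This is only an illustration: the trace records are ordinary dictionaries with hypothetical `request`, `tags`, and `expected_facts` fields, and the `eval_candidate` tag name is an assumption. Only the nested `inputs` key reflects the data format this skill calls for; see patterns-datasets.md for the actual patterns.

```python
from typing import Any


def traces_to_eval_records(
    traces: list[dict[str, Any]],
    include_tag: str = "eval_candidate",  # hypothetical tag name
) -> list[dict[str, Any]]:
    """Convert tagged trace-like dicts into nested evaluation records."""
    records = []
    for trace in traces:
        # Keep only traces that were explicitly tagged for inclusion.
        if trace.get("tags", {}).get(include_tag) != "true":
            continue
        # The evaluation data format nests the agent arguments under "inputs".
        record: dict[str, Any] = {"inputs": dict(trace["request"])}
        # Attach ground truth when it is available (assumed field name).
        if "expected_facts" in trace:
            record["expectations"] = {
                "expected_facts": list(trace["expected_facts"])
            }
        records.append(record)
    return records


# Hypothetical trace records standing in for real search results.
sample_traces = [
    {
        "request": {"query": "What is MLflow?"},
        "tags": {"eval_candidate": "true"},
        "expected_facts": ["MLflow is an open source platform"],
    },
    {"request": {"query": "ping"}, "tags": {}},
]

eval_records = traces_to_eval_records(sample_traces)
```

Untagged traces are skipped, so the resulting list contains only records that someone deliberately marked as evaluation candidates.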

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |
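
As a rough illustration of step 1, the sketch below totals latency per span name. `SpanRecord` is a hypothetical stand-in for a real trace span, with field names modeled on nanosecond start and end timestamps; check CRITICAL-interfaces.md for the actual span schema before adapting it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical stand-in for a trace span (not an MLflow class)."""
    name: str
    start_time_ns: int
    end_time_ns: int


def latency_by_span_name(spans: list[SpanRecord]) -> dict[str, float]:
    """Return total latency in milliseconds per span name, slowest first."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        # Convert the nanosecond duration to milliseconds.
        totals[span.name] += (span.end_time_ns - span.start_time_ns) / 1e6
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up timings for illustration only.
spans = [
    SpanRecord("retriever", 0, 120_000_000),
    SpanRecord("llm_call", 120_000_000, 920_000_000),
    SpanRecord("retriever", 920_000_000, 1_000_000_000),
]

profile = latency_by_span_name(spans)
```

Sorting slowest-first puts the most promising optimization target at the top of the result.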

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
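
The comparison in step 3 might look like the sketch below, once the baseline and current runs have each produced a dictionary of aggregate metrics. The metric names, the 0.02 tolerance, and the assumption that higher values are better are all illustrative rather than taken from patterns-evaluation.md.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumed noise threshold
) -> dict[str, float]:
    """Return metrics whose value dropped by more than `tolerance`.

    Assumes higher is better for every metric.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric missing from the current run; nothing to compare
        drop = base_value - current[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical aggregate metrics from two evaluation runs.
baseline_metrics = {"correctness/mean": 0.90, "safety/mean": 1.00}
current_metrics = {"correctness/mean": 0.80, "safety/mean": 1.00}

regressions = find_regressions(baseline_metrics, current_metrics)
```

Any metric reported here is a candidate for the trace-level analysis in steps 4 and 5.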

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
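
A minimal custom scorer might be structured like this. The word limit and the `response` output key are hypothetical, and the sketch assumes the `scorer` decorator from `mlflow.genai.scorers` can wrap a plain function that takes an `outputs` argument; confirm the exact interface in CRITICAL-interfaces.md. The MLflow import is deferred so the scoring logic can be unit-tested on its own.

```python
MAX_WORDS = 50  # hypothetical project-specific threshold


def response_is_concise(outputs) -> bool:
    """Pass when the agent's response stays within MAX_WORDS words."""
    # Accept either a raw string or a dict with an assumed "response" key.
    text = outputs if isinstance(outputs, str) else str(outputs.get("response", ""))
    return len(text.split()) <= MAX_WORDS


def build_scorer():
    """Wrap the plain function with MLflow's scorer decorator.

    Imported lazily so the logic above can be tested without MLflow installed.
    """
    from mlflow.genai.scorers import scorer

    return scorer(response_is_concise)
```

Keeping the check as an ordinary function makes it easy to test against edge cases before handing it to an evaluation run.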

Reference Files Quick Lookup

| Reference | Purpose | When to Read |
|-----------|---------|--------------|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| user-journeys.md | High-level workflows | When starting a new evaluation project |

Critical API Facts

  • Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())
  • Data format: {"inputs": {"query": "..."}} (nested structure required)
  • predict_fn: Receives **unpacked kwargs (not a dict)

See GOTCHAS.md for complete list.
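
The three facts above fit together roughly as in the sketch below. The placeholder agent and the choice of the `Safety` scorer are assumptions made for illustration, and the final call needs MLflow 3 plus a configured judge model, so it is wrapped in a function rather than executed on import.

```python
# Each row nests the agent's arguments under "inputs".
eval_data = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]


def predict_fn(query: str) -> dict:
    """Placeholder agent: receives the unpacked "inputs" keys as kwargs."""
    return {"response": f"You asked: {query}"}


# Conceptually, each row's "inputs" dict is unpacked into predict_fn like this.
preview = [predict_fn(**row["inputs"]) for row in eval_data]


def run_evaluation():
    """Run the evaluation (requires MLflow 3 and a configured judge model)."""
    import mlflow
    from mlflow.genai.scorers import Safety

    # Note: mlflow.genai.evaluate(), not mlflow.evaluate().
    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[Safety()],
    )
```

If `predict_fn` were declared as `predict_fn(inputs: dict)` instead, the unpacking step would fail with an unexpected keyword argument, which is the mistake the third bullet warns about.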