phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

SKILL.md
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
metadata:
  author: oss@arize.com
  version: "1.0.0"
  languages: Python, TypeScript
---

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
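As an illustration of "code first", here is a minimal sketch of a deterministic evaluator in plain Python. It does not use any Phoenix API; the function name `json_has_required_keys` and the `label`/`score`/`explanation` result shape are assumptions chosen for the example, not something this skill prescribes.

```python
import json


def json_has_required_keys(output: str, required_keys: set[str]) -> dict:
    """Hypothetical code evaluator: pass only if `output` is a JSON object
    that contains every key in `required_keys`."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"label": "fail", "score": 0, "explanation": "output is not valid JSON"}

    if not isinstance(parsed, dict):
        return {"label": "fail", "score": 0, "explanation": "output is not a JSON object"}

    # Sorted so the explanation is stable from run to run.
    missing = sorted(required_keys - set(parsed))
    if missing:
        return {"label": "fail", "score": 0, "explanation": f"missing keys: {missing}"}

    return {"label": "pass", "score": 1, "explanation": "all required keys present"}


if __name__ == "__main__":
    print(json_has_required_keys('{"answer": "42"}', {"answer", "sources"}))
```

A check like this is cheap and repeatable, so it can run on every output, leaving an LLM judge for the cases that need nuance.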

Quick Reference

| Task | Files |
| --- | --- |
| Setup | setup-python, setup-typescript |
| Build code evaluator | evaluators-code-{python\|typescript} |
| Build LLM evaluator | evaluators-llm-{python\|typescript}, evaluators-custom-templates |
| Run experiment | experiments-running-{python\|typescript} |
| Create dataset | experiments-datasets-{python\|typescript} |
| Validate evaluator | validation, validation-calibration-{python\|typescript} |
| Analyze errors | error-analysis, axial-coding |
| RAG evals | evaluators-rag |
| Production | production-overview, production-guardrails |

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → evaluators-{code|llm}-{python|typescript} → validation-calibration-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overview → production-guardrails → production-continuous
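For the retrieval step of the RAG workflow, a code evaluator can be as small as a hit check over the top-k results. The sketch below is plain Python rather than a Phoenix API; `retrieval_hit_at_k`, the document-ID inputs, and the default `k=5` are all hypothetical choices made for illustration.

```python
def retrieval_hit_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> dict:
    """Hypothetical retrieval evaluator: pass if at least one of the top-k
    retrieved document IDs appears in the human-labelled relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")

    top_k = retrieved_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    passed = bool(hits)

    return {
        "label": "pass" if passed else "fail",
        "score": int(passed),
        "explanation": f"{len(hits)} of top {len(top_k)} retrieved documents are relevant",
    }


if __name__ == "__main__":
    print(retrieval_hit_at_k(["d7", "d2", "d9"], {"d2", "d4"}, k=3))
```

The result is binary on purpose, matching the pass/fail preference in Key Principles; faithfulness of the generated answer is a separate question better suited to an LLM evaluator.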

Rule Categories

| Prefix | Description |
| --- | --- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Calibrating judges |
| production-* | CI/CD, monitoring |

Key Principles

| Principle | Action |
| --- | --- |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
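The ">80% TPR/TNR" target can be checked with a few lines of plain Python once a judge's pass/fail labels sit next to human labels for the same examples. This is a sketch under stated assumptions: `True` means "pass", the human label is treated as ground truth, and the helper names `judge_agreement` and `is_validated` are hypothetical rather than part of Phoenix.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compute the judge's true-positive and true-negative rates against
    human labels (assumption: True = pass, human label = ground truth)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")

    tp = sum(h and j for h, j in zip(human, judge))
    fn = sum(h and not j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    fp = sum(not h and j for h, j in zip(human, judge))

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human-labelled pass and one fail")

    return {"tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}


def is_validated(rates: dict[str, float], threshold: float = 0.80) -> bool:
    """Both rates must strictly exceed the threshold."""
    return rates["tpr"] > threshold and rates["tnr"] > threshold


if __name__ == "__main__":
    human = [True, True, True, True, True, False, False, False, False, False]
    judge = [True, True, True, True, False, False, False, False, False, True]
    rates = judge_agreement(human, judge)
    print(rates, is_validated(rates))
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that passes almost everything from looking good on a dataset where most examples pass.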