
libeval

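libeval: RAG evaluation system. Evaluator orchestrates quality assessment using the LLM-as-judge pattern: CriteriaEvaluator scores responses against rubrics, RecallEvaluator measures retrieval performance, TraceEvaluator analyzes execution traces, and EvalStore persists evaluation results. Use it for automated quality testing, RAG pipeline evaluation, and agent performance testing.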

SKILL.md
---
name: libeval
description: >
  libeval - RAG evaluation system. Evaluator orchestrates quality assessment
  using LLM-as-judge patterns. CriteriaEvaluator scores responses against
  rubrics. RecallEvaluator measures retrieval performance. TraceEvaluator
  analyzes execution traces. EvalStore persists results. Use for automated
  quality testing, RAG pipeline evaluation, and agent performance testing.
---

libeval Skill

When to Use

  • Evaluating RAG agent response quality
  • Measuring retrieval recall and precision
  • Running automated quality assessments
  • Benchmarking agent performance over time

Key Concepts

Evaluator: Main orchestrator that runs test cases through the agent and collects metrics.

CriteriaEvaluator: Uses LLM-as-judge to score responses against defined criteria and rubrics.

RecallEvaluator: Measures how well the retrieval system returns relevant documents.
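As a rough illustration of what retrieval metrics involve, here is a minimal sketch of recall@k and precision@k. The `retrievalMetrics` helper and its arguments are hypothetical and are not part of libeval's API; they only show the arithmetic.

```javascript
// Hypothetical helper, not part of libeval: computes recall@k and precision@k
// from a ranked list of retrieved document IDs and a list of relevant IDs.
function retrievalMetrics(retrievedIds, relevantIds, k = retrievedIds.length) {
  const relevant = new Set(relevantIds);
  // Take the top k results, then deduplicate so a repeated ID counts once.
  const topK = [...new Set(retrievedIds.slice(0, k))];
  const hits = topK.filter((id) => relevant.has(id)).length;

  return {
    // Share of the relevant documents that appear in the top k.
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
    // Share of the (deduplicated) top-k results that are relevant.
    precision: topK.length === 0 ? 0 : hits / topK.length,
  };
}

// Top 3 results are d1, d7, d3; two of the three relevant documents are found.
const metrics = retrievalMetrics(["d1", "d7", "d3", "d9"], ["d1", "d3", "d5"], 3);
console.log(metrics);
```

In this example both recall and precision come out to 2/3: two of the three relevant documents are retrieved, and two of the three retrieved documents are relevant.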

TraceEvaluator: Analyzes execution traces for performance and correctness.
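To make "performance and correctness" concrete, the sketch below summarizes a trace by total duration, slow spans, and failed spans. The span shape (`name`, `startMs`, `endMs`, `error`) and the `summarizeTrace` helper are assumptions for illustration; libeval's actual trace schema is not described here.

```javascript
// Hypothetical span shape: { name, startMs, endMs, error? }.
// This is an assumed format for illustration, not libeval's trace schema.
function summarizeTrace(spans, { maxSpanMs = 1000 } = {}) {
  const durations = spans.map((s) => ({ name: s.name, ms: s.endMs - s.startMs }));
  const totalMs =
    spans.length === 0
      ? 0
      : Math.max(...spans.map((s) => s.endMs)) - Math.min(...spans.map((s) => s.startMs));

  return {
    // Wall-clock time from the earliest start to the latest end.
    totalMs,
    // Names of spans that exceeded the per-span latency budget.
    slowSpans: durations.filter((d) => d.ms > maxSpanMs).map((d) => d.name),
    // Names of spans that recorded an error.
    failedSpans: spans.filter((s) => s.error).map((s) => s.name),
  };
}

const report = summarizeTrace([
  { name: "retrieve", startMs: 0, endMs: 120 },
  { name: "generate", startMs: 120, endMs: 1500 },
  { name: "tool:search", startMs: 200, endMs: 260, error: "timeout" },
]);
console.log(report);
```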

Usage Patterns

Pattern 1: Run evaluation suite

javascript
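import { Evaluator } from "@copilot-ld/libeval";

// `config` and `testCases` are supplied by the caller; their shapes are not shown here.
const evaluator = new Evaluator(config);
// Run every test case through the agent and collect metrics.
const results = await evaluator.run(testCases);
console.log(results.summary);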
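The contents of `results.summary` are not specified above. As a sketch of the kind of aggregation a summary might hold, the following assumes each per-case result carries a `score` between 0 and 1; the `summarize` helper, the result shape, and the 0.7 pass threshold are all hypothetical.

```javascript
// Hypothetical result shape: { id, score } with score in [0, 1].
// The fields and the threshold are assumptions, not libeval's documented output.
function summarize(results, passThreshold = 0.7) {
  const total = results.length;
  const passed = results.filter((r) => r.score >= passThreshold).length;
  const scoreSum = results.reduce((sum, r) => sum + r.score, 0);

  return {
    total,
    passed,
    failed: total - passed,
    // Guard against division by zero for an empty run.
    passRate: total === 0 ? 0 : passed / total,
    meanScore: total === 0 ? 0 : scoreSum / total,
  };
}

const summary = summarize([
  { id: "case-1", score: 0.9 },
  { id: "case-2", score: 0.5 },
  { id: "case-3", score: 0.8 },
  { id: "case-4", score: 0.6 },
]);
console.log(summary);
```

Tracking the same summary fields across runs is one simple way to benchmark agent performance over time.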

Pattern 2: Criteria-based evaluation

javascript
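import { CriteriaEvaluator } from "@copilot-ld/libeval";

// `llmClient` is the judge model client; `response` and `rubric` come from the caller.
const criteria = new CriteriaEvaluator(llmClient);
// Score a single response against the rubric.
const score = await criteria.evaluate(response, rubric);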
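For readers unfamiliar with the LLM-as-judge pattern, the sketch below shows one way a rubric could be turned into a prompt and the judge's reply into a weighted score. It is not how CriteriaEvaluator is implemented: the rubric shape, the prompt wording, the 1 to 5 rating scale, and the `stubClient.complete` method are all assumptions, and the stub returns a canned reply instead of calling a model.

```javascript
// Hypothetical rubric shape: a list of criteria, each with a weight.
const rubric = [
  { name: "accuracy", weight: 2 },
  { name: "completeness", weight: 1 },
];

// Build the judge prompt. The wording is illustrative only.
function buildJudgePrompt(response, rubric) {
  const names = rubric.map((c) => c.name).join(", ");
  return (
    `Rate the response from 1 to 5 on: ${names}.\n` +
    `Reply with JSON mapping each criterion to its rating.\n\nResponse:\n${response}`
  );
}

// Parse the judge's JSON reply and combine ratings into a weighted score in [0, 1].
function scoreJudgeReply(replyText, rubric) {
  const ratings = JSON.parse(replyText);
  let weighted = 0;
  let totalWeight = 0;
  for (const { name, weight } of rubric) {
    const rating = Number(ratings[name]);
    if (!Number.isFinite(rating) || rating < 1 || rating > 5) {
      throw new Error(`Missing or out-of-range rating for "${name}"`);
    }
    weighted += ((rating - 1) / 4) * weight; // map 1..5 onto 0..1
    totalWeight += weight;
  }
  return weighted / totalWeight;
}

// Stub standing in for a real LLM client; it ignores the prompt and returns a canned reply.
const stubClient = {
  complete: async (prompt) => '{"accuracy": 5, "completeness": 3}',
};

async function judge(client, response, rubric) {
  const reply = await client.complete(buildJudgePrompt(response, rubric));
  return scoreJudgeReply(reply, rubric);
}

judge(stubClient, "Paris is the capital of France.", rubric).then((score) =>
  console.log(score.toFixed(3)),
);
```

Validating the judge's reply before scoring matters in practice, since a model can omit a criterion or return a rating outside the requested range.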

Integration

Configuration lives in config/eval.yml. Run evaluations with make eval. LLM-as-judge scoring uses libllm.