LLM Evaluation
Evaluate and validate LLM outputs for quality assurance using RAGAS and LLM-as-judge patterns.
Quick Reference
LLM-as-Judge Pattern
python
async def evaluate_quality(input_text: str, output_text: str, dimension: str) -> float:
response = await llm.chat([{
"role": "user",
"content": f"""Evaluate for {dimension}. Score 1-10.
Input: {input_text[:500]}
Output: {output_text[:1000]}
Respond with just the number."""
}])
return int(response.content.strip()) / 10
Quality Gate
python
QUALITY_THRESHOLD = 0.7
async def quality_gate(state: dict) -> dict:
scores = await full_quality_assessment(state["input"], state["output"])
passed = scores["average"] >= QUALITY_THRESHOLD
return {**state, "quality_passed": passed}
Hallucination Detection
python
async def detect_hallucination(context: str, output: str) -> dict:
# Check if output contains claims not in context
return {"has_hallucinations": bool, "unsupported_claims": []}
RAGAS Metrics (2026)
| Metric | Use Case | Threshold |
|---|---|---|
| Faithfulness | RAG grounding | ≥ 0.8 |
| Answer Relevancy | Q&A systems | ≥ 0.7 |
| Context Precision | Retrieval quality | ≥ 0.7 |
| Context Recall | Retrieval completeness | ≥ 0.7 |
Anti-Patterns (FORBIDDEN)
python
# ❌ NEVER use same model as judge and evaluated
output = await gpt4.complete(prompt)
score = await gpt4.evaluate(output) # Same model!
# ❌ NEVER use single dimension
if relevance_score > 0.7: # Only checking one thing
return "pass"
# ❌ NEVER set threshold too high
THRESHOLD = 0.95 # Blocks most content
# ✅ ALWAYS use different judge model
score = await gpt4_mini.evaluate(claude_output)
# ✅ ALWAYS use multiple dimensions
scores = await evaluate_all_dimensions(output)
if scores["average"] > 0.7:
return "pass"
Key Decisions
| Decision | Recommendation |
|---|---|
| Judge model | GPT-5.2-mini or Claude Haiku 4.5 |
| Threshold | 0.7 for production, 0.6 for drafts |
| Dimensions | 3-5 most relevant to use case |
| Sample size | 50+ for reliable metrics |
Detailed Documentation
| Resource | Description |
|---|---|
| references/evaluation-metrics.md | RAGAS & LLM-as-judge metrics |
| examples/evaluation-patterns.md | Complete evaluation examples |
| checklists/evaluation-checklist.md | Setup and review checklists |
| scripts/evaluator-template.py | Starter evaluation template |
Related Skills
- •
quality-gates- Workflow quality control - •
langfuse-observability- Tracking evaluation scores - •
agent-loops- Self-correcting with evaluation
Capability Details
llm-as-judge
Keywords: LLM judge, judge model, evaluation model, grader LLM Solves:
- •Use LLM to evaluate other LLM outputs
- •Implement judge prompts for quality
- •Configure evaluation criteria
ragas-metrics
Keywords: RAGAS, faithfulness, answer relevancy, context precision Solves:
- •Evaluate RAG with RAGAS metrics
- •Measure faithfulness and relevancy
- •Assess context precision and recall
hallucination-detection
Keywords: hallucination, factuality, grounded, verify facts Solves:
- •Detect hallucinations in LLM output
- •Verify factual accuracy
- •Implement grounding checks
quality-gates
Keywords: quality gate, threshold, pass/fail, evaluation gate Solves:
- •Implement quality thresholds
- •Block low-quality outputs
- •Configure multi-metric gates
batch-evaluation
Keywords: batch eval, dataset evaluation, bulk scoring, eval suite Solves:
- •Evaluate over golden datasets
- •Run batch evaluation pipelines
- •Generate evaluation reports
pairwise-comparison
Keywords: pairwise, A/B comparison, side-by-side, preference Solves:
- •Compare two model outputs
- •Implement preference ranking
- •Run A/B evaluations