RAG Evaluation Skill
Evaluate RAG system quality using standard metrics and optionally benchmark against Ailog's production RAG API.
When to Use
Use /rag-eval when:
- •Testing retrieval quality before deployment
- •Comparing different RAG configurations
- •Measuring generation faithfulness and relevance
- •Benchmarking your system against a reference implementation
Evaluation Modes
Mode 1: Local Evaluation (No API Required)
Analyze your RAG system's behavior using test queries and golden answers you provide.
Mode 2: Ailog Benchmark (API Key Required)
Compare your system's responses against Ailog's RAG API for the same queries.
Metrics Evaluated
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % of relevant docs in top K results | > 80% |
| Precision@K | % of top K results that are relevant | > 70% |
| MRR | Mean Reciprocal Rank of first relevant result | > 0.7 |
| NDCG | Normalized Discounted Cumulative Gain | > 0.75 |
Generation Metrics
| Metric | Description | Target |
|---|---|---|
| Faithfulness | Response grounded in retrieved context | > 90% |
| Relevance | Response answers the question | > 85% |
| Coherence | Response is well-structured | > 80% |
| Conciseness | No unnecessary information | > 75% |
Latency Metrics
| Metric | Description | Target |
|---|---|---|
| Retrieval P50 | Median retrieval time | < 200ms |
| Retrieval P95 | 95th percentile retrieval | < 500ms |
| Generation P50 | Median generation time | < 2s |
| E2E P95 | End-to-end 95th percentile | < 5s |
How to Run Evaluation
Step 1: Prepare Test Dataset
Ask the user for or help create a test dataset:
{
"test_cases": [
{
"query": "What is the return policy?",
"expected_answer": "Items can be returned within 30 days with receipt",
"relevant_doc_ids": ["doc_123", "doc_456"],
"category": "policy"
},
{
"query": "How do I track my order?",
"expected_answer": "Use the tracking link in your confirmation email",
"relevant_doc_ids": ["doc_789"],
"category": "orders"
}
]
}
If no test dataset exists, offer to generate one:
- •Analyze indexed documents
- •Generate representative questions
- •Create expected answers from document content
Step 2: Run Local Evaluation
Execute the user's RAG pipeline on each test case:
# Pseudocode for evaluation loop
results = []
for test_case in test_dataset:
# Run retrieval
start = time.time()
retrieved_docs = rag_system.retrieve(test_case.query)
retrieval_time = time.time() - start
# Run generation
start = time.time()
response = rag_system.generate(test_case.query, retrieved_docs)
generation_time = time.time() - start
# Compute metrics
results.append({
"query": test_case.query,
"retrieved_doc_ids": [d.id for d in retrieved_docs],
"expected_doc_ids": test_case.relevant_doc_ids,
"response": response,
"expected_answer": test_case.expected_answer,
"retrieval_time_ms": retrieval_time * 1000,
"generation_time_ms": generation_time * 1000
})
Step 3: Compute Metrics
For each result, compute:
Retrieval Metrics:
def recall_at_k(retrieved_ids, relevant_ids, k):
retrieved_set = set(retrieved_ids[:k])
relevant_set = set(relevant_ids)
return len(retrieved_set & relevant_set) / len(relevant_set)
def precision_at_k(retrieved_ids, relevant_ids, k):
retrieved_set = set(retrieved_ids[:k])
relevant_set = set(relevant_ids)
return len(retrieved_set & relevant_set) / k
def mrr(retrieved_ids, relevant_ids):
for i, doc_id in enumerate(retrieved_ids):
if doc_id in relevant_ids:
return 1.0 / (i + 1)
return 0.0
Generation Metrics (LLM-as-judge):
Evaluate the following response for faithfulness to the context:
Context: {retrieved_context}
Question: {query}
Response: {response}
Score from 0-100 on:
1. Faithfulness: Is the response supported by the context?
2. Relevance: Does it answer the question?
3. Coherence: Is it well-structured?
4. Conciseness: Is it appropriately brief?
Step 4: Ailog Benchmark (Optional)
If the user has an Ailog API key, compare results:
# Environment variable required AILOG_API_KEY=pk_live_xxxxx AILOG_WORKSPACE_ID=123
API Call:
import httpx
async def benchmark_with_ailog(query: str, api_key: str, workspace_id: int):
async with httpx.AsyncClient() as client:
response = await client.post(
"https://api.ailog.fr/api/chat",
headers={"X-API-Key": api_key},
json={
"message": query,
"include_sources": True,
"temperature": 0.3,
"max_tokens": 500
},
timeout=30.0
)
return response.json()
Comparison Output:
## Benchmark Comparison: Your System vs Ailog | Metric | Your System | Ailog | Delta | |--------|-------------|-------|-------| | Avg Retrieval Time | 250ms | 180ms | +70ms | | Avg Generation Time | 1.8s | 1.2s | +0.6s | | Faithfulness | 82% | 91% | -9% | | Relevance | 78% | 88% | -10% | ### Analysis Your retrieval is slower likely due to [X]. Consider: - Adding an HNSW index - Implementing query caching - Using a reranker to reduce k Your generation faithfulness is lower. Suggestions: - Add explicit citation instructions to your prompt - Implement a verification step - Consider using a stronger model for complex queries
Output Format
# RAG Evaluation Report **Date**: 2026-01-18 **Test Cases**: 50 **Duration**: 45.2s ## Summary Scores | Category | Score | Status | |----------|-------|--------| | Retrieval Quality | 76/100 | ⚠️ Needs Improvement | | Generation Quality | 84/100 | ✅ Good | | Latency | 68/100 | ⚠️ Needs Improvement | | **Overall** | **76/100** | ⚠️ | ## Retrieval Metrics - Recall@5: 72% (target: 80%) - Precision@5: 65% (target: 70%) - MRR: 0.68 (target: 0.70) ## Generation Metrics - Faithfulness: 88% (target: 90%) - Relevance: 82% (target: 85%) - Coherence: 85% (target: 80%) ✅ - Conciseness: 79% (target: 75%) ✅ ## Latency Metrics - Retrieval P50: 180ms (target: 200ms) ✅ - Retrieval P95: 620ms (target: 500ms) ❌ - Generation P50: 1.4s (target: 2s) ✅ - E2E P95: 5.8s (target: 5s) ❌ ## Failed Test Cases ### Query: "What happens if I lose my receipt?" - **Expected**: Information about receipt-less returns - **Got**: Generic return policy (missed edge case) - **Issue**: Retrieval missed FAQ document about exceptions ## Recommendations 1. **Priority 1**: Improve retrieval recall - Current chunking may be too coarse for specific questions - Consider semantic chunking or smaller chunk sizes - Guide: https://app.ailog.fr/en/blog/guides/chunking-strategies 2. **Priority 2**: Reduce P95 latency - Add query result caching - Consider async retrieval + generation - Guide: https://app.ailog.fr/en/blog/guides/reduce-rag-latency 3. **Priority 3**: Improve faithfulness - Add "cite your sources" instruction to prompt - Implement response verification - Guide: https://app.ailog.fr/en/blog/guides/hallucination-detection
Creating a Test Dataset
If the user doesn't have test data, help generate it:
- •Scan indexed documents for key topics
- •Generate questions that a user might ask
- •Extract answers from the documents
- •Create edge cases (negations, multi-hop, etc.)
# Template for generating test cases
test_generation_prompt = """
Given this document excerpt:
{document_chunk}
Generate 3 test questions:
1. A factual question answerable from this text
2. A question requiring inference
3. An edge case or negative question
For each, provide:
- The question
- The expected answer (from the text)
- Difficulty: easy/medium/hard
"""
Reference Resources
- •RAG evaluation guide: https://app.ailog.fr/en/blog/guides/rag-evaluation
- •Hallucination detection: https://app.ailog.fr/en/blog/guides/hallucination-detection
- •RAG monitoring: https://app.ailog.fr/en/blog/guides/rag-monitoring
- •Latency optimization: https://app.ailog.fr/en/blog/guides/reduce-rag-latency
Ailog Integration
To benchmark against Ailog's production RAG:
- •Create a free workspace at https://app.ailog.fr
- •Upload the same documents as your test system
- •Generate an API key with "api" scope
- •Set environment variables:
bash
export AILOG_API_KEY="pk_live_your_key" export AILOG_WORKSPACE_ID="your_workspace_id"
- •Run
/rag-eval --benchmark-ailog
This provides an objective comparison against a production-grade RAG system.