RAG Implementer
Build production-ready retrieval-augmented generation systems.
Core Principle
RAG = Retrieval + Context Assembly + Generation
Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.
⚠️ Prerequisites & Cost Reality Check
STOP: Have You Validated the Need for RAG?
Before implementing RAG, confirm:
- • Problem validated - Completed
product-strategistPhase 1 (problem discovery) - • Users need AI search - Tested with simpler alternatives (see below)
- • ROI justified - Calculated cost vs benefit of RAG vs alternatives
Try These FIRST (Before RAG)
RAG is powerful but expensive. Try cheaper alternatives first:
1. FAQ Page / Documentation (1 day, $0)
- •Create well-organized FAQ or docs
- •Add search with Cmd+F
- •Works for: <50 common questions, static content
- •Test: Do users find answers? If yes, stop here.
2. Simple Keyword Search (2-3 days, $0-20/month)
- •Use Algolia, Typesense, or PostgreSQL full-text search
- •Good enough for 80% of use cases
- •Works for: <100k documents, keyword matching sufficient
- •Test: Do users get relevant results? If yes, stop here.
3. Manual Curation (Concierge MVP) (1 week, $0)
- •Manually answer user questions
- •Build FAQ from common questions
- •Works for: <100 users, validating if users want AI
- •Test: Do users value your answers enough to pay? If yes, consider RAG.
4. Simple Semantic Search (1 week, $30-50/month)
- •Use OpenAI embeddings + Postgres pgvector
- •Skip complex retrieval, re-ranking, etc.
- •Works for: <50k documents, basic semantic search
- •Test: Are embeddings better than keyword search? If no, stop here.
Cost Reality Check
Naive RAG (Prototype):
- •Time: 1-2 weeks
- •Cost: $50-150/month (vector DB + embeddings + API calls)
- •When: Prototype, <10k documents, proof of concept
Advanced RAG (Production):
- •Time: 3-4 weeks
- •Cost: $200-500/month (hybrid search, re-ranking, monitoring)
- •When: Production, 10k-1M documents, validated demand
Modular RAG (Enterprise):
- •Time: 6-8 weeks
- •Cost: $500-2000+/month (multiple KBs, specialized modules)
- •When: Enterprise, 1M+ documents, mission-critical
Decision Tree: Do You Really Need RAG?
Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
├─ <50 items? → FAQ page ✅ ($0)
│
└─ >50 items?
├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
│
└─ Need semantic understanding?
├─ <50k docs? → Simple semantic (pgvector) ✅ ($30/mo)
│
└─ >50k docs?
├─ Validated with users? → Build RAG ✅
└─ Not validated? → Test with Concierge MVP first ⚠️
Validation Checklist
Only proceed with RAG implementation if:
- • Tested simpler alternatives (FAQ, keyword search, manual curation)
- • Users confirmed they need AI-powered search (not just you think they do)
- • Calculated ROI: cost of RAG < value users get
- • Have >50k documents OR complex semantic search requirements
- • Budget: $200-500/month for infrastructure
- • Time: 3-4 weeks for production implementation
If any checkbox is unchecked: Go back to product-strategist or mvp-builder skills to validate first.
See also: PLAYBOOKS/validation-first-development.md for step-by-step validation process.
8-Phase RAG Implementation
Phase 1: Knowledge Base Design
Goal: Create well-structured knowledge foundation
Actions:
- •Map data sources (internal: docs, databases, APIs / external: web, feeds)
- •Filter noise, select authoritative content (prevent "data dump fallacy")
- •Define chunking strategy: semantic chunking based on structure
- •Add metadata: tags, timestamps, source identifiers, categories
Validation:
- • All data sources catalogued and prioritized
- • Data quality assessed (accuracy, completeness, freshness)
- • Chunking strategy tested with sample documents
- • Metadata schema validated for search effectiveness
Common Chunking Strategies:
- •Fixed-size: 500-1000 tokens, 50-100 token overlap
- •Semantic: By paragraph, section headers, or topic boundaries
- •Recursive: Split by structure (markdown headers, code blocks)
Phase 2: Embedding Strategy
Goal: Choose optimal embedding approach for semantic understanding
Actions:
- •Select embedding model:
text-embedding-3-large(1536 dim) for general, domain-specific for specialized - •Plan multi-modal needs (text, code, images, tables)
- •Decide on fine-tuning: use domain data if general embeddings underperform
- •Establish similarity benchmarks
Validation:
- • Embedding model benchmarked on domain data
- • Retrieval accuracy tested with known query-document pairs
- • Storage and compute costs validated
Model Selection:
- •General: OpenAI
text-embedding-3-large,text-embedding-3-small - •Code:
code-search-babbage-code-001or StarEncoder - •Multilingual:
multilingual-e5-large
Phase 3: Vector Store Architecture
Goal: Implement scalable vector database
Actions:
- •Choose vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
- •Configure index: HNSW for speed, IVF for scale
- •Plan scalability: data growth and query volume
- •Implement backup, recovery, security
Validation:
- • Vector store benchmarked under expected load
- • Index optimized for retrieval speed and accuracy
- • Backup and recovery tested
- • Security controls implemented
Vector DB Decision:
- •Managed cloud → Pinecone
- •Self-hosted, feature-rich → Weaviate
- •Lightweight, local → Chroma
- •Cost-conscious → pgvector (Postgres extension)
- •High-performance → Qdrant
Phase 4: Retrieval Pipeline
Goal: Build sophisticated retrieval beyond simple similarity search
Actions:
- •Implement hybrid retrieval: semantic search + keyword (BM25)
- •Add query enhancement: expansion, reformulation, multi-query
- •Apply contextual filtering: metadata, temporal constraints, relevance ranking
- •Design for query types: factual (precision), analytical (breadth), creative (diversity)
- •Handle edge cases: no relevant results found
Advanced Techniques:
- •Re-ranking: Use cross-encoder after initial retrieval (e.g.,
cross-encoder/ms-marco-MiniLM-L-12-v2) - •Query routing: Route different query types to specialized strategies
- •Ensemble methods: Combine multiple retrieval approaches
- •Adaptive retrieval: Adjust top-k based on query complexity
Validation:
- • Retrieval accuracy tested across diverse query types
- • Hybrid retrieval outperforms single-method baselines
- • Query latency meets requirements (<500ms ideal)
- • Edge cases and fallbacks tested
Phase 5: Context Assembly
Goal: Transform retrieved chunks into optimal LLM context
Actions:
- •Rank and select: prioritize by relevance score, recency, source authority
- •Synthesize: merge related chunks, avoid redundancy
- •Compress: use LLMLingua or similar for token optimization
- •Mitigate "lost in the middle": place critical info at start/end
- •Adapt dynamically: adjust context based on conversation history
Context Engineering Integration:
- •Blend RAG results with system instructions and user prompts
- •Maintain conversation coherence across multi-turn interactions
- •Implement context persistence for follow-up queries
- •Balance context size vs. information density
Validation:
- • Context relevance validated against human judgments
- • Token optimization maintains accuracy
- • Multi-turn conversations maintain coherence
- • Assembly latency <200ms
Phase 6: Evaluation & Metrics
Goal: Measure RAG system performance comprehensively
Retrieval Quality:
- •Precision@K: Fraction of top-K results that are relevant
- •Recall@K: Fraction of relevant docs in top-K
- •MRR (Mean Reciprocal Rank): Average rank of first relevant result
- •NDCG: Ranking quality with graded relevance
Generation Quality:
- •Faithfulness: Generated content accuracy vs. sources
- •Answer Relevance: Response relevance to query
- •Context Utilization: How effectively LLM uses retrieved info
- •Hallucination Rate: Frequency of unsupported claims
System Performance:
- •End-to-End Latency: Query to answer (<3 seconds target)
- •Retrieval Latency: Time to retrieve and rank (<500ms)
- •Token Efficiency: Information density per token
- •Cost Per Query: Combined retrieval + generation costs
Validation:
- • Baseline metrics established
- • A/B testing framework for config comparisons
- • Automated evaluation pipeline deployed
- • Human evaluation protocols for ground truth
Phase 7: Production Deployment
Goal: Deploy with enterprise-grade reliability and security
Deployment:
- •Containerize with Docker/Kubernetes
- •Implement load balancing across RAG instances
- •Add caching for frequent queries
- •Graceful degradation: fallback to base model on component failure
Security:
- •Role-based access controls for knowledge base
- •Data masking and PII protection
- •Audit logging for compliance
- •Prompt injection defense
Monitoring:
- •Real-time metrics dashboard (latency, cost, accuracy)
- •Query analysis for patterns and failure modes
- •Cost tracking and optimization alerts
- •Performance profiling for bottlenecks
Validation:
- • Production handles expected traffic
- • Security prevents unauthorized access
- • Monitoring provides actionable insights
- • Incident response procedures tested
Phase 8: Continuous Improvement
Goal: Establish processes for ongoing enhancement
Data Pipeline:
- •Automated knowledge base updates (real-time or scheduled)
- •Quality monitoring: detect data drift and degradation
- •Source diversification: add new data sources
- •Feedback integration: user corrections and preferences
Model Evolution:
- •Evaluate and migrate to improved embeddings
- •Fine-tune on domain data regularly
- •Upgrade architecture: Naive → Advanced → Modular RAG
- •Expand multi-modal support (images, audio, video)
Optimization:
- •Analyze query patterns, optimize for common needs
- •Improve cache hit rates
- •Tune vector indices regularly
- •Balance performance vs. costs
Validation:
- • Automated improvement pipelines functioning
- • Performance trends show improvement
- • User satisfaction increasing
- • System adapts to changing needs
Key RAG Principles
1. Relevance Over Volume
- •Quality curation > massive datasets
- •Remove outdated/low-quality content continuously
- •Prioritize most relevant info to prevent "lost in the middle"
2. Semantic Understanding
- •Use embeddings for true semantic matching, not just keywords
- •Recognize query intent (factual, analytical, creative)
- •Adapt retrieval strategy based on context
3. Multi-Modal Intelligence
- •Handle text, images, code, tables, structured data
- •Enable cross-modal retrieval (text query → image results)
- •Preserve document structure and formatting
4. Temporal Awareness
- •Prioritize recent info for time-sensitive topics
- •Maintain historical access when relevant
- •Integrate real-time data feeds for dynamic domains
5. Transparency & Trust
- •Always provide source citations
- •Indicate confidence levels
- •Explain why specific information was selected
Standard RAG Response Format
{
"answer": "Generated response incorporating retrieved information",
"sources": [
{
"content": "Retrieved text chunk",
"source": "Document/URL identifier",
"relevance_score": 0.95,
"chunk_id": "unique_identifier"
}
],
"confidence": 0.87,
"retrieval_metadata": {
"chunks_retrieved": 5,
"retrieval_time_ms": 150,
"generation_time_ms": 800
}
}
Critical Success Rules
Non-Negotiable:
- •✅ Source attribution for every response
- •✅ Validate generated content against sources (prevent hallucination)
- •✅ Filter sensitive data before retrieval
- •✅ Respond within latency thresholds (<3 seconds)
- •✅ Monitor and optimize costs continuously
- •✅ Comply with security policies
- •✅ Graceful degradation on failures
- •✅ Comprehensive testing before production
Quality Gates:
- •Before Production: >85% accuracy on evaluation dataset
- •Ongoing: User satisfaction >4.0/5.0
- •Performance: 95th percentile <5 seconds
- •Reliability: 99.5% uptime
- •Cost: Within 10% of budget
Advanced Patterns
Modular RAG Architecture
- •Search Module: Query understanding and reformulation
- •Memory Module: Long-term conversation persistence
- •Routing Module: Query routing to specialized knowledge bases
- •Predict Module: Anticipatory pre-loading based on context
Hybrid RAG + Fine-tuning
- •RAG for dynamic, frequently changing knowledge
- •Fine-tuning for domain-specific reasoning patterns
- •Combine strengths for maximum effectiveness
Related Resources
Related Skills:
- •
multi-agent-architect- For complex RAG orchestration - •
knowledge-graph-builder- For structured knowledge integration - •
performance-optimizer- For RAG system optimization
Related Patterns:
- •
META/DECISION-FRAMEWORK.md- Vector DB and embedding selection - •
STANDARDS/architecture-patterns/rag-pattern.md- RAG architecture details (when created)
Related Playbooks:
- •
PLAYBOOKS/deploy-rag-system.md- RAG deployment procedure (when created)