RAG Implementer

Build production-ready retrieval-augmented generation systems.

Core Principle

RAG = Retrieval + Context Assembly + Generation

Use RAG when you need LLMs to access fresh, domain-specific, or proprietary knowledge that wasn't in their training data.

⚠️ Prerequisites & Cost Reality Check

STOP: Have You Validated the Need for RAG?

Before implementing RAG, confirm:

• Problem validated - Completed product-strategist Phase 1 (problem discovery)
• Users need AI search - Tested with simpler alternatives (see below)
• ROI justified - Calculated cost vs benefit of RAG vs alternatives

Try These FIRST (Before RAG)

RAG is powerful but expensive. Try cheaper alternatives first:

1. FAQ Page / Documentation (1 day, $0)

•Create well-organized FAQ or docs
•Rely on the browser's in-page search (Cmd+F / Ctrl+F)
•Works for: <50 common questions, static content
•Test: Do users find answers? If yes, stop here.

2. Simple Keyword Search (2-3 days, $0-20/month)

•Use Algolia, Typesense, or PostgreSQL full-text search
•Good enough for 80% of use cases
•Works for: <100k documents, keyword matching sufficient
•Test: Do users get relevant results? If yes, stop here.

3. Manual Curation (Concierge MVP) (1 week, $0)

•Manually answer user questions
•Build FAQ from common questions
•Works for: <100 users, validating if users want AI
•Test: Do users value your answers enough to pay? If yes, consider RAG.

4. Simple Semantic Search (1 week, $30-50/month)

•Use OpenAI embeddings + Postgres pgvector
•Skip complex retrieval, re-ranking, etc.
•Works for: <50k documents, basic semantic search
•Test: Are embeddings better than keyword search? If no, stop here.

Cost Reality Check

Naive RAG (Prototype):

•Time: 1-2 weeks
•Cost: $50-150/month (vector DB + embeddings + API calls)
•When: Prototype, <10k documents, proof of concept

Advanced RAG (Production):

•Time: 3-4 weeks
•Cost: $200-500/month (hybrid search, re-ranking, monitoring)
•When: Production, 10k-1M documents, validated demand

Modular RAG (Enterprise):

•Time: 6-8 weeks
•Cost: $500-2000+/month (multiple KBs, specialized modules)
•When: Enterprise, 1M+ documents, mission-critical

Decision Tree: Do You Really Need RAG?

Do users need to search your content?
│
├─ No → Don't build RAG ❌
│
└─ Yes
   ├─ <50 items? → FAQ page ✅ ($0)
   │
   └─ >50 items?
      ├─ Keyword search enough? → Use Algolia ✅ ($0-20/mo)
      │
      └─ Need semantic understanding?
         ├─ <50k docs? → Simple semantic (pgvector) ✅ ($30-50/mo)
         │
         └─ >50k docs?
            ├─ Validated with users? → Build RAG ✅
            └─ Not validated? → Test with Concierge MVP first ⚠️
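
The tree above can also be written as a small function, which is handy for checklists or intake scripts. This is only a sketch: the function name and parameters are hypothetical, and because the tree does not say what happens at exactly 50 items or 50k documents, those boundary values are assumed to fall into the larger branch.

```python
def recommend_search_approach(
    needs_search: bool,
    item_count: int,
    keyword_search_sufficient: bool,
    validated_with_users: bool,
) -> str:
    """Walk the decision tree top to bottom and return the recommended step."""
    if not needs_search:
        return "Don't build RAG"
    if item_count < 50:
        return "FAQ page"
    if keyword_search_sufficient:
        return "Keyword search (e.g. Algolia)"
    # From here on, semantic understanding is assumed to be required.
    if item_count < 50_000:
        return "Simple semantic search (pgvector)"
    if validated_with_users:
        return "Build RAG"
    return "Test with Concierge MVP first"


print(recommend_search_approach(True, 120_000, False, False))
```

The order of the checks mirrors the tree, so the cheapest option that fits is always returned first.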

Validation Checklist

Only proceed with RAG implementation if:

• Tested simpler alternatives (FAQ, keyword search, manual curation)
• Users confirmed they need AI-powered search (not just that you think they do)
• Calculated ROI: cost of RAG < value users get
• Have >50k documents OR complex semantic search requirements
• Budget: $200-500/month for infrastructure
• Time: 3-4 weeks for production implementation

If any checkbox is unchecked: Go back to product-strategist or mvp-builder skills to validate first.

See also: PLAYBOOKS/validation-first-development.md for step-by-step validation process.

8-Phase RAG Implementation

Phase 1: Knowledge Base Design

Goal: Create well-structured knowledge foundation

Actions:

•Map data sources (internal: docs, databases, APIs / external: web, feeds)
•Filter noise, select authoritative content (prevent "data dump fallacy")
•Define chunking strategy: semantic chunking based on structure
•Add metadata: tags, timestamps, source identifiers, categories

Validation:

• All data sources catalogued and prioritized
• Data quality assessed (accuracy, completeness, freshness)
• Chunking strategy tested with sample documents
• Metadata schema validated for search effectiveness

Common Chunking Strategies:

•Fixed-size: 500-1000 tokens, 50-100 token overlap
•Semantic: By paragraph, section headers, or topic boundaries
•Recursive: Split by structure (markdown headers, code blocks)
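
A minimal sketch of the fixed-size strategy with overlap, carrying a little metadata on each chunk. It assumes whitespace-separated words are an acceptable stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the `Chunk` fields are illustrative rather than a required schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str          # source identifier, e.g. a file path or URL
    index: int           # position of the chunk within its source
    tags: list[str] = field(default_factory=list)


def chunk_fixed_size(text: str, source: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    tokens = text.split()  # assumption: words approximate tokens
    step = size - overlap
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + size]
        chunks.append(Chunk(text=" ".join(window), source=source, index=index))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(12))
for chunk in chunk_fixed_size(sample, source="demo.txt", size=5, overlap=2):
    print(chunk.index, chunk.text)
```

The same `Chunk` record can hold the output of semantic or recursive splitting; only the function that produces the windows changes.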

Phase 2: Embedding Strategy

Goal: Choose optimal embedding approach for semantic understanding

Actions:

•Select embedding model: text-embedding-3-large (3072 dim by default) or text-embedding-3-small (1536 dim) for general use, domain-specific models for specialized content
•Plan multi-modal needs (text, code, images, tables)
•Decide on fine-tuning: use domain data if general embeddings underperform
•Establish similarity benchmarks

Validation:

• Embedding model benchmarked on domain data
• Retrieval accuracy tested with known query-document pairs
• Storage and compute costs validated

Model Selection:

•General: OpenAI text-embedding-3-large, text-embedding-3-small
•Code: StarEncoder or another code-specific embedding model (OpenAI's legacy code-search-babbage-code-001 has been deprecated)
•Multilingual: multilingual-e5-large
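
One way to benchmark a candidate model against known query-document pairs is a hit rate at K: for each query, check whether the expected document appears in the top K results by cosine similarity. The sketch below assumes the embeddings have already been computed by whichever model is under test; the tiny two-dimensional vectors and all function names are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query (exact search)."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(pairs: list[tuple[list[float], str]], doc_vecs: dict[str, list[float]], k: int) -> float:
    """Fraction of (query vector, expected doc id) pairs whose expected doc is in the top k."""
    if not pairs:
        return 0.0
    hits = sum(1 for query_vec, expected in pairs if expected in top_k(query_vec, doc_vecs, k))
    return hits / len(pairs)


doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
pairs = [([1.0, 0.1], "a"), ([0.1, 1.0], "b"), ([1.0, 0.0], "b")]
print(hit_rate_at_k(pairs, doc_vecs, k=1))
```

Running the same pairs through two candidate models and comparing the hit rates gives a quick, domain-specific signal before committing to storage and compute costs.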

Phase 3: Vector Store Architecture

Goal: Implement scalable vector database

Actions:

•Choose vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector)
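•Configure index: HNSW for low-latency, high-recall queries (at higher memory cost); IVF for very large collections where memory and build time matter more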
•Plan scalability: data growth and query volume
•Implement backup, recovery, security

Validation:

• Vector store benchmarked under expected load
• Index optimized for retrieval speed and accuracy
• Backup and recovery tested
• Security controls implemented

Vector DB Decision:

•Managed cloud → Pinecone
•Self-hosted, feature-rich → Weaviate
•Lightweight, local → Chroma
•Cost-conscious → pgvector (Postgres extension)
•High-performance → Qdrant

Phase 4: Retrieval Pipeline

Goal: Build sophisticated retrieval beyond simple similarity search

Actions:

•Implement hybrid retrieval: semantic search + keyword (BM25)
•Add query enhancement: expansion, reformulation, multi-query
•Apply contextual filtering: metadata, temporal constraints, relevance ranking
•Design for query types: factual (precision), analytical (breadth), creative (diversity)
•Handle edge cases: no relevant results found

Advanced Techniques:

•Re-ranking: Use cross-encoder after initial retrieval (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2)
•Query routing: Route different query types to specialized strategies
•Ensemble methods: Combine multiple retrieval approaches
•Adaptive retrieval: Adjust top-k based on query complexity

Validation:

• Retrieval accuracy tested across diverse query types
• Hybrid retrieval outperforms single-method baselines
• Query latency meets requirements (<500ms ideal)
• Edge cases and fallbacks tested
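
This guide does not prescribe how to merge semantic and keyword results. One common option is Reciprocal Rank Fusion (RRF), which combines ranked lists without needing their scores to be comparable. The sketch below assumes each retriever returns an ordered list of document ids; the constant `k=60` is the conventional default for RRF, and the function names are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_retrieve(semantic_hits: list[str], keyword_hits: list[str], top_k: int = 5) -> list[str]:
    """Return the fused top-k ids; an empty list signals the 'no relevant results' edge case."""
    return reciprocal_rank_fusion([semantic_hits, keyword_hits])[:top_k]


semantic = ["d1", "d2", "d3"]   # e.g. from a vector search
keyword = ["d3", "d1", "d4"]    # e.g. from BM25
print(hybrid_retrieve(semantic, keyword))
```

Callers can treat an empty result as the cue to return a "no answer found" response rather than generating from no context.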

Phase 5: Context Assembly

Goal: Transform retrieved chunks into optimal LLM context

Actions:

•Rank and select: prioritize by relevance score, recency, source authority
•Synthesize: merge related chunks, avoid redundancy
•Compress: use LLMLingua or similar for token optimization
•Mitigate "lost in the middle": place critical info at start/end
•Adapt dynamically: adjust context based on conversation history

Context Engineering Integration:

•Blend RAG results with system instructions and user prompts
•Maintain conversation coherence across multi-turn interactions
•Implement context persistence for follow-up queries
•Balance context size vs. information density

Validation:

• Context relevance validated against human judgments
• Token optimization maintains accuracy
• Multi-turn conversations maintain coherence
• Assembly latency <200ms
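
The rank, deduplicate, budget, and reorder steps can be sketched in a few lines. This is an illustration under simplifying assumptions: chunks arrive as (text, relevance score) pairs, duplicates are detected only by normalized exact match, words stand in for tokens, and the edge-placement heuristic is one simple way to address "lost in the middle", not the only one.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Alternate items between the front and the back so the weakest land in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


def assemble_context(chunks: list[tuple[str, float]], max_tokens: int) -> list[str]:
    """Select chunks by relevance within a token budget, then order them for the prompt."""
    seen: set[str] = set()
    selected: list[str] = []
    used = 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        key = " ".join(text.lower().split())   # normalize case and whitespace
        if key in seen:
            continue                            # skip redundant chunk
        cost = len(text.split())                # assumption: words approximate tokens
        if used + cost > max_tokens:
            continue                            # does not fit; try smaller chunks
        seen.add(key)
        selected.append(text)
        used += cost
    return place_at_edges(selected)


chunks = [
    ("alpha one", 0.9),
    ("beta two", 0.8),
    ("Alpha  one", 0.7),
    ("gamma three", 0.6),
    ("delta four five six seven", 0.5),
]
print(assemble_context(chunks, max_tokens=6))
```

With five ranked items a, b, c, d, e, `place_at_edges` yields a, c, e, d, b: the best chunk opens the context, the second best closes it, and the weakest sits in the middle.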

Phase 6: Evaluation & Metrics

Goal: Measure RAG system performance comprehensively

Retrieval Quality:

•Precision@K: Fraction of top-K results that are relevant
•Recall@K: Fraction of all relevant docs that appear in the top-K
•MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result across queries
•NDCG: Ranking quality with graded relevance

Generation Quality:

•Faithfulness: Whether generated claims are supported by the retrieved sources
•Answer Relevance: Response relevance to query
•Context Utilization: How effectively LLM uses retrieved info
•Hallucination Rate: Frequency of unsupported claims

System Performance:

•End-to-End Latency: Query to answer (<3 seconds target)
•Retrieval Latency: Time to retrieve and rank (<500ms)
•Token Efficiency: Information density per token
•Cost Per Query: Combined retrieval + generation costs

Validation:

• Baseline metrics established
• A/B testing framework for config comparisons
• Automated evaluation pipeline deployed
• Human evaluation protocols for ground truth
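
The retrieval metrics above are simple enough to compute directly, which helps when bootstrapping an evaluation pipeline before adopting a framework. A minimal sketch, assuming each retrieved list contains unique document ids and that relevance labels come from your own ground-truth set:

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0  # no relevant result was retrieved


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """NDCG with graded relevance; `gains` maps doc id to its relevance grade."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 1) for i, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4", "d9"}
print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 4))
```

Generation-quality metrics such as faithfulness usually need an LLM judge or human review and are not covered by this sketch.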

Phase 7: Production Deployment

Goal: Deploy with enterprise-grade reliability and security

Deployment:

•Containerize with Docker/Kubernetes
•Implement load balancing across RAG instances
•Add caching for frequent queries
•Graceful degradation: fallback to base model on component failure

Security:

•Role-based access controls for knowledge base
•Data masking and PII protection
•Audit logging for compliance
•Prompt injection defense
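
As a rough illustration of data masking, the sketch below redacts email addresses and phone-like numbers before text is indexed or logged. The regular expressions are deliberately simple assumptions: they will miss some formats and can over-match digit runs such as dates or order numbers, so production systems typically rely on a dedicated PII detection tool.

```python
import re

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
```

Masking at ingestion time keeps sensitive values out of the vector store entirely, which is easier to reason about than filtering at query time.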

Monitoring:

•Real-time metrics dashboard (latency, cost, accuracy)
•Query analysis for patterns and failure modes
•Cost tracking and optimization alerts
•Performance profiling for bottlenecks

Validation:

• Production handles expected traffic
• Security prevents unauthorized access
• Monitoring provides actionable insights
• Incident response procedures tested
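
Caching and graceful degradation can be combined in a thin wrapper around the pipeline. The sketch below is hypothetical: `rag_answer` and `base_answer` stand for whatever callables your system exposes, the cache is an unbounded in-memory dict (a real deployment would more likely use a shared cache with expiry), and the `mode` field is an illustrative way to surface which path served the request.

```python
from typing import Callable


class RagService:
    def __init__(self, rag_answer: Callable[[str], str], base_answer: Callable[[str], str]):
        self._rag_answer = rag_answer      # full retrieval + generation path
        self._base_answer = base_answer    # base model without retrieval
        self._cache: dict[str, str] = {}

    def answer(self, query: str) -> dict[str, str]:
        key = " ".join(query.lower().split())  # normalize so trivial variants share a cache entry
        if key in self._cache:
            return {"answer": self._cache[key], "mode": "cache"}
        try:
            result = self._rag_answer(query)
        except Exception:
            # Degrade instead of failing; fallback answers are not cached.
            return {"answer": self._base_answer(query), "mode": "fallback"}
        self._cache[key] = result
        return {"answer": result, "mode": "rag"}


service = RagService(lambda q: f"rag answer to: {q}", lambda q: f"base answer to: {q}")
print(service.answer("What is RAG?"))
print(service.answer("what is  rag?"))
```

Exposing the mode makes it easy to track cache hit rates and fallback frequency on the monitoring dashboard.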

Phase 8: Continuous Improvement

Goal: Establish processes for ongoing enhancement

Data Pipeline:

•Automated knowledge base updates (real-time or scheduled)
•Quality monitoring: detect data drift and degradation
•Source diversification: add new data sources
•Feedback integration: user corrections and preferences

Model Evolution:

•Evaluate and migrate to improved embeddings
•Fine-tune on domain data regularly
•Upgrade architecture: Naive → Advanced → Modular RAG
•Expand multi-modal support (images, audio, video)

Optimization:

•Analyze query patterns, optimize for common needs
•Improve cache hit rates
•Tune vector indices regularly
•Balance performance vs. costs

Validation:

• Automated improvement pipelines functioning
• Performance trends show improvement
• User satisfaction increasing
• System adapts to changing needs

Key RAG Principles

1. Relevance Over Volume

•Quality curation > massive datasets
•Remove outdated/low-quality content continuously
•Prioritize most relevant info to prevent "lost in the middle"

2. Semantic Understanding

•Use embeddings for true semantic matching, not just keywords
•Recognize query intent (factual, analytical, creative)
•Adapt retrieval strategy based on context

3. Multi-Modal Intelligence

•Handle text, images, code, tables, structured data
•Enable cross-modal retrieval (text query → image results)
•Preserve document structure and formatting

4. Temporal Awareness

•Prioritize recent info for time-sensitive topics
•Maintain historical access when relevant
•Integrate real-time data feeds for dynamic domains

5. Transparency & Trust

•Always provide source citations
•Indicate confidence levels
•Explain why specific information was selected

Standard RAG Response Format

{
  "answer": "Generated response incorporating retrieved information",
  "sources": [
    {
      "content": "Retrieved text chunk",
      "source": "Document/URL identifier",
      "relevance_score": 0.95,
      "chunk_id": "unique_identifier"
    }
  ],
  "confidence": 0.87,
  "retrieval_metadata": {
    "chunks_retrieved": 5,
    "retrieval_time_ms": 150,
    "generation_time_ms": 800
  }
}
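
To keep responses consistent with this format, the structure can be modeled as typed records and serialized in one place. A minimal sketch using only the standard library; the class names are illustrative, while the field names follow the JSON above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Source:
    content: str
    source: str
    relevance_score: float
    chunk_id: str


@dataclass
class RetrievalMetadata:
    chunks_retrieved: int
    retrieval_time_ms: int
    generation_time_ms: int


@dataclass
class RagResponse:
    answer: str
    sources: list[Source]
    confidence: float
    retrieval_metadata: RetrievalMetadata

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses and lists of dataclasses.
        return json.dumps(asdict(self), indent=2)


response = RagResponse(
    answer="Generated response incorporating retrieved information",
    sources=[Source("Retrieved text chunk", "docs/guide.md", 0.95, "c1")],
    confidence=0.87,
    retrieval_metadata=RetrievalMetadata(chunks_retrieved=1, retrieval_time_ms=150, generation_time_ms=800),
)
print(response.to_json())
```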

Critical Success Rules

Non-Negotiable:

•✅ Source attribution for every response
•✅ Validate generated content against sources (prevent hallucination)
•✅ Filter sensitive data before retrieval
•✅ Respond within latency thresholds (<3 seconds)
•✅ Monitor and optimize costs continuously
•✅ Comply with security policies
•✅ Graceful degradation on failures
•✅ Comprehensive testing before production

Quality Gates:

•Before Production: >85% accuracy on evaluation dataset
•Ongoing: User satisfaction >4.0/5.0
•Performance: 95th percentile end-to-end latency <5 seconds
•Reliability: 99.5% uptime
•Cost: Within 10% of budget
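
These gates lend themselves to an automated check in CI or a scheduled job. The sketch below is one possible encoding, with assumptions stated: the percentile uses the nearest-rank method, accuracy is a fraction between 0 and 1, and "within 10% of budget" is read as spend not exceeding 110% of budget. Function and key names are hypothetical.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_quality_gates(
    accuracy: float,
    satisfaction: float,
    latencies_s: list[float],
    uptime_pct: float,
    spend: float,
    budget: float,
) -> dict[str, bool]:
    """Return a pass/fail flag per gate."""
    return {
        "accuracy": accuracy > 0.85,
        "satisfaction": satisfaction > 4.0,
        "p95_latency": percentile(latencies_s, 95) < 5.0,
        "uptime": uptime_pct >= 99.5,
        "cost": spend <= budget * 1.10,  # assumption: overspend of up to 10% is tolerated
    }


gates = check_quality_gates(0.9, 4.2, [1.0] * 18 + [6.0] * 2, 99.9, 480.0, 500.0)
print([name for name, passed in gates.items() if not passed])
```

In this example two of twenty requests took 6 seconds, so the 95th percentile gate fails while the others pass.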

Advanced Patterns

Modular RAG Architecture

•Search Module: Query understanding and reformulation
•Memory Module: Long-term conversation persistence
•Routing Module: Query routing to specialized knowledge bases
•Predict Module: Anticipatory pre-loading based on context

Hybrid RAG + Fine-tuning

•RAG for dynamic, frequently changing knowledge
•Fine-tuning for domain-specific reasoning patterns
•Combine strengths for maximum effectiveness

Related Resources

Related Skills:

•multi-agent-architect - For complex RAG orchestration
•knowledge-graph-builder - For structured knowledge integration
•performance-optimizer - For RAG system optimization

Related Patterns:

•META/DECISION-FRAMEWORK.md - Vector DB and embedding selection
•STANDARDS/architecture-patterns/rag-pattern.md - RAG architecture details (when created)

Related Playbooks:

•PLAYBOOKS/deploy-rag-system.md - RAG deployment procedure (when created)