RAG Architecture
Purpose
Design a Retrieval-Augmented Generation pipeline, including document processing, chunking strategy, embedding pipeline, vector database selection, retrieval optimization, and context assembly.
Inputs
- Source documents (type, volume, update frequency)
- Query patterns (user questions, search terms, structured queries)
- Quality requirements (relevance threshold, hallucination tolerance)
- Latency requirements (real-time, near-real-time, batch)
- Cost constraints (embedding costs, storage costs, query costs)
Process
Step 1: Analyze Source Documents
Understand what's being indexed:
- Document types: PDFs, web pages, code, structured data, conversations
- Volume: Number of documents, total size, growth rate
- Update frequency: Static corpus, daily updates, real-time
- Structure: Highly structured (tables, headers) vs unstructured (prose, transcripts)
- Quality: Clean text vs noisy (OCR artifacts, HTML remnants, duplicates)
Step 2: Design Chunking Strategy
Choose how to split documents:
- Fixed-size: 500-1000 tokens with 100-200 token overlap. Simple, but may split a concept across chunks.
- Semantic: Split on paragraph/section boundaries. Preserves meaning, but chunk size varies.
- Hierarchical: Parent-child chunks (section summary + detail chunks). Best for complex docs.
- Recursive: Split on the coarsest boundary first (sections, then paragraphs, then sentences) and keep splitting until every chunk fits the size target.
- Metadata enrichment: Attach source, section title, and page number to each chunk, whichever splitting method is used.
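As a rough illustration of the fixed-size option with overlap and metadata, here is a minimal sketch. It assumes whitespace-separated words are an acceptable stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and the `Chunk` class and `chunk_fixed` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str       # metadata: originating document
    start_token: int  # metadata: offset of the chunk's first token


def chunk_fixed(text: str, source: str, size: int = 500, overlap: int = 100) -> list[Chunk]:
    """Split text into windows of `size` tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(Chunk(" ".join(window), source, start))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With a 1200-token document, `size=500`, and `overlap=100`, this yields three chunks starting at tokens 0, 400, and 800; the last 100 tokens of each chunk reappear at the start of the next.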
Step 3: Select Embedding Model
Choose the embedding approach:
- OpenAI text-embedding-3-small/large: Strong general-purpose default; 1536 and 3072 dimensions respectively
- Cohere embed-v3: Strong multilingual, supports search and classification modes
- Open source (BGE, E5): Self-hosted, lower cost at scale, quality varies by model and domain
- Considerations: Dimension size (storage), context length, multilingual support, cost per token
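The storage and cost considerations above come down to simple arithmetic. The sketch below assumes vectors are stored as 4-byte floats and ignores index overhead and metadata; the function names are hypothetical, and the per-token price is a placeholder to be replaced with the provider's current rate.

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; excludes index structures and metadata."""
    return num_chunks * dimensions * bytes_per_value


def embedding_cost_usd(total_tokens: int, price_per_million_tokens: float) -> float:
    """One-off cost of embedding a corpus at a given (placeholder) price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# One million chunks at 1536 vs 3072 dimensions (float32).
small = index_size_bytes(1_000_000, 1536)   # 6_144_000_000 bytes, about 6.1 GB
large = index_size_bytes(1_000_000, 3072)   # 12_288_000_000 bytes, about 12.3 GB
```

Doubling the dimension doubles raw storage, so the larger model is worth it only if it measurably improves retrieval on your own queries.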
Step 4: Select Vector Database
Choose storage and retrieval:
| Database | Hosted | Open Source | Hybrid Search | Best For |
|---|---|---|---|---|
| Pinecone | Yes | No | Yes (sparse+dense) | Production, managed |
| Weaviate | Yes | Yes | Yes (BM25+vector) | Self-hosted, rich filtering |
| ChromaDB | No | Yes | No | Prototyping, local dev |
| pgvector | Via managed Postgres (e.g. Supabase) | Yes | Separate (Postgres full-text search) | Already using Postgres |
| Qdrant | Yes | Yes | Yes | High-performance, filtering |
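The "Hybrid Search" column refers to combining a keyword ranking (such as BM25) with a vector ranking. Each database has its own fusion logic; when the two searches run separately, as with pgvector, one common way to merge them is reciprocal rank fusion. The sketch below is an illustration of that technique, not any database's built-in method; the function name and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # hypothetical keyword-search result IDs
vector_hits = ["b", "d", "a"]    # hypothetical vector-search result IDs
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists ("a" and "b" here) rise above those found by only one search.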
Step 5: Design Retrieval Pipeline
Build the query-time pipeline:
- Query preprocessing: Expand abbreviations, detect intent, generate sub-queries
- Embedding: Encode the query with the same model used for the documents
- Initial retrieval: Top-K vector search (K=20-50)
- Reranking: Cross-encoder reranker to reorder by relevance (return top 5-10)
- Context assembly: Combine the retrieved chunks into a prompt and add their metadata
- Generation: LLM call with assembled context + user query
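The stages above can be sketched end to end in a few functions. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the brute-force scan replaces a vector database, and `rerank` is an optional hypothetical callable standing in for a cross-encoder. The LLM call itself is left out.

```python
import math
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for an embedding model: lowercase bag-of-words counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(
    query: str,
    chunks: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    rerank: Optional[Callable[[str, str], float]] = None,
    k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Top-k similarity search, optional rerank of the candidates, then keep top_n."""
    query_vec = embed(query)  # same embed function as used for the chunks
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]
    if rerank is not None:
        candidates = sorted(candidates, key=lambda c: rerank(query, c), reverse=True)
    return candidates[:top_n]


def assemble_prompt(query: str, chunks: list[str]) -> str:
    """Number each chunk so the answer can cite its sources."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

Keeping `k` larger than `top_n` gives the reranker room to promote chunks that the first-pass similarity search ranked too low.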
Step 6: Design Quality Metrics
Define how to measure RAG quality:
- Retrieval metrics: Recall@K (what fraction of the relevant docs appear in the top K?), MRR (how high does the first relevant doc rank, averaged over queries?)
- Generation metrics: Faithfulness (does the answer stick to the context?), relevance (does it answer the question?)
- End-to-end: Answer accuracy on a golden dataset, hallucination rate
- Monitoring: Track retrieval scores over time, flag low-confidence answers
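The two retrieval metrics are easy to compute once a golden dataset maps each query to its relevant document IDs. A minimal sketch follows; the function names and the shape of the input (ranked ID lists plus sets of relevant IDs) are assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)
```

For example, if one query finds its first relevant document at rank 1 and another at rank 2, the MRR is (1 + 0.5) / 2 = 0.75.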
Output Format
````markdown
# RAG Architecture

## Source Analysis

| Attribute | Value |
|-----------|-------|
| Document types | [Types] |
| Corpus size | [Size] |
| Update frequency | [Frequency] |

## Chunking Strategy

**Method:** [Fixed/Semantic/Hierarchical]
**Target chunk size:** [X tokens]
**Overlap:** [X tokens]
**Metadata:** [Fields attached to each chunk]

## Embedding Pipeline

**Model:** [Name]
**Dimensions:** [N]
**Cost:** [$X per 1M tokens]
**Batch processing:** [Strategy for initial load vs incremental updates]

## Vector Database

**Choice:** [Database]
**Rationale:** [Why this DB]
**Index configuration:** [HNSW params, quantization, etc.]
**Hybrid search:** [BM25 + vector approach]

## Retrieval Pipeline

```
Query → [Preprocess] → [Embed] → [Vector Search (top 20)] → [Rerank (top 5)] → [Assemble Context] → [LLM] → [Validate] → Response
```

| Stage | Latency | Cost |
|-------|---------|------|
| Embedding | Xms | $X |
| Vector search | Xms | $X |
| Reranking | Xms | $X |
| Generation | Xms | $X |
| **Total** | **Xms** | **$X** |

## Quality Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Recall@10 | >90% | Golden dataset |
| Faithfulness | >95% | Automated scoring |
| Hallucination rate | <5% | Reference checking |

## Cost Model

| Component | Monthly Cost (at X queries/day) |
|-----------|-------------------------------|
| Embeddings | $X |
| Vector DB | $X |
| Reranking | $X |
| Generation | $X |
| **Total** | **$X** |
````
Quality Checks
- Chunking strategy is justified against document structure (not just a default of 500 tokens)
- Embedding model matches the query language and domain
- Retrieval pipeline includes reranking (not just raw vector similarity)
- Cost model accounts for both indexing and query costs
- Quality metrics have defined targets and a measurement approach
- Update strategy handles incremental changes (without re-indexing everything)
- Latency budget is broken down by pipeline stage
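For the incremental-update check, one possible approach is to store a content hash per document at index time and compare it against the current corpus, so that only new or changed documents are re-embedded. The sketch below assumes documents are identified by stable IDs; `content_hash` and `plan_update` are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with the current corpus.

    indexed:      doc_id -> hash recorded when the document was last indexed
    current_docs: doc_id -> current document text
    """
    current = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed),
        "update": sorted(d for d in current if d in indexed and indexed[d] != current[d]),
        "delete": sorted(d for d in indexed if d not in current),
    }
```

Only the IDs under "add" and "update" need to be chunked and embedded again; IDs under "delete" should have their chunks removed from the vector database.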