Optimizing RAG Performance
Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.
When to Use This Skill
- •Improving retrieval accuracy and quality
- •Adding reranking to existing RAG pipeline
- •Reducing latency and improving response time
- •Deploying RAG to production
- •Implementing caching for faster re-processing
- •Scaling to handle more documents or queries
- •Optimizing costs (API usage, compute resources)
Quick Wins (Low Effort, High Impact)
1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)
Benefit: Brings most embedding models up to competitive retrieval quality
Effort: Minimal code change
from llama_index.postprocessor.cohere_rerank import CohereRerank
# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)
# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)
# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Rerank to top 5
)
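Best Practice: Retrieve more candidates than you need, then rerank down to the final top_k. The example above retrieves 10 and keeps 5 (2x); widen the candidate pool toward 10x if latency and cost allow.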
2. Enable Parallel Loading → 13x Speedup (30 minutes)
Benefit: 391s → 31s for 32 PDF files
Effort: One parameter change
from llama_index.core import SimpleDirectoryReader
# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()
# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)
3. Optimize Batch Size → Faster Embeddings (15 minutes)
Benefit: Reduced API calls, better throughput
Effort: Configuration change
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Texts per request; the default varies by integration and version
)
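To see why a larger batch size reduces API calls, here is a minimal sketch that assumes one embedding request per batch. The helpers `batched` and `count_requests` are hypothetical and are not part of LlamaIndex.

```python
import math
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def count_requests(num_texts: int, batch_size: int) -> int:
    """Number of requests needed if each batch is sent as one request."""
    return math.ceil(num_texts / batch_size)


# Hypothetical workload: 1,050 text chunks to embed
chunks = [f"chunk {i}" for i in range(1050)]
requests_small = sum(1 for _ in batched(chunks, 10))
requests_large = sum(1 for _ in batched(chunks, 100))
print(requests_small, requests_large)
```

Under that assumption, moving from a batch size of 10 to 100 cuts the request count by roughly a factor of ten for the same chunks.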
Quick Decision Guide
Reranker Selection
- •Best quality → CohereRerank (API-based, best performance)
- •Best open-source → bge-reranker-large (local, free)
- •Cost-effective → SentenceTransformerRerank (local, fast)
- •Multi-language → CohereRerank multilingual-v3.0
Caching Strategy
- •Development → Local cache (pipeline.persist)
- •Production → Redis cache (distributed)
- •When to clear → Model changes, schema updates
Production Deployment
- •Small scale (<1M docs) → Serverless (auto-scaling)
- •Large scale → Container-based (consistent performance)
- •Multi-tenant → Collection isolation + metadata filtering
Optimization Patterns
Pattern 1: Add Best-Practice Reranking
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)
Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR
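The two metrics quoted above can be computed as follows. This is a minimal sketch that assumes each test query has exactly one expected document id; the function names and the toy data are illustrative and do not come from the benchmark itself.

```python
from typing import Dict, List


def hit_rate(results: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of queries whose expected doc id appears in the retrieved list."""
    hits = sum(1 for query, doc in expected.items() if doc in results.get(query, []))
    return hits / len(expected)


def mean_reciprocal_rank(results: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Average of 1/rank of the expected doc id (0 when it is not retrieved)."""
    total = 0.0
    for query, doc in expected.items():
        ranked = results.get(query, [])
        if doc in ranked:
            total += 1.0 / (ranked.index(doc) + 1)
    return total / len(expected)


# Toy data: four queries, each with one expected document
expected = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
results = {
    "q1": ["a", "x"],            # expected doc at rank 1
    "q2": ["x", "b"],            # expected doc at rank 2
    "q3": ["x", "y"],            # expected doc missing
    "q4": ["y", "x", "z", "d"],  # expected doc at rank 4
}
print(hit_rate(results, expected), mean_reciprocal_rank(results, expected))
```

Running the same query set before and after adding a reranker gives a like-for-like comparison of both numbers.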
Pattern 2: Pipeline Caching (Local)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)
# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")
# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs
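The idea behind this kind of caching can be sketched with a cache keyed on a hash of the input text plus an identifier for the transformation. This is an illustration under that assumption, not LlamaIndex's actual implementation; `HashCache` and the `transform_id` values are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


class HashCache:
    """Caches transformation outputs keyed by (transformation id, input text)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0  # Counts how often the transformation actually ran

    @staticmethod
    def _key(transform_id: str, text: str) -> str:
        payload = f"{transform_id}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, transform_id: str, transform: Callable[[str], str],
            texts: List[str]) -> List[str]:
        outputs = []
        for text in texts:
            key = self._key(transform_id, text)
            if key not in self._store:
                self.misses += 1
                self._store[key] = transform(text)
            outputs.append(self._store[key])
        return outputs


cache = HashCache()
docs = ["alpha", "beta"]
cache.run("upper-v1", str.upper, docs)              # First run: every doc is a miss
misses_first = cache.misses
cache.run("upper-v1", str.upper, docs + ["gamma"])  # Re-run: only the new doc is a miss
misses_second = cache.misses
cache.run("upper-v2", str.upper, docs + ["gamma"])  # Changed transformation: all miss again
misses_third = cache.misses
print(misses_first, misses_second, misses_third)
```

Because the transformation identifier is part of the key, switching models or settings naturally bypasses old entries, which is also why stale entries should be cleared after such changes.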
Pattern 3: Redis Cache (Production)
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore
# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)
pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)
# Cache shared across instances
nodes = pipeline.run(documents=documents)
Pattern 4: Advanced Retrieval Strategies
Metadata Pre-filtering (Sub-50ms):
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)
query_engine = index.as_query_engine(
    filters=filters,  # Narrow search space first
    similarity_top_k=5
)
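Conceptually, pre-filtering shrinks the candidate set before any similarity scores are computed. The following is a minimal pure-Python sketch of that order of operations; the record layout, `filtered_search`, and the toy vectors are assumptions for illustration, and a real vector store applies the filter inside its own index.

```python
import math
from typing import Dict, List, Tuple

# (doc id, metadata, embedding vector)
Record = Tuple[str, Dict[str, str], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(records: List[Record], query_vec: List[float],
                    filters: Dict[str, str], top_k: int) -> List[str]:
    # Step 1: keep only records whose metadata matches every filter exactly
    candidates = [r for r in records
                  if all(r[1].get(key) == value for key, value in filters.items())]
    # Step 2: score and rank only the surviving candidates
    ranked = sorted(candidates, key=lambda r: cosine(r[2], query_vec), reverse=True)
    return [doc_id for doc_id, _, _ in ranked[:top_k]]


records: List[Record] = [
    ("doc-a", {"category": "technical", "year": "2024"}, [1.0, 0.0]),
    ("doc-b", {"category": "technical", "year": "2023"}, [1.0, 0.1]),
    ("doc-c", {"category": "legal", "year": "2024"}, [0.9, 0.1]),
    ("doc-d", {"category": "technical", "year": "2024"}, [0.0, 1.0]),
]
result = filtered_search(records, [1.0, 0.0],
                         {"category": "technical", "year": "2024"}, top_k=5)
print(result)
```

Only two of the four records are ever scored here, which is where the latency saving comes from at larger scale.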
Document Summary Retrieval (For 100+ docs):
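from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
# Synthesizer that writes each document's summary (tree_summarize is one reasonable choice)
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)
retriever = summary_index.as_retriever(similarity_top_k=3)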
Chunk Decoupling (Precision + Context):
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)
# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)
query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)
Pattern 5: Multiple Postprocessors Chain
from llama_index.core.postprocessor import (
    MetadataReplacementPostProcessor,
    SimilarityPostprocessor,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank
# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),  # Rerank top candidates
        MetadataReplacementPostProcessor(...)  # Expand context
    ]
)
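The chain applies its stages in order, each one receiving the previous stage's output. A minimal pure-Python sketch of that flow follows; the stage helpers and the keyword-overlap scorer are hypothetical stand-ins for a real cutoff filter and reranker model.

```python
from typing import Callable, List, Tuple

Node = Tuple[str, float]  # (text, score)
Stage = Callable[[List[Node]], List[Node]]


def similarity_cutoff(cutoff: float) -> Stage:
    """Drop nodes whose retrieval score is below the cutoff."""
    def stage(nodes: List[Node]) -> List[Node]:
        return [node for node in nodes if node[1] >= cutoff]
    return stage


def rerank_by(score_fn: Callable[[str], float], top_n: int) -> Stage:
    """Rescore every node with score_fn and keep the best top_n."""
    def stage(nodes: List[Node]) -> List[Node]:
        rescored = [(text, score_fn(text)) for text, _ in nodes]
        rescored.sort(key=lambda node: node[1], reverse=True)
        return rescored[:top_n]
    return stage


def run_chain(nodes: List[Node], stages: List[Stage]) -> List[Node]:
    for stage in stages:
        nodes = stage(nodes)
    return nodes


# Stand-in for a reranker: count query terms that appear in the text
query_terms = {"redis", "cache"}


def overlap(text: str) -> float:
    return float(len(query_terms & set(text.lower().split())))


nodes = [
    ("Redis cache setup", 0.82),
    ("Cache eviction in Redis", 0.75),
    ("Unrelated note", 0.40),
    ("Cache basics", 0.71),
]
result = run_chain(nodes, [similarity_cutoff(0.7), rerank_by(overlap, top_n=2)])
print([text for text, _ in result])
```

Placing the cheap cutoff first means the more expensive reranking stage only sees candidates that already cleared the score threshold.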
Your Codebase Integration
For src/ Pipeline
Add Reranking to All 7 Strategies:
- •src/10_basic_query_engine.py → Add CohereRerank
- •src/16_hybrid_search.py → Add reranker after fusion
- •All other strategies: Consistent reranking layer
Enable Caching:
- •src/09_enhanced_batch_embeddings.py → Add pipeline caching
- •src/02_prep_doc_for_embedding.py → Cache preprocessing
For src-iLand/ Pipeline
Thai-Optimized Reranking:
# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)
Fast Metadata Filtering (Already implemented):
- •src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
- •Inverted indices for: จังหวัด (province), อำเภอ (district), ประเภทโฉนด (deed type)
- •B-tree indices for: area, coordinates
Batch Processing Optimization:
- •src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
- •src-iLand/data_processing/ → Enable parallel loading
Detailed References
Load these when you need comprehensive details:
- •reference-reranking.md: Complete reranking guide
  - •6 reranker models with benchmarks
  - •Multi-language reranking
  - •Cost-performance trade-offs
  - •Node postprocessor types
- •reference-production.md: Production optimization patterns
  - •Ingestion pipeline caching (local & Redis)
  - •Parallel processing (13x speedup)
  - •Vector store integration
  - •Multi-tenancy patterns
  - •Deployment architectures
  - •Error handling and reliability
- •reference-advanced-retrieval.md: Advanced retrieval strategies
  - •Document summary retrieval
  - •Recursive retrieval
  - •Chunk decoupling
  - •Sub-question decomposition
  - •Fusion retrieval
  - •Auto retriever
Common Workflows
Workflow 1: Add Reranking to Existing Pipeline
- •Step 1: Choose reranker
  - •Load reference-reranking.md for comparison
  - •For Thai: CohereRerank multilingual-v3.0
  - •For cost: bge-reranker-large
- •Step 2: Install dependencies
  pip install llama-index-postprocessor-cohere-rerank
  # OR
  pip install llama-index-postprocessor-flag-embedding-reranker
- •Step 3: Update query engine
  - •Change similarity_top_k from 5 → 10
  - •Add reranker with top_n=5
- •Step 4: Test impact
  - •Compare retrieval quality before/after
  - •Expected: 5-15% hit rate improvement
- •Step 5: Deploy to all strategies
  - •Apply consistently across retrievers
Workflow 2: Enable Production Caching
- •Step 1: Choose caching backend
  - •Development → Local cache
  - •Production → Redis cache
- •Step 2: Wrap pipeline
  pipeline = IngestionPipeline(
      transformations=[splitter, embedder],
      # cache=... (local or Redis)
  )
- •Step 3: Initial run (builds cache)
  nodes = pipeline.run(documents=documents)
  pipeline.persist("./cache")  # Local only
- •Step 4: Subsequent runs (uses cache)
  - •Only processes new/changed documents
  - •Massive speedup for re-runs
- •Step 5: Monitor cache size
  - •Clear when storage grows too large
  - •Clear on model/schema changes
Workflow 3: Deploy to Production
- •Step 1: Review production checklist
  - •Load reference-production.md for full guide
- •Step 2: Implement error handling
  - •Retry logic for API calls
  - •Fallback strategies for failures
  - •Circuit breaker pattern
- •Step 3: Add monitoring
  - •Track retrieval latency (p50, p95, p99)
  - •Monitor hit rate and MRR
  - •Log embedding API usage
- •Step 4: Set up caching
  - •Redis for distributed systems
  - •Query result caching
  - •Embedding caching
- •Step 5: Deploy with redundancy
  - •Load balancing
  - •Health checks
  - •Graceful degradation
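The retry and fallback items in Step 2 can be sketched as below. This is one possible shape, assuming exponential backoff with a little jitter; `call_with_retry`, its parameters, and the demo functions are hypothetical and not part of any library used in this guide.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], *, max_attempts: int = 3,
                    base_delay: float = 0.5,
                    fallback: Optional[Callable[[], T]] = None,
                    sleep: Callable[[float], object] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff; use fallback if all attempts fail."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # In real code, catch only transient error types
            last_error = exc
            if attempt < max_attempts:
                delay = base_delay * 2 ** (attempt - 1)
                sleep(delay + random.uniform(0, delay / 10))  # Small jitter
    if fallback is not None:
        return fallback()  # Graceful degradation path
    assert last_error is not None
    raise last_error


# Demo: a call that fails twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "reranked"


def always_fail() -> str:
    raise ConnectionError("service down")


delays: List[float] = []  # Record delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
degraded = call_with_retry(always_fail, sleep=lambda _: None,
                           fallback=lambda: "vector-only")
print(result, degraded, len(delays))
```

A typical fallback for a failed reranker call is to return the un-reranked vector search results, so the user still gets an answer at slightly lower quality.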
Workflow 4: Optimize for Scale (100+ Documents)
- •Step 1: Add metadata filtering
  - •Tag documents with categories
  - •Filter before vector search (90% reduction)
- •Step 2: Implement document summaries
  - •Generate summaries for each document
  - •Two-stage retrieval: doc → chunk
- •Step 3: Enable fast metadata indexing
  - •Build inverted indices for categorical fields
  - •B-tree indices for numeric fields
  - •See src-iLand/retrieval/fast_metadata_index.py
- •Step 4: Use async operations
  retriever = index.as_retriever(use_async=True)
- •Step 5: Monitor and tune
  - •Adjust top_k based on precision/recall
  - •Optimize chunk size for domain
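The metadata indexing idea in Step 3 can be illustrated with a small in-memory sketch. It is not the code in src-iLand/retrieval/fast_metadata_index.py; `MetadataIndex` is hypothetical, and a sorted list searched with `bisect` stands in for a B-tree. The field names `province` and `area` are example keys.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Inverted index for categorical fields, sorted lists for numeric fields."""

    def __init__(self) -> None:
        # field -> value -> set of doc ids
        self._inverted: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set))
        # field -> list of (value, doc_id), kept sorted
        self._numeric: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def add(self, doc_id: str, categorical: Dict[str, str],
            numeric: Dict[str, float]) -> None:
        for field, value in categorical.items():
            self._inverted[field][value].add(doc_id)
        for field, number in numeric.items():
            bisect.insort(self._numeric[field], (number, doc_id))

    def match(self, field: str, value: str) -> Set[str]:
        """Doc ids whose categorical field equals value."""
        return set(self._inverted.get(field, {}).get(value, set()))

    def in_range(self, field: str, low: float, high: float) -> Set[str]:
        """Doc ids whose numeric field lies in [low, high]."""
        entries = self._numeric.get(field, [])
        start = bisect.bisect_left(entries, (low,))  # First entry with value >= low
        found: Set[str] = set()
        for number, doc_id in entries[start:]:
            if number > high:
                break
            found.add(doc_id)
        return found


index = MetadataIndex()
index.add("d1", {"province": "Bangkok"}, {"area": 120.0})
index.add("d2", {"province": "Chiang Mai"}, {"area": 80.0})
index.add("d3", {"province": "Bangkok"}, {"area": 300.0})

# Combine both lookups, then run vector search only on the surviving ids
candidates = index.match("province", "Bangkok") & index.in_range("area", 100.0, 200.0)
print(candidates)
```

Set intersection keeps the combined filter cheap, and the resulting ids can be passed to the vector search as the restricted candidate pool.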
Performance Benchmarks
Reranking Impact (from reference docs)
| Embedding | Without Rerank | With Cohere Rerank | Improvement |
|---|---|---|---|
| OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% |
| JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% |
| bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |
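The Improvement column is a relative change, not a difference in percentage points. The short check below recomputes it from the hit rates in the table; the function name is illustrative.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# (hit rate without rerank, hit rate with Cohere rerank), taken from the table
rows = {
    "OpenAI": (0.870, 0.927),
    "JinaAI Base": (0.880, 0.933),
    "bge-large": (0.820, 0.876),
}
improvements = {
    name: round(relative_improvement(before, after), 1)
    for name, (before, after) in rows.items()
}
print(improvements)
```

For OpenAI, for example, the absolute gain is 5.7 points of hit rate, which corresponds to the 6.6% relative improvement shown.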
Parallel Loading Impact
- •Sequential: 391 seconds (32 PDF files)
- •Parallel (10 workers): 31 seconds
- •Speedup: about 13x (391 s / 31 s ≈ 12.6)
Caching Impact
- •First run: Full processing time
- •Cached run: Only new/changed documents
- •Typical speedup: 10-100x for repeat runs
Key Reminders
Reranking Best Practices:
- •Retrieve more candidates than you need (2x to 10x the final top-k), then rerank down to top-k
- •Use multilingual models for Thai content
- •Combine with hybrid search for best results
Caching Cautions:
- •Clear cache when changing embedding models
- •Clear cache when updating document schema
- •Monitor cache size growth
Production Essentials:
- •Implement retry logic and error handling
- •Monitor latency, hit rate, costs
- •Use distributed caching (Redis)
- •Enable async operations for parallel retrieval
Scripts
This skill includes utility scripts in the scripts/ directory:
validate_config.py
Validates RAG configuration before deployment:
python .claude/skills/optimizing-rag/scripts/validate_config.py \
--config-file ./config.yaml
Checks:
- •Chunk size appropriate for domain
- •Embedding model consistency
- •Top-k values reasonable
- •Reranker configuration
benchmark_performance.py
Measures retrieval performance:
python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
--index-path ./index \
--queries-file ./test_queries.txt
Reports:
- •Retrieval latency (p50, p95, p99)
- •Throughput (queries/second)
- •Memory usage
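For reference, latency percentiles and throughput of this kind can be computed with the standard library alone. The sketch below is not taken from benchmark_performance.py; it assumes queries run one after another and uses synthetic latencies.

```python
import statistics
from typing import Dict, List


def latency_report(latencies_ms: List[float]) -> Dict[str, float]:
    """p50/p95/p99 using inclusive interpolation over the observed samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def throughput_qps(latencies_ms: List[float]) -> float:
    """Queries per second, assuming queries are executed sequentially."""
    total_seconds = sum(latencies_ms) / 1000.0
    return len(latencies_ms) / total_seconds


# Synthetic latencies: 1 ms, 2 ms, ..., 101 ms
latencies = [float(i) for i in range(1, 102)]
report = latency_report(latencies)
qps = throughput_qps(latencies)
print(report, round(qps, 2))
```

Tail percentiles (p95, p99) are usually more informative than the mean for retrieval, because a few slow queries dominate user-perceived latency.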
Next Steps
After optimizing:
- •Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
- •Monitor: Set up continuous evaluation in production
- •Iterate: Use metrics to guide further optimizations