Building RAG Pipelines

Goal

Create a RAG system that achieves >90% retrieval precision, supports iterative reasoning via tools, learns from usage patterns, and respects user privacy preferences.

When to Use

•Building an AI assistant that needs to answer questions from a document corpus
•Implementing a knowledge base with natural language query interface
•Adding AI-powered search to an existing application

Instructions

Step 1: Design the Storage Tiers

Implement tiered storage to optimize for different access patterns:

python

# Tier 0: Cold - Raw files on disk (archives, uploads)
# Tier 1: Warm - Chunked text in SQLite with metadata
# Tier 2: Hot - Vector embeddings in ChromaDB
# Tier 3: Cache - LRU in-memory for frequent chunks

class Chunk(db.Model):
    chunk_id = db.Column(db.String(64), primary_key=True)
    content = db.Column(db.Text, nullable=False)
    source_file = db.Column(db.String(500), index=True)
    source_type = db.Column(db.String(50), index=True)  # log, config, etc.
    artifact_category = db.Column(db.String(50), index=True)
    token_count = db.Column(db.Integer)

Step 2: Implement Hybrid Search

Combine dense (vector) and sparse (BM25) retrieval:

python

def hybrid_search(query: str, top_k: int = 10) -> list[Chunk]:
    # Dense: Semantic similarity via embeddings
    vector_results = collection.query(query_texts=[query], n_results=top_k * 2)
    
    # Sparse: Keyword matching via BM25
    bm25_results = bm25_index.search(query, top_k * 2)
    
    # Score fusion with RRF (Reciprocal Rank Fusion)
    return reciprocal_rank_fusion(vector_results, bm25_results, k=60)

Step 3: Add Cross-Encoder Reranking

Rerank candidates for precision:

python

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_chunks(query: str, chunks: list, top_k: int = 10) -> list:
    pairs = [(query, chunk['text']) for chunk in chunks]
    scores = reranker.predict(pairs)
    
    for chunk, score in zip(chunks, scores):
        chunk['cross_encoder_score'] = float(score)
    
    return sorted(chunks, key=lambda x: x['cross_encoder_score'], reverse=True)[:top_k]

Step 4: Implement Query Enhancement

Use LLM to expand queries for better recall:

python

def rewrite_query(query: str) -> str:
    prompt = f"""Expand this search query with related terms:
    Query: {query}
    
    Add synonyms, related concepts, and domain-specific terminology.
    Return expanded query as space-separated terms."""
    
    return llm.generate(prompt)

def generate_hyde_document(query: str) -> str:
    """Generate hypothetical document that would answer the query."""
    prompt = f"""Generate a document excerpt that would answer: {query}
    
    Write as if you're quoting from the actual source material."""
    
    return llm.generate(prompt)

Step 5: Extract and Index Entities

Enable entity-aware retrieval:

python

import re

PATTERNS = {
    'ipv4': re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    'filepath': re.compile(r'(?:/[\w.-]+)+'),
    'username': re.compile(r'user[=:\s]+(\w+)', re.IGNORECASE),
}

def extract_entities(text: str) -> list[Entity]:
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(
                entity_type=entity_type,
                value=match.group(),
                context=text[max(0, match.start()-50):match.end()+50]
            ))
    return entities

Step 6: Build Agentic RAG

Let the LLM decide what to search:

python

AGENT_TOOLS = [
    {"name": "search_chunks", "description": "Search documents"},
    {"name": "search_entity", "description": "Find by IP/user/file"},
    {"name": "traverse_graph", "description": "Explore relationships"},
    {"name": "final_answer", "description": "Provide final response"}
]

def agent_loop(query: str, max_iterations: int = 5):
    history = []
    for i in range(max_iterations):
        response = llm.generate(build_agent_prompt(query, history))
        tool, params = parse_tool_call(response)
        
        if tool == "final_answer":
            return params["answer"]
        
        result = execute_tool(tool, params)
        history.append({"action": tool, "result": result})

Step 7: Add Relevance Feedback

Learn from LLM usage patterns:

python

def record_usage(chunks: list, response: str, query: str):
    for chunk in chunks:
        # Detect if chunk was cited in response
        if chunk['source_file'] in response.lower():
            chunk_relevance.citation_count += 1
        # Detect content overlap
        elif phrase_overlap(chunk['text'], response) > 0.3:
            chunk_relevance.usage_count += 1
    
    # Update relevance score
    chunk_relevance.score = citations * 1.0 + usages * 0.5

Constraints

✅ Do

•DO: Use hybrid search (vector + BM25) for robustness
•DO: Apply cross-encoder reranking for precision
•DO: Extract entities at ingestion time (fast, deterministic)
•DO: Stream LLM responses for better UX
•DO: Track which chunks are actually used (relevance feedback)
•DO: Provide privacy warnings for cloud LLM providers

❌ Don't

•DON'T: Skip reranking — first-stage retrieval is noisy
•DON'T: Use fixed top_k — adapt to query complexity
•DON'T: Call LLM during entity extraction — too slow
•DON'T: Build entity graphs at query time — do it at ingestion
•DON'T: Ignore privacy — mark local vs cloud providers clearly
•DON'T: Hardcode chunking — allow overlap and context windows

Output Format

A complete RAG service should provide:

•ingest(files) → Chunk, embed, extract entities, build graph
•query(text) → Retrieve, rerank, generate response
•query_agent(text) → Iterative search with reasoning
•get_entities(type) → List extracted entities
•get_relevance_stats() → View learning progress

Dependencies

•../backend/scaffolding-flask/SKILL.md — API structure
•../database/designing-schemas/SKILL.md — Model design

References

Reference	Description
chunking-strategies.md	Document chunking patterns, token budgets, and overlap strategies
embedding-models.md	Model comparison, hybrid search, and BM25 integration
agentic-patterns.md	ReAct agent loops, tool design, and iterative reasoning
graph-rag.md	Entity relationship graphs, traversal algorithms, kill chain analysis
relevance-feedback.md	Learning from usage patterns, citation detection, score boosting