RAG Implementation Patterns

Retrieval-Augmented Generation patterns for building reliable, contextual chatbots with embeddings and vector search.

When to Apply

Use this skill when:

•Implementing document ingestion and chunking
•Setting up vector stores (SQLite, in-memory, Pinecone, etc.)
•Implementing semantic search with embeddings
•Preventing LLM hallucinations through context grounding
•Optimizing retrieval performance and accuracy
•Building knowledge-based chatbots

Key Patterns

1. Semantic Chunking Strategy (CRITICAL)

Pattern: Chunk by semantic boundaries with overlap for context preservation

typescript

// lib/rag/chunking.ts
import { v4 as uuidv4 } from 'uuid'

export interface ChunkMetadata {
  section?: string
  title?: string
  wordCount?: number
}

export interface Chunk {
  id: string
  text: string
  source_file: string
  metadata: ChunkMetadata
}

/**
 * Chunk document by semantic boundaries (headers, paragraphs)
 * with overlap to preserve context
 */
export function chunkDocument(
  content: string,
  filename: string,
  options = {
    chunkSize: 500,      // words per chunk
    overlapSize: 50,     // overlapping words
    splitByHeaders: true // use ## headers as boundaries
  }
): Chunk[] {
  const chunks: Chunk[] = []
  
  if (options.splitByHeaders) {
    // Split on markdown headers (## or ###). Because the regex has a capture
    // group, split() returns [preamble, title1, body1, title2, body2, ...].
    const parts = content.split(/^#{2,3}\s+(.+)$/m)

    // Text before the first header is filed under 'Introduction'
    const sections = [{ title: 'Introduction', body: parts[0] ?? '' }]
    for (let i = 1; i < parts.length; i += 2) {
      sections.push({ title: parts[i].trim(), body: parts[i + 1] ?? '' })
    }

    for (const { title, body } of sections) {
      const sectionContent = body.trim()
      if (!sectionContent) continue

      // Chunk each section
      const sectionChunks = chunkText(sectionContent, {
        chunkSize: options.chunkSize,
        overlapSize: options.overlapSize
      })

      // Add metadata
      sectionChunks.forEach(text => {
        chunks.push({
          id: uuidv4(),
          text,
          source_file: filename,
          metadata: {
            section: title,
            wordCount: text.split(/\s+/).length
          }
        })
      })
    }
  } else {
    // Simple paragraph-based chunking
    const allChunks = chunkText(content, options)
    allChunks.forEach(text => {
      chunks.push({
        id: uuidv4(),
        text,
        source_file: filename,
        metadata: { wordCount: text.split(/\s+/).length }
      })
    })
  }
  
  return chunks
}

/**
 * Chunk text with sliding window overlap
 */
function chunkText(
  text: string,
  options: { chunkSize: number; overlapSize: number }
): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  
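  // Each window starts (chunkSize - overlapSize) words after the previous one
  const step = options.chunkSize - options.overlapSize
  if (step <= 0) {
    throw new Error('overlapSize must be smaller than chunkSize')
  }

  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + options.chunkSize).join(' '))
    // Stop once a window reaches the end; another window would only repeat the tail
    if (i + options.chunkSize >= words.length) break
  }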
  
  return chunks
}
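
To make the window arithmetic concrete: each chunk starts `chunkSize - overlapSize` words after the previous one, so with the defaults (500 and 50) chunks start at words 0, 450, 900, and so on. The sketch below is a standalone illustration; `windowStarts` is a hypothetical helper, not part of the module above, and it stops as soon as a window reaches the end of the text.

```typescript
// Hypothetical helper, for illustration only: returns the word offsets at
// which each sliding-window chunk starts.
function windowStarts(
  totalWords: number,
  chunkSize: number,
  overlapSize: number
): number[] {
  const step = chunkSize - overlapSize
  if (step <= 0) {
    throw new Error('overlapSize must be smaller than chunkSize')
  }
  const starts: number[] = []
  for (let i = 0; i < totalWords; i += step) {
    starts.push(i)
    if (i + chunkSize >= totalWords) break // this window already reaches the end
  }
  return starts
}

// 1,200 words with the defaults: windows cover [0, 500), [450, 950), [900, 1200)
console.log(windowStarts(1200, 500, 50)) // [ 0, 450, 900 ]
```

A larger overlap preserves more context across chunk boundaries but produces more chunks, and therefore more embeddings to generate and store.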

2. Embedding Generation (CRITICAL)

Pattern: Batch embedding requests to OpenAI to reduce round trips and stay within rate limits

typescript

// lib/rag/embeddings.ts
import { OpenAI } from 'openai'

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
})

const EMBEDDING_MODEL = 'text-embedding-3-small' // 1536 dimensions, cheap
const BATCH_SIZE = 100 // Texts per request; a conservative choice, not the API maximum

/**
 * Generate embeddings for multiple texts in batches
 */
export async function generateEmbeddings(
  texts: string[]
): Promise<number[][]> {
  const embeddings: number[][] = []
  
  // Process in batches to respect API limits
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE)
    
    console.log(`Generating embeddings for batch ${i / BATCH_SIZE + 1}...`)
    
    try {
      const response = await openai.embeddings.create({
        model: EMBEDDING_MODEL,
        input: batch
      })
      
      // Extract embeddings in correct order
      const batchEmbeddings = response.data
        .sort((a, b) => a.index - b.index)
        .map(item => item.embedding)
      
      embeddings.push(...batchEmbeddings)
      
      // Rate limiting: wait between batches
      if (i + BATCH_SIZE < texts.length) {
        await new Promise(resolve => setTimeout(resolve, 100))
      }
      
    } catch (error) {
      console.error(`Failed to generate embeddings for batch ${i / BATCH_SIZE + 1}:`, error)
      throw error
    }
  }
  
  return embeddings
}

/**
 * Generate embedding for single text (for queries)
 */
export async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: text
  })
  
  return response.data[0].embedding
}
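
The batch loop above rethrows on the first failed request, which aborts the whole ingestion run. One possible mitigation is to retry with exponential backoff. The following is a minimal sketch under stated assumptions: `withRetry`, `maxAttempts`, and `baseDelayMs` are hypothetical names, and the helper retries on any error, whereas a production version would likely retry only transient failures such as rate limits.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = error
      if (attempt === maxAttempts) break
      // The delay doubles after each failed attempt: baseDelayMs, 2x, 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1)
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
  // All attempts failed: surface the last error to the caller
  throw lastError
}
```

In `generateEmbeddings`, the call inside the loop could then be wrapped as `withRetry(() => openai.embeddings.create({ model: EMBEDDING_MODEL, input: batch }))`.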

3. Vector Store (SQLite) (HIGH)

Pattern: Local SQLite database with JSON storage for embeddings

typescript

// lib/rag/store.ts
import Database from 'better-sqlite3'
import path from 'path'
import fs from 'fs'
import type { Chunk } from './chunking'

const DB_PATH = path.join(process.cwd(), 'data', 'vector_store.db')

export interface StoredChunk {
  id: string
  text: string
  source_file: string
  embedding: number[]
  metadata: string // JSON
}

/**
 * Initialize SQLite database
 */
export function initDB(): Database.Database {
  // Ensure data directory exists
  const dataDir = path.dirname(DB_PATH)
  if (!fs.existsSync(dataDir)) {
    fs.mkdirSync(dataDir, { recursive: true })
  }
  
  const db = new Database(DB_PATH)
  
  // Create table
  db.exec(`
    CREATE TABLE IF NOT EXISTS chunks (
      id TEXT PRIMARY KEY,
      text TEXT NOT NULL,
      source_file TEXT NOT NULL,
      embedding_json TEXT NOT NULL,
      metadata_json TEXT NOT NULL,
      created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
  `)
  
  // Create index on source_file for faster queries
  db.exec(`
    CREATE INDEX IF NOT EXISTS idx_source_file 
    ON chunks(source_file)
  `)
  
  return db
}

/**
 * Insert chunk with embedding
 */
export function insertChunk(
  db: Database.Database,
  chunk: Chunk,
  embedding: number[]
): void {
  const stmt = db.prepare(`
    INSERT INTO chunks (id, text, source_file, embedding_json, metadata_json)
    VALUES (?, ?, ?, ?, ?)
  `)
  
  stmt.run(
    chunk.id,
    chunk.text,
    chunk.source_file,
    JSON.stringify(embedding),
    JSON.stringify(chunk.metadata)
  )
}

/**
 * Get all chunks (for similarity search)
 */
export function getAllChunks(db: Database.Database): StoredChunk[] {
  const stmt = db.prepare(`
    SELECT id, text, source_file, embedding_json, metadata_json
    FROM chunks
  `)
  
  const rows = stmt.all() as any[]
  
  return rows.map(row => ({
    id: row.id,
    text: row.text,
    source_file: row.source_file,
    embedding: JSON.parse(row.embedding_json),
    metadata: row.metadata_json
  }))
}

/**
 * Clear all chunks (for re-ingestion)
 */
export function clearDB(db: Database.Database): void {
  db.exec('DELETE FROM chunks')
}

/**
 * Get chunk count
 */
export function getChunkCount(db: Database.Database): number {
  const result = db.prepare('SELECT COUNT(*) as count FROM chunks').get() as any
  return result.count
}
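
Because embeddings and metadata are stored as JSON text, a corrupted or hand-edited row only surfaces later as a confusing error in the similarity code. The sketch below shows one way to validate a row when reading it instead of casting through `any`. `ChunkRow` and `parseChunkRow` are hypothetical names introduced for illustration; the column names match the `chunks` table above.

```typescript
// Shape of a raw row as selected from the chunks table
interface ChunkRow {
  id: string
  text: string
  source_file: string
  embedding_json: string
  metadata_json: string
}

// Hypothetical helper: parses and validates the JSON columns of one row.
function parseChunkRow(row: ChunkRow) {
  const embedding: unknown = JSON.parse(row.embedding_json)
  if (
    !Array.isArray(embedding) ||
    !embedding.every(value => typeof value === 'number' && Number.isFinite(value))
  ) {
    throw new Error(`Chunk ${row.id}: embedding_json is not an array of finite numbers`)
  }

  const metadata: unknown = JSON.parse(row.metadata_json)
  if (typeof metadata !== 'object' || metadata === null || Array.isArray(metadata)) {
    throw new Error(`Chunk ${row.id}: metadata_json is not a JSON object`)
  }

  return {
    id: row.id,
    text: row.text,
    source_file: row.source_file,
    embedding: embedding as number[],
    metadata: metadata as Record<string, unknown>
  }
}
```

A function like this could replace the inline mapping in `getAllChunks`, at the cost of returning parsed metadata rather than the raw JSON string.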

4. Semantic Search (CRITICAL)

Pattern: Cosine similarity with top-K retrieval

typescript

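// lib/rag/search.ts
import Database from 'better-sqlite3'
import { generateEmbedding } from './embeddings'
import { getAllChunks, initDB } from './store'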

export interface RetrievalResult {
  text: string
  source_file: string
  similarity: number
  metadata?: any
}

/**
 * Calculate cosine similarity between two vectors
 */
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have same length')
  }
  
  let dotProduct = 0
  let normA = 0
  let normB = 0
  
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  
  const denominator = Math.sqrt(normA) * Math.sqrt(normB)
  
  if (denominator === 0) return 0
  
  return dotProduct / denominator
}

/**
 * Search for most similar chunks
 */
export async function searchSimilar(
  db: Database.Database,
  queryEmbedding: number[],
  topK: number = 3,
  minSimilarity: number = 0.5
): Promise<RetrievalResult[]> {
  // Get all chunks
  const chunks = getAllChunks(db)
  
  // Calculate similarities
  const scored = chunks.map(chunk => ({
    text: chunk.text,
    source_file: chunk.source_file,
    similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
    metadata: chunk.metadata
  }))
  
  // Filter by minimum similarity and sort
  const results = scored
    .filter(result => result.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK)
  
  return results
}

/**
 * Retrieve relevant context for a query
 */
export async function retrieve(
  query: string,
  topK: number = 3
): Promise<RetrievalResult[]> {
  const db = initDB()
  
  try {
    // Generate query embedding
    const queryEmbedding = await generateEmbedding(query)
    
    // Search for similar chunks
    const results = await searchSimilar(db, queryEmbedding, topK)
    
    return results
  } finally {
    db.close()
  }
}
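
A small worked example may help show how the threshold and `topK` interact. The sketch below uses toy two-dimensional vectors rather than real embeddings, and `cosine` and `topKSimilar` are compact stand-ins written for this illustration, not the functions above.

```typescript
type ToyChunk = { id: string; embedding: number[] }

// Compact cosine similarity, equivalent in intent to cosineSimilarity above
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB)
  return denominator === 0 ? 0 : dot / denominator
}

// Score every chunk, drop those below the threshold, keep the best topK
function topKSimilar(
  query: number[],
  chunks: ToyChunk[],
  topK: number,
  minSimilarity: number
): { id: string; similarity: number }[] {
  return chunks
    .map(chunk => ({ id: chunk.id, similarity: cosine(query, chunk.embedding) }))
    .filter(result => result.similarity >= minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, topK)
}

const toyChunks: ToyChunk[] = [
  { id: 'a', embedding: [1, 0] },  // similarity 1.0 (same direction)
  { id: 'b', embedding: [1, 1] },  // similarity ~0.707
  { id: 'c', embedding: [0, 1] },  // similarity 0.0 (orthogonal), filtered out
  { id: 'd', embedding: [-1, 0] }, // similarity -1.0 (opposite), filtered out
  { id: 'e', embedding: [3, 4] }   // similarity 0.6, passes the filter but is cut by topK
]

const toyResults = topKSimilar([1, 0], toyChunks, 2, 0.5)
console.log(toyResults.map(r => r.id)) // [ 'a', 'b' ]
```

Note that this approach scores every stored chunk on each query, which is simple and adequate for small knowledge bases but grows linearly with the number of chunks.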

5. Anti-Hallucination System Prompt (CRITICAL)

Pattern: Strict constraints with explicit "I don't know" instructions

typescript

// lib/llm/prompts.ts
import type { RetrievalResult } from '@/lib/rag/search'

export const SYSTEM_PROMPT = `
You are an expert assistant for Camaral, a platform for digital humans and AI avatars.

CRITICAL RULES - YOU MUST ALWAYS FOLLOW THESE RULES:

1. THE CONTEXT IS YOUR ONLY SOURCE OF TRUTH
   - ONLY answer based on the CONTEXT provided
   - Do NOT use general knowledge or external information
   - If something is not in the context, say "I don't have information about that"

2. NEVER MAKE THINGS UP
   - Do NOT invent prices, costs, or subscription plans
   - Do NOT mention customers that are not in the context
   - Do NOT invent metrics, statistics, or numbers
   - Do NOT invent integrations or technical features
   - Do NOT promise functionality that is not mentioned

3. BE TRANSPARENT WHEN YOU DON'T KNOW
   - If you don't have enough information, say so explicitly
   - Example: "I don't have information about [topic] in my knowledge base"
   - Suggest where the user can find more information (website, contact)

4. CITATIONS AND ATTRIBUTION
   - When possible, mention the source: "According to [document name]..."
   - This builds trust and allows verification

5. TONE AND STYLE
   - Professional, clear, and trustworthy
   - Prioritize clarity over length
   - Concise but complete answers
   - Accessible language, without excessive technical or sales jargon

6. APPROPRIATE REDIRECTION
   - If asked about something outside the scope of Camaral, redirect politely
   - Example: "I am an assistant specialized in Camaral. For that question..."

REMEMBER: It is better to say "I don't know" than to invent incorrect information.
`.trim()

/**
 * Build prompt with retrieved context
 */
export function buildPromptWithContext(
  chunks: RetrievalResult[],
  question: string
): string {
  // Format context from retrieved chunks, labelling each with its source file
  const context = chunks
    .map((chunk, i) => `
[Source ${i + 1}: ${chunk.source_file}]
${chunk.text}
    `.trim())
    .join('\n\n---\n\n')
  
  return `${SYSTEM_PROMPT}

═══════════════════════════════════════════════════════
CONTEXT PROVIDED:
═══════════════════════════════════════════════════════

${context}

═══════════════════════════════════════════════════════
USER QUESTION:
═══════════════════════════════════════════════════════

${question}

═══════════════════════════════════════════════════════
YOUR ANSWER (based ONLY on the context):
═══════════════════════════════════════════════════════
`
}
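
The prompt builder concatenates every retrieved chunk, so long chunks can inflate the prompt. One possible safeguard is to cap the context before building the prompt. The sketch below is an assumption-laden illustration: `fitToBudget` and `maxChars` are hypothetical, and character count is used only as a rough proxy for tokens.

```typescript
interface ContextChunk {
  text: string
  source_file: string
}

// Hypothetical helper: keeps chunks in ranked order until the character
// budget would be exceeded, then stops.
function fitToBudget<T extends ContextChunk>(chunks: T[], maxChars: number): T[] {
  const selected: T[] = []
  let used = 0
  for (const chunk of chunks) {
    if (used + chunk.text.length > maxChars) break
    selected.push(chunk)
    used += chunk.text.length
  }
  return selected
}
```

Since retrieval results arrive sorted by similarity, stopping at the first chunk that does not fit drops the least relevant material first. A call such as `buildPromptWithContext(fitToBudget(chunks, 6000), question)` would apply it, where 6000 is an arbitrary example value.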

6. Complete RAG Pipeline (HIGH)

Pattern: End-to-end retrieval-augmented generation

typescript

// app/api/chat/route.ts
import { NextRequest, NextResponse } from 'next/server'
import { retrieve } from '@/lib/rag/search'
import { buildPromptWithContext } from '@/lib/llm/prompts'
import { OpenAI } from 'openai'

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
})

export async function POST(req: NextRequest) {
  try {
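    // Default history to an empty array so the first message of a chat works
    const { message, history = [] } = await req.json()

    if (typeof message !== 'string' || !message.trim()) {
      return NextResponse.json({ error: 'message is required' }, { status: 400 })
    }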
    
    // 1. Retrieve relevant chunks from knowledge base
    const chunks = await retrieve(message, 3)
    
    console.log(`Retrieved ${chunks.length} chunks with similarities:`, 
      chunks.map(c => c.similarity.toFixed(3))
    )
    
    // 2. Build prompt with context
    const systemPrompt = buildPromptWithContext(chunks, message)
    
    // 3. Generate response with LLM
    const completion = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: systemPrompt },
        ...history.slice(-10), // Last 10 messages for context
        { role: 'user', content: message }
      ],
      temperature: 0.3,
      max_tokens: 800
    })
    
    const response = completion.choices[0].message.content
    
    // 4. Extract unique sources for attribution
    const sources = [...new Set(chunks.map(c => c.source_file))]
    
    // 5. Return response with sources
    return NextResponse.json({
      response,
      sources,
      metadata: {
        model: 'gpt-4o-mini',
        chunks_used: chunks.length,
        // Guard against division by zero when no chunk passed the threshold
        avg_similarity: chunks.length > 0
          ? chunks.reduce((sum, c) => sum + c.similarity, 0) / chunks.length
          : null
      }
    })
    
  } catch (error) {
    console.error('RAG pipeline error:', error)
    return NextResponse.json(
      { error: 'Failed to generate response' },
      { status: 500 }
    )
  }
}
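
The route above forwards `history` from the request body straight into the model call, so a client could inject its own `system` messages and weaken the anti-hallucination rules. A minimal sketch of one way to guard against that follows; `sanitizeHistory` and `ChatMessage` are hypothetical names, and the limit of 10 mirrors the `slice(-10)` in the route.

```typescript
type ChatRole = 'user' | 'assistant'

interface ChatMessage {
  role: ChatRole
  content: string
}

// Hypothetical helper: keeps only well-formed user/assistant messages and
// returns the most recent maxMessages of them.
function sanitizeHistory(history: unknown, maxMessages = 10): ChatMessage[] {
  if (!Array.isArray(history)) return []

  const valid: ChatMessage[] = []
  for (const item of history) {
    if (typeof item !== 'object' || item === null) continue
    const { role, content } = item as { role?: unknown; content?: unknown }
    if ((role === 'user' || role === 'assistant') && typeof content === 'string') {
      valid.push({ role, content })
    }
  }
  return valid.slice(-maxMessages)
}
```

In the route, `...history.slice(-10)` could then become `...sanitizeHistory(history)`.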

Anti-Patterns

❌ Don't: Use keyword search instead of semantic search

typescript

// BAD: Simple string matching
const relevantChunks = allChunks.filter(chunk =>
  chunk.text.toLowerCase().includes(query.toLowerCase())
)

✅ Do: Use semantic embeddings

typescript

// GOOD: Semantic similarity
const queryEmbedding = await generateEmbedding(query)
const relevantChunks = await searchSimilar(db, queryEmbedding, topK)

❌ Don't: Send all documents as context

typescript

// BAD: Context too large, expensive
const allDocs = readAllMarkdownFiles()
const prompt = `Context: ${allDocs.join('\n\n')}\nQuestion: ${query}`

✅ Do: Retrieve only relevant chunks

typescript

// GOOD: Targeted, cost-effective
const relevantChunks = await retrieve(query, 3)
const prompt = buildPromptWithContext(relevantChunks, query)

Performance Tips

•Batch embeddings - Send up to 100 texts per API call instead of one request per text
•Use a smaller model - text-embedding-3-small is inexpensive and usually accurate enough
•Cache query embeddings - Reuse the embedding when the same query repeats
•Limit topK - 3-5 chunks are usually sufficient
•Add a similarity threshold - Filter out chunks with similarity below 0.5
•Re-index on change - Re-ingest whenever the knowledge base changes
•Monitor costs - Log embedding API calls
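
The query-embedding cache from the list above can be sketched as a small wrapper around any embedding function. This is an in-memory illustration under stated assumptions: `withEmbeddingCache`, `EmbedFn`, and `maxEntries` are hypothetical, the cache is per-process, and keys are the trimmed query text.

```typescript
type EmbedFn = (text: string) => Promise<number[]>

// Hypothetical wrapper: returns a function that caches successful results.
function withEmbeddingCache(embed: EmbedFn, maxEntries = 500): EmbedFn {
  const cache = new Map<string, number[]>()

  return async (text: string) => {
    const key = text.trim()
    const cached = cache.get(key)
    if (cached) return cached

    // Only successful results are stored, so failures are retried next time
    const embedding = await embed(key)
    cache.set(key, embedding)

    // Map preserves insertion order, so the first key is the oldest entry
    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value
      if (oldest !== undefined) cache.delete(oldest)
    }
    return embedding
  }
}
```

It could be applied as `const cachedEmbedding = withEmbeddingCache(generateEmbedding)`. Two identical queries arriving at the same moment may both reach the API, which is a trade-off accepted here for simplicity.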

Testing

typescript

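// Assumes a Jest- or Vitest-style runner with global describe/it/expect
import { chunkDocument } from '@/lib/rag/chunking'
import { cosineSimilarity } from '@/lib/rag/search'

// Test chunking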
describe('chunkDocument', () => {
  it('should split by headers', () => {
    const content = '## Section 1\nContent...\n## Section 2\nMore...'
    const chunks = chunkDocument(content, 'test.md')
    expect(chunks.length).toBeGreaterThan(0)
    expect(chunks[0].metadata.section).toBe('Section 1')
  })
})

// Test similarity
describe('cosineSimilarity', () => {
  it('should return 1 for identical vectors', () => {
    const v = [1, 2, 3]
    expect(cosineSimilarity(v, v)).toBeCloseTo(1)
  })
  
  it('should return 0 for orthogonal vectors', () => {
    expect(cosineSimilarity([1, 0], [0, 1])).toBeCloseTo(0)
  })
})
