RAG Implementation Patterns
Retrieval-Augmented Generation patterns for building reliable, contextual chatbots with embeddings and vector search.
When to Apply
Use this skill when:
- •Implementing document ingestion and chunking
- •Setting up vector stores (SQLite, in-memory, Pinecone, etc.)
- •Implementing semantic search with embeddings
- •Preventing LLM hallucinations through context grounding
- •Optimizing retrieval performance and accuracy
- •Building knowledge-based chatbots
Key Patterns
1. Semantic Chunking Strategy (CRITICAL)
Pattern: Chunk by semantic boundaries with overlap for context preservation
typescript
// lib/rag/chunking.ts
import { v4 as uuidv4 } from 'uuid'
export interface ChunkMetadata {
section?: string
title?: string
wordCount?: number
}
export interface Chunk {
id: string
text: string
source_file: string
metadata: ChunkMetadata
}
/**
* Chunk document by semantic boundaries (headers, paragraphs)
* with overlap to preserve context
*/
export function chunkDocument(
content: string,
filename: string,
options = {
chunkSize: 500, // words per chunk
overlapSize: 50, // overlapping words
splitByHeaders: true // use ## headers as boundaries
}
): Chunk[] {
const chunks: Chunk[] = []
if (options.splitByHeaders) {
// Split by markdown headers (## or ###)
const sections = content.split(/^#{2,3}\s+(.+)$/m)
for (let i = 0; i < sections.length; i += 2) {
const title = sections[i]?.trim() || 'Introduction'
const sectionContent = sections[i + 1]?.trim() || ''
if (!sectionContent) continue
// Chunk each section
const sectionChunks = chunkText(sectionContent, {
chunkSize: options.chunkSize,
overlapSize: options.overlapSize
})
// Add metadata
sectionChunks.forEach(text => {
chunks.push({
id: uuidv4(),
text,
source_file: filename,
metadata: {
section: title,
wordCount: text.split(/\s+/).length
}
})
})
}
} else {
// Simple paragraph-based chunking
const allChunks = chunkText(content, options)
allChunks.forEach(text => {
chunks.push({
id: uuidv4(),
text,
source_file: filename,
metadata: { wordCount: text.split(/\s+/).length }
})
})
}
return chunks
}
/**
* Chunk text with sliding window overlap
*/
function chunkText(
text: string,
options: { chunkSize: number; overlapSize: number }
): string[] {
const words = text.split(/\s+/).filter(Boolean)
const chunks: string[] = []
for (let i = 0; i < words.length; i += options.chunkSize - options.overlapSize) {
const chunk = words.slice(i, i + options.chunkSize).join(' ')
if (chunk.trim()) {
chunks.push(chunk)
}
}
return chunks
}
2. Embedding Generation (CRITICAL)
Pattern: Batch embeddings with OpenAI for efficiency and cost optimization
typescript
// lib/rag/embeddings.ts
import { OpenAI } from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
})
const EMBEDDING_MODEL = 'text-embedding-3-small' // 1536 dimensions, cheap
const BATCH_SIZE = 100 // Max embeddings per request
/**
* Generate embeddings for multiple texts in batches
*/
export async function generateEmbeddings(
texts: string[]
): Promise<number[][]> {
const embeddings: number[][] = []
// Process in batches to respect API limits
for (let i = 0; i < texts.length; i += BATCH_SIZE) {
const batch = texts.slice(i, i + BATCH_SIZE)
console.log(`Generating embeddings for batch ${i / BATCH_SIZE + 1}...`)
try {
const response = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: batch
})
// Extract embeddings in correct order
const batchEmbeddings = response.data
.sort((a, b) => a.index - b.index)
.map(item => item.embedding)
embeddings.push(...batchEmbeddings)
// Rate limiting: wait between batches
if (i + BATCH_SIZE < texts.length) {
await new Promise(resolve => setTimeout(resolve, 100))
}
} catch (error) {
console.error(`Failed to generate embeddings for batch ${i}:`, error)
throw error
}
}
return embeddings
}
/**
* Generate embedding for single text (for queries)
*/
export async function generateEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: text
})
return response.data[0].embedding
}
3. Vector Store (SQLite) (HIGH)
Pattern: Local SQLite database with JSON storage for embeddings
typescript
// lib/rag/store.ts
import Database from 'better-sqlite3'
import path from 'path'
import fs from 'fs'
const DB_PATH = path.join(process.cwd(), 'data', 'vector_store.db')
export interface StoredChunk {
id: string
text: string
source_file: string
embedding: number[]
metadata: string // JSON
}
/**
* Initialize SQLite database
*/
export function initDB(): Database.Database {
// Ensure data directory exists
const dataDir = path.dirname(DB_PATH)
if (!fs.existsSync(dataDir)) {
fs.mkdirSync(dataDir, { recursive: true })
}
const db = new Database(DB_PATH)
// Create table
db.exec(`
CREATE TABLE IF NOT EXISTS chunks (
id TEXT PRIMARY KEY,
text TEXT NOT NULL,
source_file TEXT NOT NULL,
embedding_json TEXT NOT NULL,
metadata_json TEXT NOT NULL,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
)
`)
// Create index on source_file for faster queries
db.exec(`
CREATE INDEX IF NOT EXISTS idx_source_file
ON chunks(source_file)
`)
return db
}
/**
* Insert chunk with embedding
*/
export function insertChunk(
db: Database.Database,
chunk: Chunk,
embedding: number[]
): void {
const stmt = db.prepare(`
INSERT INTO chunks (id, text, source_file, embedding_json, metadata_json)
VALUES (?, ?, ?, ?, ?)
`)
stmt.run(
chunk.id,
chunk.text,
chunk.source_file,
JSON.stringify(embedding),
JSON.stringify(chunk.metadata)
)
}
/**
* Get all chunks (for similarity search)
*/
export function getAllChunks(db: Database.Database): StoredChunk[] {
const stmt = db.prepare(`
SELECT id, text, source_file, embedding_json, metadata_json
FROM chunks
`)
const rows = stmt.all() as any[]
return rows.map(row => ({
id: row.id,
text: row.text,
source_file: row.source_file,
embedding: JSON.parse(row.embedding_json),
metadata: row.metadata_json
}))
}
/**
* Clear all chunks (for re-ingestion)
*/
export function clearDB(db: Database.Database): void {
db.exec('DELETE FROM chunks')
}
/**
* Get chunk count
*/
export function getChunkCount(db: Database.Database): number {
const result = db.prepare('SELECT COUNT(*) as count FROM chunks').get() as any
return result.count
}
4. Semantic Search (CRITICAL)
Pattern: Cosine similarity with top-K retrieval
typescript
// lib/rag/search.ts
export interface RetrievalResult {
text: string
source_file: string
similarity: number
metadata?: any
}
/**
* Calculate cosine similarity between two vectors
*/
export function cosineSimilarity(a: number[], b: number[]): number {
if (a.length !== b.length) {
throw new Error('Vectors must have same length')
}
let dotProduct = 0
let normA = 0
let normB = 0
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i]
normA += a[i] * a[i]
normB += b[i] * b[i]
}
const denominator = Math.sqrt(normA) * Math.sqrt(normB)
if (denominator === 0) return 0
return dotProduct / denominator
}
/**
* Search for most similar chunks
*/
export async function searchSimilar(
db: Database.Database,
queryEmbedding: number[],
topK: number = 3,
minSimilarity: number = 0.5
): Promise<RetrievalResult[]> {
// Get all chunks
const chunks = getAllChunks(db)
// Calculate similarities
const scored = chunks.map(chunk => ({
text: chunk.text,
source_file: chunk.source_file,
similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
metadata: chunk.metadata
}))
// Filter by minimum similarity and sort
const results = scored
.filter(result => result.similarity >= minSimilarity)
.sort((a, b) => b.similarity - a.similarity)
.slice(0, topK)
return results
}
/**
* Retrieve relevant context for a query
*/
export async function retrieve(
query: string,
topK: number = 3
): Promise<RetrievalResult[]> {
const db = initDB()
try {
// Generate query embedding
const queryEmbedding = await generateEmbedding(query)
// Search for similar chunks
const results = await searchSimilar(db, queryEmbedding, topK)
return results
} finally {
db.close()
}
}
5. Anti-Hallucination System Prompt (CRITICAL)
Pattern: Strict constraints with explicit "I don't know" instructions
typescript
// lib/llm/prompts.ts
export const SYSTEM_PROMPT = `
Eres un asistente experto en Camaral, plataforma de humanos digitales y avatares de IA.
REGLAS CRÍTICAS - DEBES SEGUIR ESTAS REGLAS SIEMPRE:
1. CONTEXTO ES TU ÚNICA FUENTE DE VERDAD
- SOLO responde basándote en el CONTEXTO proporcionado
- NO uses conocimiento general o información externa
- Si algo no está en el contexto, di "No tengo información sobre eso"
2. PROHIBIDO INVENTAR
- NO inventes precios, costos o planes de suscripción
- NO menciones clientes que no estén en el contexto
- NO inventes métricas, estadísticas o números
- NO inventes integraciones o características técnicas
- NO prometas funcionalidades no mencionadas
3. TRANSPARENCIA CUANDO NO SABES
- Si no tienes información suficiente, dilo explícitamente
- Ejemplo: "No cuento con información sobre [tema] en mi base de conocimiento"
- Sugiere dónde pueden obtener más información (página web, contacto)
4. CITAS Y ATRIBUCIÓN
- Cuando sea posible, menciona la fuente: "Según [nombre del documento]..."
- Esto genera confianza y permite verificación
5. TONO Y ESTILO
- Profesional, claro y confiable
- Prioriza claridad sobre longitud
- Respuestas concisas pero completas
- Lenguaje accesible, no técnico-comercial excesivo
6. REDIRECCIÓN APROPIADA
- Si preguntan fuera del contexto de Camaral, redirige amablemente
- Ejemplo: "Soy un asistente especializado en Camaral. Para esa pregunta..."
RECUERDA: Es mejor decir "No sé" que inventar información incorrecta.
`.trim()
/**
* Build prompt with retrieved context
*/
export function buildPromptWithContext(
chunks: RetrievalResult[],
question: string
): string {
// Format context from retrieved chunks
const context = chunks
.map((chunk, i) => `
[Fuente ${i + 1}: ${chunk.source_file}]
${chunk.text}
`.trim())
.join('\n\n---\n\n')
return `${SYSTEM_PROMPT}
═══════════════════════════════════════════════════════
CONTEXTO PROPORCIONADO:
═══════════════════════════════════════════════════════
${context}
═══════════════════════════════════════════════════════
PREGUNTA DEL USUARIO:
═══════════════════════════════════════════════════════
${question}
═══════════════════════════════════════════════════════
TU RESPUESTA (basada SOLO en el contexto):
═══════════════════════════════════════════════════════
`
}
6. Complete RAG Pipeline (HIGH)
Pattern: End-to-end retrieval-augmented generation
typescript
// app/api/chat/route.ts
import { NextRequest, NextResponse } from 'next/server'
import { retrieve } from '@/lib/rag/search'
import { buildPromptWithContext } from '@/lib/llm/prompts'
import { OpenAI } from 'openai'
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
})
export async function POST(req: NextRequest) {
try {
const { message, history } = await req.json()
// 1. Retrieve relevant chunks from knowledge base
const chunks = await retrieve(message, 3)
console.log(`Retrieved ${chunks.length} chunks with similarities:`,
chunks.map(c => c.similarity.toFixed(3))
)
// 2. Build prompt with context
const systemPrompt = buildPromptWithContext(chunks, message)
// 3. Generate response with LLM
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: systemPrompt },
...history.slice(-10), // Last 10 messages for context
{ role: 'user', content: message }
],
temperature: 0.3,
max_tokens: 800
})
const response = completion.choices[0].message.content
// 4. Extract unique sources for attribution
const sources = [...new Set(chunks.map(c => c.source_file))]
// 5. Return response with sources
return NextResponse.json({
response,
sources,
metadata: {
model: 'gpt-4o-mini',
chunks_used: chunks.length,
avg_similarity: chunks.reduce((sum, c) => sum + c.similarity, 0) / chunks.length
}
})
} catch (error) {
console.error('RAG pipeline error:', error)
return NextResponse.json(
{ error: 'Failed to generate response' },
{ status: 500 }
)
}
}
Anti-Patterns
❌ Don't: Use keyword search instead of semantic search
typescript
// BAD: Simple string matching const relevantChunks = allChunks.filter(chunk => chunk.text.toLowerCase().includes(query.toLowerCase()) )
✅ Do: Use semantic embeddings
typescript
// GOOD: Semantic similarity const queryEmbedding = await generateEmbedding(query) const relevantChunks = await searchSimilar(db, queryEmbedding, topK)
❌ Don't: Send all documents as context
typescript
// BAD: Context too large, expensive
const allDocs = readAllMarkdownFiles()
const prompt = `Context: ${allDocs.join('\n\n')}\nQuestion: ${query}`
✅ Do: Retrieve only relevant chunks
typescript
// GOOD: Targeted, cost-effective const relevantChunks = await retrieve(query, 3) const prompt = buildPromptWithContext(relevantChunks, query)
Performance Tips
- •Batch embeddings - Process 100 texts per API call
- •Use smaller model - text-embedding-3-small is cheap and effective
- •Cache query embeddings - Same queries → reuse embeddings
- •Limit topK - 3-5 chunks usually sufficient
- •Add similarity threshold - Filter out low-relevance chunks (< 0.5)
- •Index frequently - Re-ingest when knowledge base changes
- •Monitor costs - Log embedding API calls
Testing
typescript
// Test chunking
describe('chunkDocument', () => {
it('should split by headers', () => {
const content = '## Section 1\nContent...\n## Section 2\nMore...'
const chunks = chunkDocument(content, 'test.md')
expect(chunks.length).toBeGreaterThan(0)
expect(chunks[0].metadata.section).toBe('Section 1')
})
})
// Test similarity
describe('cosineSimilarity', () => {
it('should return 1 for identical vectors', () => {
const v = [1, 2, 3]
expect(cosineSimilarity(v, v)).toBeCloseTo(1)
})
it('should return 0 for orthogonal vectors', () => {
expect(cosineSimilarity([1, 0], [0, 1])).toBeCloseTo(0)
})
})