RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

Use this skill when

•Building Q&A systems over proprietary documents
•Creating chatbots with current, factual information
•Implementing semantic search with natural language queries
•Reducing hallucinations with grounded responses
•Enabling LLMs to access domain-specific knowledge
•Building documentation assistants
•Creating research tools with source citation

Do not use this skill when

•You only need purely generative writing without retrieval
•The dataset is too small to justify embeddings
•You cannot store or process the source data safely

Instructions

•Define the corpus, update cadence, and evaluation targets.
•Choose embedding models and vector store based on scale.
•Build ingestion, chunking, and retrieval with reranking.
•Evaluate with grounded QA metrics and monitor drift.
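
One way to make "evaluation targets" concrete is to score the retriever alone before judging generated answers. The sketch below computes recall@k and mean reciprocal rank over a small labeled set. The queries, the chunk ids, and the `eval_set` structure are hypothetical placeholders, and these two metrics cover only retrieval; answer groundedness would need separate checks.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled set: query -> (retrieved ids in rank order, relevant ids)
eval_set = {
    "How do I reset my password?": (["c7", "c2", "c9", "c4"], {"c2", "c4"}),
    "What is the refund policy?": (["c1", "c3", "c8", "c5"], {"c1"}),
}

k = 3
mean_recall = sum(recall_at_k(r, rel, k) for r, rel in eval_set.values()) / len(eval_set)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set.values()) / len(eval_set)
print(f"recall@{k} = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Re-running the same labeled set after each corpus or model change gives a simple baseline for spotting drift.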

Safety

•Redact sensitive data and enforce access controls.
•Avoid exposing source documents in responses when restricted.

Core Components

1. Vector Databases

Purpose: Store and retrieve document embeddings efficiently

Options:

•Pinecone: Managed, scalable, fast queries
•Weaviate: Open-source, hybrid search
•Milvus: High performance, on-premise
•Chroma: Lightweight, easy to use
•Qdrant: Fast, filtered search
•FAISS: Meta's similarity-search library (an in-process index rather than a database server), local deployment

2. Embeddings

Purpose: Convert text to numerical vectors for similarity search

Models:

•text-embedding-ada-002 (OpenAI): General purpose, 1536 dims (older model; text-embedding-3-small and text-embedding-3-large are its successors)
•all-MiniLM-L6-v2 (Sentence Transformers): Fast, lightweight
•e5-large-v2: High quality, English only (use multilingual-e5-large for multilingual text)
•Instructor: Task-specific instructions
•bge-large-en-v1.5: Strong English retrieval quality (rankings change quickly; check a current benchmark such as MTEB)
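
To show what "similarity search" over embeddings means, here is a minimal sketch of cosine similarity and a brute-force top-k lookup. The 3-dimensional vectors and document ids are made-up stand-ins for real model output, and a real vector store would replace the linear scan with an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumption: real models emit hundreds of dimensions)
index = {
    "billing": [0.9, 0.1, 0.0],
    "login": [0.1, 0.9, 0.1],
    "shipping": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.0]

results = top_k(query_vec, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```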

3. Retrieval Strategies

Approaches:

•Dense Retrieval: Semantic similarity via embeddings
•Sparse Retrieval: Keyword matching (BM25, TF-IDF)
•Hybrid Search: Combine dense + sparse
•Multi-Query: Generate multiple query variations and merge their results
•HyDE: Generate a hypothetical answer document with an LLM and retrieve using its embedding
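
For the sparse side, the following is a minimal sketch of BM25 scoring over whitespace tokens. It uses the common non-negative IDF variant with default `k1=1.5` and `b=0.75`; the corpus and query are invented, and production code would normally rely on a tested library with proper tokenization.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query tokens."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches to score high
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical corpus, tokenized by whitespace
corpus = [
    "reset your password from the login page",
    "refund policy for annual plans",
    "password rules and login security",
]
docs = [text.split() for text in corpus]
scores = bm25_scores("password reset".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(corpus[best])
```

Exact keyword matches such as product codes or error strings are where this kind of scoring tends to complement dense retrieval.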

4. Reranking

Purpose: Improve retrieval quality by reordering results

Methods:

•Cross-Encoders: Score each query-document pair jointly with a transformer (e.g., BERT-based)
•Cohere Rerank: API-based reranking
•Maximal Marginal Relevance (MMR): Diversity + relevance
•LLM-based: Use LLM to score relevance
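
As an illustration of the diversity and relevance trade-off in MMR, here is a minimal greedy sketch. The 2-dimensional vectors and document ids are invented, and `lambda_mult` is an assumed parameter name (1.0 means relevance only; lower values penalize near-duplicates more strongly).

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: List[float], candidates: Dict[str, List[float]], k: int,
        lambda_mult: float = 0.5) -> List[str]:
    """Greedily pick k ids, trading query relevance against similarity to earlier picks."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc_id: str) -> float:
            relevance = cosine(query_vec, candidates[doc_id])
            # Redundancy is the closest match among documents already selected
            redundancy = max(
                (cosine(candidates[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical candidates: two near-duplicates and one distinct document
candidates = {
    "reset_guide": [1.0, 0.1],
    "reset_guide_copy": [1.0, 0.12],
    "sso_setup": [0.5, -0.8],
}
query_vec = [1.0, 0.0]

diverse = mmr(query_vec, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, candidates, k=2, lambda_mult=1.0)
print(diverse)         # ['reset_guide', 'sso_setup']
print(relevance_only)  # ['reset_guide', 'reset_guide_copy']
```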

Quick Start

python

# Legacy LangChain import paths. Newer releases move these classes to
# langchain_community, langchain_openai, and langchain_text_splitters.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

# 5. Query
result = qa_chain({"query": "What are the main features?"})
print(result['result'])
print(result['source_documents'])
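
To make `chunk_size` and `chunk_overlap` in step 2 less abstract, the sketch below implements a plain fixed-window chunker. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as the 200-character overlap above: a sentence cut at a boundary still appears whole in at least one chunk.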

Advanced RAG Patterns

Pattern 1: Hybrid Search

python

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse retriever (BM25); requires the rank_bm25 package
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5  # number of documents returned per query

# Dense retriever (embeddings)
embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, embedding_retriever],
    weights=[0.3, 0.7]
)
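
To show how two ranked lists and a pair of weights can become one ranking, here is a minimal sketch of weighted Reciprocal Rank Fusion. LangChain's documentation describes `EnsembleRetriever` as using Reciprocal Rank Fusion; the code below is a standalone approximation rather than the library's implementation, and the constant `c=60` and the document ids are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(rankings: List[List[str]], weights: List[float], c: int = 60) -> List[str]:
    """Fuse ranked id lists: each list adds weight / (rank + c) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from the sparse and dense retrievers
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
print(fused)  # ['d3', 'd1', 'd4', 'd2']
```

Because fusion uses ranks rather than raw scores, BM25 scores and cosine similarities never need to be put on a common scale.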

Pattern 2: Multi-Query Retrieval

python

from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=OpenAI()
)

# Single query → multiple variations → combined results
results = retriever.get_relevant_documents("What is the main topic?")
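
The merge step behind "multiple variations → combined results" can be sketched as an order-preserving union. The query variants and the `canned_results` lookup below are hypothetical stand-ins for LLM-generated rewrites and a real retriever call.

```python
from typing import Callable, List


def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query variant and return the unique document ids in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical variants and the ids a retriever might return for each
canned_results = {
    "What is the main topic?": ["d1", "d2"],
    "What subject does the document cover?": ["d2", "d3"],
    "Summarize the document's central theme.": ["d4", "d1"],
}


def fake_retrieve(query: str) -> List[str]:
    return canned_results[query]


merged = multi_query_retrieve(list(canned_results), fake_retrieve)
print(merged)  # ['d1', 'd2', 'd3', 'd4']
```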

Pattern 3: Contextual Compression

python

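from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# LLM that extracts the query-relevant passages from each retrieved document
llm = OpenAI()
compressor = LLMChainExtractor.from_llm(llm)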

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

# Returns only relevant parts of documents
compressed_docs = compression_retriever.get_relevant_documents("query")
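
To illustrate the idea of keeping "only relevant parts", the sketch below filters sentences by word overlap with the query. This keyword filter is a deliberately crude stand-in for the LLM-based extractor above, not an equivalent; `compress` and the sample text are hypothetical.

```python
import re
from typing import Set


def words(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def compress(document: str, query: str, min_overlap: int = 1) -> str:
    """Keep only the sentences that share at least min_overlap words with the query."""
    query_terms = words(query)
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(query_terms & words(s)) >= min_overlap]
    return " ".join(kept)


document = (
    "Refunds are issued within 14 days. "
    "Our office is in Berlin. "
    "Annual plans get a prorated refund."
)
compressed = compress(document, "refund window days")
print(compressed)
```

The compressed text drops the unrelated office sentence, which is the effect the pattern aims for: less irrelevant context in the prompt.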

Pattern 4: Parent Document Retriever

python

from langchain.retrievers import ParentDocumentRetriever